The Vertica Forum recently got a makeover! Let us know what you think by filling out this short, anonymous survey.
Please take this survey to help us learn more about how you use third party tools. Your input is greatly appreciated!

Sampling data from a large dataset using TABLESAMPLE

This blog post was authored by Jim Knicely.

A development or quality assurance team often times requests access to a sub-set of production data. One way to do that would be to make use of the LIMIT clause.


dbadmin=> SELECT COUNT(*) FROM big_number_table;
(1 row)

dbadmin=> SELECT 0.05*1000000000 "5_percent";
(1 row)

dbadmin=> CREATE TABLE big_number_table_5_pct AS SELECT * FROM big_number_table LIMIT 50000000;
Time: First fetch (0 rows): 13057.251 ms. All rows formatted: 13057.281 ms

The problem with that method is the team won’t get a very good sampling of the data. That is, they’ll be getting the first 5% of the rows that Vertica retrieves from the table.

In order to provide a true sampling of the entire data set, Vertica provides the TABLESAMPLE clause.


dbadmin=> CREATE TABLE big_number_table_5_better_pct AS SELECT * FROM big_number_table TABLESAMPLE(25);
Time: First fetch (0 rows): 93910.763 ms. All rows formatted: 93910.795 ms

It takes a little longer to build the subset of data, but it will be an accurate representation of the production data!

Have fun!

Sign In or Register to comment.