Sampling data from a large dataset using TABLESAMPLE
This blog post was authored by Jim Knicely.
A development or quality assurance team often times requests access to a sub-set of production data. One way to do that would be to make use of the LIMIT clause.
dbadmin=> SELECT COUNT(*) FROM big_number_table; COUNT ------------ 1000000000 (1 row) dbadmin=> SELECT 0.05*1000000000 "5_percent"; 5_percent ------------- 50000000.00 (1 row) dbadmin=> CREATE TABLE big_number_table_5_pct AS SELECT * FROM big_number_table LIMIT 50000000; CREATE TABLE Time: First fetch (0 rows): 13057.251 ms. All rows formatted: 13057.281 ms
The problem with that method is the team won’t get a very good sampling of the data. That is, they’ll be getting the first 5% of the rows that Vertica retrieves from the table.
In order to provide a true sampling of the entire data set, Vertica provides the TABLESAMPLE clause.
dbadmin=> CREATE TABLE big_number_table_5_better_pct AS SELECT * FROM big_number_table TABLESAMPLE(25); CREATE TABLE Time: First fetch (0 rows): 93910.763 ms. All rows formatted: 93910.795 ms
It takes a little longer to build the subset of data, but it will be an accurate representation of the production data!