Sampling data from a large dataset using TABLESAMPLE

This blog post was authored by Jim Knicely.

A development or quality assurance team often times requests access to a sub-set of production data. One way to do that would be to make use of the LIMIT clause.

Example:

dbadmin=> SELECT COUNT(*) FROM big_number_table;
   COUNT
------------
1000000000
(1 row)

dbadmin=> SELECT 0.05*1000000000 "5_percent";
  5_percent
-------------
50000000.00
(1 row)

dbadmin=> CREATE TABLE big_number_table_5_pct AS SELECT * FROM big_number_table LIMIT 50000000;
CREATE TABLE
Time: First fetch (0 rows): 13057.251 ms. All rows formatted: 13057.281 ms

The problem with that method is the team won’t get a very good sampling of the data. That is, they’ll be getting the first 5% of the rows that Vertica retrieves from the table.

In order to provide a true sampling of the entire data set, Vertica provides the TABLESAMPLE clause.

Example:

dbadmin=> CREATE TABLE big_number_table_5_better_pct AS SELECT * FROM big_number_table TABLESAMPLE(25);
CREATE TABLE
Time: First fetch (0 rows): 93910.763 ms. All rows formatted: 93910.795 ms

It takes a little longer to build the subset of data, but it will be an accurate representation of the production data!

Have fun!

Sign In or Register to comment.