Sampling data from a large dataset using TABLESAMPLE

[Deleted User] · February 2018

This blog post was authored by Jim Knicely.

A development or quality assurance team often times requests access to a sub-set of production data. One way to do that would be to make use of the LIMIT clause.

Example:

dbadmin=> SELECT COUNT(*) FROM big_number_table;
   COUNT
------------
1000000000
(1 row)

dbadmin=> SELECT 0.05*1000000000 "5_percent";
  5_percent
-------------
50000000.00
(1 row)

dbadmin=> CREATE TABLE big_number_table_5_pct AS SELECT * FROM big_number_table LIMIT 50000000;
CREATE TABLE
Time: First fetch (0 rows): 13057.251 ms. All rows formatted: 13057.281 ms

The problem with that method is the team won’t get a very good sampling of the data. That is, they’ll be getting the first 5% of the rows that Vertica retrieves from the table.

In order to provide a true sampling of the entire data set, Vertica provides the TABLESAMPLE clause.

Example:

dbadmin=> CREATE TABLE big_number_table_5_better_pct AS SELECT * FROM big_number_table TABLESAMPLE(25);
CREATE TABLE
Time: First fetch (0 rows): 93910.763 ms. All rows formatted: 93910.795 ms

It takes a little longer to build the subset of data, but it will be an accurate representation of the production data!

Have fun!

We're Moving!

Create My New Community Account Now

Sampling data from a large dataset using TABLESAMPLE