We're Moving!

The Vertica Forum is moving to a new OpenText Analytics Database (Vertica) Community.

Join us there to post discussion topics, learn about

product releases, share tips, access the blog, and much more.

Create My New Community Account Now


Sampling data from a large dataset using TABLESAMPLE — Vertica Forum

Sampling data from a large dataset using TABLESAMPLE

This blog post was authored by Jim Knicely.

A development or quality assurance team often times requests access to a sub-set of production data. One way to do that would be to make use of the LIMIT clause.

Example:

dbadmin=> SELECT COUNT(*) FROM big_number_table;
   COUNT
------------
1000000000
(1 row)

dbadmin=> SELECT 0.05*1000000000 "5_percent";
  5_percent
-------------
50000000.00
(1 row)

dbadmin=> CREATE TABLE big_number_table_5_pct AS SELECT * FROM big_number_table LIMIT 50000000;
CREATE TABLE
Time: First fetch (0 rows): 13057.251 ms. All rows formatted: 13057.281 ms

The problem with that method is the team won’t get a very good sampling of the data. That is, they’ll be getting the first 5% of the rows that Vertica retrieves from the table.

In order to provide a true sampling of the entire data set, Vertica provides the TABLESAMPLE clause.

Example:

dbadmin=> CREATE TABLE big_number_table_5_better_pct AS SELECT * FROM big_number_table TABLESAMPLE(25);
CREATE TABLE
Time: First fetch (0 rows): 93910.763 ms. All rows formatted: 93910.795 ms

It takes a little longer to build the subset of data, but it will be an accurate representation of the production data!

Have fun!

Sign In or Register to comment.