Avoid Using Functions on Very Large Data Sets
This blog post was authored by Jim Knicely.
You can store billions and billions and billions (i.e., a lot) of records in your Vertica tables. When querying these large data sets, try to avoid using database functions like TO_DATE, TO_CHAR, NVL, etc. when unnecessary.
Example:
A table named BIG_DATE_TABLE has 1 billion rows and a column named THE_DATE, which stores a date value as an integer in the format YYYYMMDD.
dbadmin=> SELECT COUNT(*) FROM big_date_table; COUNT ------------ 1000000000 (1 row)
I want to know how many records there are in the table where THE_DATE column is equal to the current SYSDATE.
My first attempt at getting that answer is the following SQL statement:
dbadmin=> SELECT COUNT(*) FROM big_date_table WHERE TO_DATE(the_date::VARCHAR, 'YYYYMMDD') = TRUNC(SYSDATE); COUNT -------- 134976 (1 row) Time: First fetch (1 row): 78354.856 ms. All rows formatted: 78354.900 ms
That query took a whopping 80 seconds to run!
Fortunately I realized a better solution is to avoid using the TO_DATE function directly against the billion records!
dbadmin=> SELECT COUNT(*) FROM big_date_table WHERE the_date = TO_CHAR(SYSDATE, 'YYYYMMDD')::INT; COUNT -------- 134976 (1 row) Time: First fetch (1 row): 782.399 ms. All rows formatted: 782.440 ms
That’s better! Have Fun!