Query performance
Table has day, campaign_id and various data columns.
If the table has NULL value in data columns, does that effect on query performance/speed,( like query need to filter on campaign_id in last n days) ?
Best Answers
-
Jim_Knicely - Select Field - Administrator
Not for a columnar DB like Vertica (See attachment). But be careful when segmenting on columns that have alot of NULLs as that data may skew to a particular node.
0 -
marcothesane - Select Field - Administrator
No. the NULL value , for querying, filtering, sorting, is to be preferred.
I always say - with one of my buddies - that the introduction of the NULL value , with the meaning "I don't know", in relational databases, was practically as important as the introduction of the number 0 by the Arabs (the Romans did not have it): it opened a completely new way of calculating or evaluating truth or falseness of a statement.On "Do not use columns with a lot of NULL values to segment data": If you have 4 nodes; and 4 million rows in a table; and segment by , say,
vendor_id
; andvendor_id
is NULL in 50% of the cases, and, otherwise, evenly distributed.
Then, you can expect to have 500,000 rows each on 3 nodes, and 2,500,000 rows on the one node that is assigned to the NULL value.
You will have, to over-simplify, all 4 nodes busy until the first 500,000 rows are processed, and then three nodes turning their thumbs while waiting for the fourth to process 2 million more rows all by itself.0
Answers
Thanks. Then If the column has same value (lke substitute NULL as "unknown") a lot, does the segmentation not to skew to a particular node (we are creating projections based on query to evenly distribute data across nodes); in other words, do I need to remove/substitute the NULL in table in order to have it to improve querying data out of table more efficiently (again the query is to get all data columns filtered on a/list campaign_id within last n days) ?
Like do not use columns with lot null to segment data?
Alright that helpful. Thanks