Table has day, campaign_id and various data columns.
If the table has NULL value in data columns, does that effect on query performance/speed,( like query need to filter on campaign_id in last n days) ?
Not for a columnar DB like Vertica (See attachment). But be careful when segmenting on columns that have alot of NULLs as that data may skew to a particular node.
No. the NULL value , for querying, filtering, sorting, is to be preferred.
I always say - with one of my buddies - that the introduction of the NULL value , with the meaning "I don't know", in relational databases, was practically as important as the introduction of the number 0 by the Arabs (the Romans did not have it): it opened a completely new way of calculating or evaluating truth or falseness of a statement.
On "Do not use columns with a lot of NULL values to segment data": If you have 4 nodes; and 4 million rows in a table; and segment by , say, vendor_id; and vendor_id is NULL in 50% of the cases, and, otherwise, evenly distributed.
Then, you can expect to have 500,000 rows each on 3 nodes, and 2,500,000 rows on the one node that is assigned to the NULL value.
You will have, to over-simplify, all 4 nodes busy until the first 500,000 rows are processed, and then three nodes turning their thumbs while waiting for the fourth to process 2 million more rows all by itself.
Thanks. Then If the column has same value (lke substitute NULL as "unknown") a lot, does the segmentation not to skew to a particular node (we are creating projections based on query to evenly distribute data across nodes); in other words, do I need to remove/substitute the NULL in table in order to have it to improve querying data out of table more efficiently (again the query is to get all data columns filtered on a/list campaign_id within last n days) ?
Like do not use columns with lot null to segment data?
Alright that helpful. Thanks
Can't find what you're looking for? Search the Vertica Documentation, Knowledge Base, or Blog for more information.