Any UDFs for cardinality estimation out there?
We use Vertica frequently for counting things. Sometimes we need an exact count, but often a little bit of error would be fine. I'm curious whether anyone has experimented with UDFs using algorithms like KMV or HyperLogLog for cardinality estimation in Vertica, and if the code is available anywhere. I'm hoping this would be useful for cutting down on memory usage when querying projections that have been optimized for a different purpose.
0
Comments
In version 7.0.0, we have implemented new aggregate functions that do approximate count distinct. You may specify the accuracy that you need, and if you need 0.1% accuracy 97% of the time, you can just specify that. By default, the accuracy is +/- 1%, 97% of the time. Note that a 97% confidence level corresponds to 2.17 standard deviations.
You're right about the space: for 0.1% accuracy HLL "only" needs about 6Mb per aggregate in memory. If you can get by with +/- 5%, 97% of the time (or what is loosely called "typically 2% accuracy" in the HLL paper), then you need a mere 1.5kB per aggregate. The default accuracy is 1%, which requires about 50kB per aggregate.
It only gets better from there (much better!), but I'll save some of that for an upcoming blog on the topic.
Download version 7.0.0 (or later) and give it a try. I'd love to hear what you think.
/Jim