Any UDFs for cardinality estimation out there?

Aniket_Schneide · August 2013

We use Vertica frequently for counting things. Sometimes we need an exact count, but often a little bit of error would be fine. I'm curious whether anyone has experimented with UDFs using algorithms like KMV or HyperLogLog for cardinality estimation in Vertica, and if the code is available anywhere. I'm hoping this would be useful for cutting down on memory usage when querying projections that have been optimized for a different purpose.

[Deleted User] · August 2013

Hi Aniket. I'm very familiar with both KMV and HLL. What are your accuracy and memory requirements?

[Deleted User] · August 2013

...and which version of Vertica are you running? (p.s. are you the person that I was speaking to at the Big Data / User conference about this topic?)

Aniket_Schneide · August 2013

I was not at the conference, though it's good to know there are other people interested! In terms of accuracy I haven't developed firm requirements, but I think something in the range of 0.1% would be reasonable for our application. In terms of memory I am primarily interested in avoiding a GROUP BY hash table spilling to disk, so up to 1-2 GB would be reasonable (my impression is that much less than this should be possible?)

[Deleted User] · December 2013

Aniket, I didn't see an email notifying me that you had responded, so I'm sorry I'm getting back to you only now. However, it may have been worth the wait, because I have some excellent news for you:

In version 7.0.0, we have implemented new aggregate functions that do approximate count distinct. You may specify the accuracy that you need, and if you need 0.1% accuracy 97% of the time, you can just specify that. By default, the accuracy is +/- 1%, 97% of the time. Note that a 97% confidence level corresponds to 2.17 standard deviations.

You're right about the space: for 0.1% accuracy HLL "only" needs about 6Mb per aggregate in memory. If you can get by with +/- 5%, 97% of the time (or what is loosely called "typically 2% accuracy" in the HLL paper), then you need a mere 1.5kB per aggregate. The default accuracy is 1%, which requires about 50kB per aggregate.

It only gets better from there (much better!), but I'll save some of that for an upcoming blog on the topic.

Download version 7.0.0 (or later) and give it a try. I'd love to hear what you think.

/Jim

We're Moving!

Create My New Community Account Now

Any UDFs for cardinality estimation out there?

Comments

Leave a Comment