Please take this survey to help us learn more about how you use third party tools. Your input is greatly appreciated!

Any UDFs for cardinality estimation out there?

We use Vertica frequently for counting things. Sometimes we need an exact count, but often a little bit of error would be fine. I'm curious whether anyone has experimented with UDFs using algorithms like KMV or HyperLogLog for cardinality estimation in Vertica, and if the code is available anywhere. I'm hoping this would be useful for cutting down on memory usage when querying projections that have been optimized for a different purpose.

Comments

  • Hi Aniket. I'm very familiar with both KMV and HLL. What are your accuracy and memory requirements?
  • ...and which version of Vertica are you running? (p.s. are you the person that I was speaking to at the Big Data / User conference about this topic?)
  • I was not at the conference, though it's good to know there are other people interested! In terms of accuracy I haven't developed firm requirements, but I think something in the range of 0.1% would be reasonable for our application. In terms of memory I am primarily interested in avoiding a GROUP BY hash table spilling to disk, so up to 1-2 GB would be reasonable (my impression is that much less than this should be possible?)
  • Aniket, I didn't see an email notifying me that you had responded, so I'm sorry I'm getting back to you only now.  However, it may have been worth the wait, because I have some excellent news for you:

    In version 7.0.0, we have implemented new aggregate functions that do approximate count distinct.  You may specify the accuracy that you need, and if you need 0.1% accuracy 97% of the time, you can just specify that.  By default, the accuracy is +/- 1%, 97% of the time.  Note that a 97% confidence level corresponds to 2.17 standard deviations.

    You're right about the space: for 0.1% accuracy HLL "only" needs about 6Mb per aggregate in memory.  If you can get by with +/- 5%, 97% of the time (or what is loosely called "typically 2% accuracy" in the HLL paper), then you need a mere 1.5kB per aggregate.  The default accuracy is 1%, which requires about 50kB per aggregate.

    It only gets better from there (much better!), but I'll save some of that for an upcoming blog on the topic.

    Download version 7.0.0 (or later) and give it a try.  I'd love to hear what you think.

        /Jim

Leave a Comment

BoldItalicStrikethroughOrdered listUnordered list
Emoji
Image
Align leftAlign centerAlign rightToggle HTML viewToggle full pageToggle lights
Drop image/file

Can't find what you're looking for? Search the Vertica Documentation, Knowledge Base, or Blog for more information.