[UDF] Custom intermediate values in User-Defined Aggregate Functions?

Alexey__Kryuchk · September 2013

Hello,

Currently the Vertica::IntermediateAggs class seems to be the only way to pass data between multiple instances of an aggregate function. This works fine when data is reduced to single scalar values during aggregation. However, consider an aggregate function with a complex state: for example, a matrix of arbitrary dimensions, or a non-trivial C++ data structure such as std::map. It looks like it's impossible to pass such state in an IntermediateAggs instance. Is there any way to write UDAFs with complex state / intermediate values that I am missing?

To provide some context on what I'm trying to do: I'm attempting to port a Linear Probabilistic Counter (concisely explained in this gist) to Vertica in the form of an aggregate function. This function's intermediate value is a bitset, which I can't store in an IntermediateAggs instance. Computing the element count from the bitset at the end of aggregate() and storing it in IntermediateAggs is not an option, since it severely skews the result, in the worst case multiplying it by the number of sub-aggregation runs; so this must be done only once, in terminate().
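To make the skew described above concrete, here is a minimal sketch of textbook linear counting in plain C++. The gist is not reproduced here, so this is not its code: the `LinearCounter` type and its member names are hypothetical, `std::hash` is assumed as the hash function, and nothing in it is part of the Vertica SDK. It shows why the bitmap itself has to be the intermediate value: partial bitmaps combine with a bitwise OR, and the estimate is computed once from the merged bitmap.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <string>
#include <vector>

// Hypothetical helper type for illustration; not part of the Vertica SDK.
struct LinearCounter {
    std::vector<std::uint8_t> bits;  // packed bitmap, 8 bits per byte

    explicit LinearCounter(std::size_t num_bits) : bits((num_bits + 7) / 8, 0) {}

    std::size_t num_bits() const { return bits.size() * 8; }

    // Set the bit selected by the value's hash (std::hash is an assumption).
    void add(const std::string& value) {
        std::size_t i = std::hash<std::string>{}(value) % num_bits();
        bits[i / 8] |= static_cast<std::uint8_t>(1u << (i % 8));
    }

    // Combine two partial states: a bit is set if either side saw it.
    void merge(const LinearCounter& other) {
        assert(bits.size() == other.bits.size());
        for (std::size_t k = 0; k < bits.size(); ++k) bits[k] |= other.bits[k];
    }

    // Linear counting estimate: n ~ -m * ln(zero_bits / m).
    // Returns +infinity if the bitmap is saturated (no zero bits left).
    double estimate() const {
        std::size_t zeros = 0;
        for (std::uint8_t byte : bits)
            for (int b = 0; b < 8; ++b)
                if (!(byte & (1u << b))) ++zeros;
        double m = static_cast<double>(num_bits());
        return -m * std::log(static_cast<double>(zeros) / m);
    }
};
```

If two partial counters each see the same 1,000 distinct values, adding their individual estimates gives roughly 2,000, while merging the bitmaps first and estimating once gives roughly 1,000. That is the "multiplied by the number of sub-aggregation runs" effect in the worst case.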

[Deleted User] · September 2013

Hi Alexey,

Speaking to your specific question: are you aware of Vertica's BINARY and VARBINARY types? We can represent bitsets natively: up to 65,000 bytes (520,000 bits) per VARBINARY field, and roughly 32 MB per row in total across all columns, in the current version of Vertica.

As to your more general question: many common data structures can be represented as collections of scalars. Say you're implementing an integer top-K algorithm with a heap, for example. A heap is just an ordered collection of integers, and the provided APIs certainly know how to ship integers around. There's a substantial performance advantage to giving Vertica as much information as you can about the structure of your data. We've worked hard (and are continuing to work hard) to optimize how we ship your data around; the more we know, the smarter we can be.

But if you're really working with some structure that we just don't understand yet, well, all data is just bytes in the end, and we do provide a VARBINARY type; you can serialize your structure into a VARBINARY and call it a day. Vertica doesn't currently provide serialization APIs for you, largely because there are many high-quality open-source libraries that already provide this functionality. For example, if you want something simple, Boost Serialization provides built-in serializers for all of the primitive and STL types in C++, including std::map.

("Why serialization, rather than just passing objects around?" Recall that the different stages of aggregation often don't run on the same computer, so we have to ship your intermediates over the network somehow.)

Adam
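As a rough illustration of the "serialize it into bytes" approach, the sketch below flattens a `std::map` into a byte buffer and reads it back, using only the standard library rather than Boost Serialization. The function names, the length-prefixed layout, and the choice of `std::map<std::string, std::uint64_t>` are all assumptions made for this example; the only figure taken from the thread is the 65,000-byte VARBINARY limit. Copying the resulting bytes into and out of the intermediate VARBINARY value is left to the SDK and is not shown.

```cpp
#include <cstddef>
#include <cstdint>
#include <map>
#include <stdexcept>
#include <string>

// Per-field VARBINARY limit quoted in the thread.
constexpr std::size_t kMaxVarbinaryBytes = 65000;

// Append a 64-bit integer in little-endian order, independent of host byte order.
static void put_u64(std::string& out, std::uint64_t v) {
    for (int i = 0; i < 8; ++i)
        out.push_back(static_cast<char>((v >> (8 * i)) & 0xFF));
}

// Read a little-endian 64-bit integer and advance pos.
static std::uint64_t get_u64(const std::string& in, std::size_t& pos) {
    if (in.size() - pos < 8) throw std::runtime_error("truncated buffer");
    std::uint64_t v = 0;
    for (int i = 0; i < 8; ++i)
        v |= static_cast<std::uint64_t>(static_cast<unsigned char>(in[pos + i])) << (8 * i);
    pos += 8;
    return v;
}

// Hypothetical layout: entry count, then (key length, key bytes, value) per entry.
std::string serialize_counts(const std::map<std::string, std::uint64_t>& m) {
    std::string out;
    put_u64(out, m.size());
    for (const auto& kv : m) {
        put_u64(out, kv.first.size());
        out.append(kv.first);
        put_u64(out, kv.second);
    }
    if (out.size() > kMaxVarbinaryBytes)
        throw std::length_error("state does not fit in one VARBINARY field");
    return out;
}

std::map<std::string, std::uint64_t> deserialize_counts(const std::string& in) {
    std::map<std::string, std::uint64_t> m;
    std::size_t pos = 0;
    std::uint64_t n = get_u64(in, pos);
    for (std::uint64_t i = 0; i < n; ++i) {
        std::uint64_t len = get_u64(in, pos);
        if (in.size() - pos < len) throw std::runtime_error("truncated buffer");
        std::string key = in.substr(pos, static_cast<std::size_t>(len));
        pos += static_cast<std::size_t>(len);
        m[key] = get_u64(in, pos);
    }
    return m;
}
```

The explicit byte order is a deliberate choice: since intermediates may travel between machines, the sketch avoids depending on how any one host lays out integers in memory. A bitset state like the one in the original question needs no such framing, because it is already a flat run of bytes.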

Alexey__Kryuchk · September 2013

Hi Adam,

Thanks for the quick reply. VARBINARY will certainly work for me. Two additional questions:

1. Am I correct in assuming that [VAR]BINARY is internally represented by the VString class? (Not that there are any other options, but the SDK docs talk about character encoding, which confuses me: does VString attempt to do any encoding/decoding itself, or is it really a "dumb" container for bytes?)

2. Is 32MB/row (512 columns?) a hard limit in the current version of Vertica? Are such limits documented anywhere? (I must have missed the page while reading the documentation...)

[Deleted User] · September 2013

Hi Alexey,

No problem. In answer:

1. Yep, you're correct here. Sorry for the confusing documentation. VString is, as you hypothesize, a "dumb" container for bytes. If you'd like specific details, you can take a look at the full implementation of VString, as it's included in the SDK.

2. I'm unfortunately not sure of the best way to find this page without specifically searching for it, but our documentation has a page entitled "System Limits": https://my.vertica.com/docs/6.1.x/HTML/index.htm#10538.htm The limit is actually either 1,600 columns or 32 MB, whichever is reached first. (Technically, it's 32,768,000 bytes, not 32 MB; it also includes a few bytes of overhead per variable-width column to track the data size. I don't immediately see this in the documentation; I'll see if we can clarify it there.) Both limits exist because, as a compressed column store, we typically load multiple records at once and we require some system resources for each column in a query. So as you approach either of these limits, particularly if you're running more complex queries, you may find Vertica bumping into the limits of your system's resources.

Adam

Alexey__Kryuchk · September 2013

Okay, this answers all the questions I had. Thanks Adam!
