Slow APPROXIMATE_COUNT_DISTINCT() with GROUP BY
I'm testing the APPROXIMATE_COUNT_DISTINCT() function. When I run a query on a single table like:
select APPROXIMATE_COUNT_DISTINCT(col1) from big_table;

it's fine and I get results much faster than by calling

select count(distinct col1) from big_table;

But when I use a GROUP BY clause like:

select col2, APPROXIMATE_COUNT_DISTINCT(col1) from big_table group by col2;

I get results much slower (e.g. 10 times slower) than by using

select col2, count(distinct col1) from big_table group by col2;

The only solution is to use error_tolerance = 5 or greater; then I get results very fast. The cardinality of col2 is low (20 in my test case).
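For reference, the workaround passes the tolerance as the function's second argument:

-- With error_tolerance = 5, the grouped query returns quickly again.
select col2, APPROXIMATE_COUNT_DISTINCT(col1, 5) from big_table group by col2;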
It's frustrating. Is this a feature or bug?
Comments
When your table is optimized for count distinct, use count distinct. APPROX should help with "dark data".
Say you did a big load/insert/update and more than 10% of all data changed, so the statistics are stale. Or say you have no statistics (and no sort order) on a particular column, so the exact COUNT DISTINCT needs to do a full scan, while APPROX will not.
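For example, refreshing statistics after such a load is one line (the table name here is just a placeholder):

-- Recompute statistics so the optimizer sees current row counts and
-- value distributions (placeholder table name).
SELECT ANALYZE_STATISTICS('public.big_table');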
PS
This isn't an official answer and I may be wrong.
Actually, it turns out that there is a reason for the approximate form to be slower in certain cases:
APPROXIMATE_COUNT_DISTINCT() uses a statistical algorithm that must take a random sample of your data (with a certain number of data points, a certain distribution, etc.) and then perform some (basic) computation on each sampled value.
If you have a large data set composed of random values, this will of course be much faster than computing on the entire data set.
But consider the following scenario:
- Most values in the data set are duplicates
- Data is stored sorted
- Data is stored encoded with RLE encoding
In this scenario, because of the sorting and the RLE encoding, Vertica actually stores the distinct data values pre-computed on disk, because that's what naturally happens when you RLE-encode sorted data.
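As an illustrative sketch (the projection name and segmentation are placeholders, not anything from this thread), a projection like this would produce that layout:

-- Sorting by col1 and RLE-encoding it stores each distinct value once with
-- a run count, so the distinct values are effectively precomputed on disk.
CREATE PROJECTION big_table_rle (col1 ENCODING RLE, col2) AS
    SELECT col1, col2
    FROM big_table
    ORDER BY col1
    UNSEGMENTED ALL NODES;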
So, COUNT(DISTINCT ...) doesn't require un-encoding the data. In fact, it doesn't even care what the values are, as long as it can prove "well, this block of data must have at least N distinct values". Depending on exactly what your data looks like on disk (cardinality, mergeout, etc.), it almost gets changed from a COUNT() of data to a SUM() on metadata.
But APPROXIMATE_COUNT_DISTINCT() does need the actual data values. So it must do somewhat more decoding work.
You might ask "well, why would we ever use the approximate algorithm if the brute-force one has access to more information so would be faster?" The answer to that question would be, "well, how do we know for sure that it's faster?" As you can see, sometimes the approximate algorithm is still faster. It is a sampling-based approach; if it can sample few enough data values, it may still win out. The only way to get a reasonable estimate of the performance of the query would be to know how many distinct values are in the data set. Which, well, if we knew that, we'd be done.
(There are fancier ways to try to predict performance and select the right algorithm. We don't do that yet :-) In general, any attempt to predict performance and guess the right algorithm for a particular data set without first considering the data set, runs the risk of getting it wrong sometimes.)
More generally -- APPROXIMATE_COUNT_DISTINCT() does perform considerably more logic per row considered than does COUNT(DISTINCT ...). It hopes to be faster anyway by considering many fewer rows. But the number of rows considered is governed by the parameters that you pass into the function. If you manage to force it to consider every single row, it will certainly be slower than regular COUNT(DISTINCT ...); it's considering the same data set but performing more logic per row considered.

(You might ask "why don't you automatically switch to COUNT(DISTINCT ...) in this case?" We would ask in response "why are you running APPROXIMATE_COUNT_DISTINCT() when you don't want an approximate answer?" It'd be an odd request; maybe you'd have a reason...)

So, draw out those two lines on a graph: APPROXIMATE_COUNT_DISTINCT() performance and COUNT(DISTINCT ...) performance. The latter is a flat line indicating its runtime; the former is a curve with runtime as a function of error_tolerance. At no/zero error tolerance, trying to approximate the result will always be slower. At a high error tolerance, it should be faster. So the two curves must always cross somewhere. For you, on your data set, they happen to cross at 5.
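If you want to see where the curves cross on your own data, timing the three variants side by side is the easiest check (a sketch; in vsql you can turn on \timing first):

-- Exact count: the flat line on the graph.
select col2, count(distinct col1) from big_table group by col2;

-- Approximate count at the default error tolerance (1.25%).
select col2, APPROXIMATE_COUNT_DISTINCT(col1) from big_table group by col2;

-- Approximate count at a 5% error tolerance: cheaper per group, which is
-- where the crossover happened on this data set.
select col2, APPROXIMATE_COUNT_DISTINCT(col1, 5) from big_table group by col2;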
Let me know if this makes sense.
Adam
Thanks for your time. I think I get your points, but I still think something is going wrong (or could be better). I guess APPROXIMATE_COUNT_DISTINCT uses a kind of HyperLogLog algorithm, and I think I'm familiar with its basic concept.
I've written some simple test cases. The first 3 test cases give expected results (and demonstrate that count distinct can be significantly faster than approximate count distinct), but the 4th test case gives odd results, and I think it might show there is a bug that creates a significant slowdown.
The test is done on a single node (k-safe=0).
For the plans that seem to get much slower -- how much memory is Vertica using to execute those? And what does the query plan (with EXPLAIN) look like?
You're making quite a lot of groups there. And greater precision requires more resources per group. I'm wondering if the query is deciding, based on your resource pool settings, that it doesn't have enough memory to keep all of the groupings in memory simultaneously. Which is, of course, one way to cause a query to slow down due to lack of resources.
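One way to gather both pieces of information (a sketch; the v_monitor columns may differ slightly by version, so double-check against yours):

-- Show the chosen plan for the slow variant.
EXPLAIN select col2, APPROXIMATE_COUNT_DISTINCT(col1) from big_table group by col2;

-- Run with profiling, then check how much memory the statement acquired.
PROFILE select col2, APPROXIMATE_COUNT_DISTINCT(col1) from big_table group by col2;

SELECT request, memory_acquired_mb
FROM v_monitor.query_requests
ORDER BY start_timestamp DESC
LIMIT 5;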
Adam
I tried to assign all available memory to Vertica; resource usage shows 109 GB assigned to the single query. The test results are the same, though.
There are only 11 groups.
I looked at the explained plans, and the only difference I noticed is in the function name: ApproxCountDistinct5PCT vs. ApproxCountDistinct. There might be a different implementation...
I tried to profile these queries. Disk reads are 134 MB. ApproxCountDistinct5PCT uses 320 MB of memory; ApproxCountDistinct uses 2.5 GB.
Have you been able to replicate the scenario?
Antonin