
slow APPROXIMATE_COUNT_DISTINCT() with group by

I'm testing the APPROXIMATE_COUNT_DISTINCT() function. When I run a query on a single table like:
select APPROXIMATE_COUNT_DISTINCT(col1) from big_table;
it's fine and I get results much faster than by calling
select count(distinct col1) from big_table;
But when I use a group by clause like:
select col2, APPROXIMATE_COUNT_DISTINCT(col1) from big_table group by col2;
I get results much slower (e.g. 10 times) than by using
select col2, count(distinct col1) from big_table group by col2;
The only solution is to use error_tolerance = 5 or greater; then I get results very fast. The cardinality of col2 is low (20 distinct values in my test case).
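The claim about col2's cardinality comes from a simple check along these lines (same placeholder names as above):

-- col2 has only about 20 distinct values in my test case
select count(distinct col2) from big_table;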

It's frustrating. Is this a feature or bug?

Comments

  • Options
    Hi!

    If your table is already optimized for count distinct, just use count distinct. APPROX should help with "dark data".
    Say you did a big load/insert/update and more than 10% of the data changed, so the statistics are broken. Or say you have no statistics (and no sort order) on a particular column, so the exact COUNT DISTINCT needs to do a full scan, while APPROX will not.
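
    If stale statistics are the suspect, refreshing them is cheap to try; a minimal sketch, with big_table standing in for the real table:

    -- recompute optimizer statistics for the table and its columns
    SELECT ANALYZE_STATISTICS('big_table');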

    PS
    This isn't an official answer and I may be wrong.
  • Options
    I can imagine that count distinct could be a little bit faster than any estimate when the data are ordered. Even then, the gap should be marginal. When the data aren't ordered the right way, there is no reason for an estimate to be slower. What I observe:
    • APPROXIMATE_COUNT_DISTINCT( column , 5 ) - 60 sec.
    • COUNT(DISTINCT column) - 90 sec.
    • APPROXIMATE_COUNT_DISTINCT( column , 4 ) - 600 sec.
  • Options
    Hi Antonin,

    Actually, it turns out that there is a reason for the approximate form to be slower in certain cases:

    APPROXIMATE_COUNT_DISTINCT() uses a statistical algorithm that must take a random sample of your data with a certain number of data points, with a certain distribution, etc; then perform some (basic) computation on each sampled value.

    If you have a large data set composed of random values, this will of course be much faster than computing on the entire data set.

    But consider the following scenario:
    - Most values in the data set are duplicates
    - Data is stored sorted
    - Data is stored encoded with RLE encoding

    In this scenario, because of the sorting and the RLE encoding, Vertica actually stores the distinct data values pre-computed on disk, because that's what naturally happens when you RLE sorted data.

    So, COUNT(DISTINCT ...) doesn't require un-encoding the data.  In fact, it doesn't even care what the values are, as long as it can prove "well, this block of data must have at least N distinct values".  Depending on exactly what your data looks like on disk (cardinality, mergeout, etc), it almost gets changed from a COUNT() of data to a SUM() on metadata.

    But APPROXIMATE_COUNT_DISTINCT() does need the actual data values.  So it must do somewhat more decoding work.
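
    To make that scenario concrete: a projection along these lines (a sketch only, reusing the placeholder names from your original post) stores col1 sorted and RLE-encoded, which is exactly the layout where COUNT(DISTINCT ...) can lean on metadata while APPROXIMATE_COUNT_DISTINCT() still has to materialize the values:

    -- sketch: col1 sorted and explicitly RLE-encoded,
    -- so each distinct value is stored once with a run length
    CREATE PROJECTION public.big_table_col1_rle
    (
     col1 ENCODING RLE
    )
    AS
     SELECT big_table.col1
     FROM public.big_table
     ORDER BY big_table.col1
    SEGMENTED BY hash(big_table.col1) ALL NODES;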

    You might ask "well, why would we ever use the approximate algorithm if the brute-force one has access to more information so would be faster?"  The answer to that question would be, "well, how do we know for sure that it's faster?"  As you can see, sometimes the approximate algorithm is still faster.  It is a sampling-based approach; if it can sample few enough data values, it may still win out.  The only way to get a reasonable estimate of the performance of the query would be to know how many distinct values are in the data set.  Which, well, if we knew that, we'd be done.

    (There are fancier ways to try to predict performance and select the right algorithm.  We don't do that yet :-)  In general, any attempt to predict performance and guess the right algorithm for a particular data set without first considering the data set, runs the risk of getting it wrong sometimes.)

    More generally -- APPROXIMATE_COUNT_DISTINCT() does perform considerably more logic per row considered than does COUNT(DISTINCT ...).  It hopes to be faster anyway by considering many fewer rows.  But the number of rows considered is governed by the parameters that you pass into the function.  If you manage to force it to consider every single row, it will certainly be slower than regular COUNT(DISTINCT ...); it's considering the same data set but performing more logic per row considered.

    (You might ask "why don't you automatically switch to COUNT(DISTINCT ...) in this case?"  We would ask in response "why are you running APPROXIMATE_COUNT_DISTINCT() when you don't want an approximate answer?"  It'd be an odd request; maybe you'd have a reason...)

    So, draw out those two lines on a graph: APPROXIMATE_COUNT_DISTINCT() performance and COUNT(DISTINCT ...) performance.  The latter is a flat line indicating its runtime; the former is a curve with runtime as a function of error_tolerance.  At no/zero error tolerance, trying to approximate the result will always be slower.  At high error tolerance, it should be faster.  So the two lines (curves) must always cross somewhere.  For you, on your data set, they happen to cross at 5.
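
    If you want to see where your own curves cross, a rough way (assuming vsql and its \timing meta-command; names are the placeholders from your post) is to time the same query at a few tolerances:

    \timing

    -- the flat line: exact count distinct
    select col2, count(distinct col1) from big_table group by col2;

    -- a few points on the approximate curve, from strict to loose tolerance
    select col2, APPROXIMATE_COUNT_DISTINCT(col1, 1) from big_table group by col2;
    select col2, APPROXIMATE_COUNT_DISTINCT(col1, 5) from big_table group by col2;
    select col2, APPROXIMATE_COUNT_DISTINCT(col1, 10) from big_table group by col2;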

    Adam
  • Options
    Just to clarify -- you would need to set both the expected number of errors and the confidence level in order to actually force the algorithm to attempt a result that truly has 0 error.  I'm not sure that we enable this in the released version of the function.  So it is possible that you'll never see the worse-than-COUNT(DISTINCT ...) end of the above curve, by adjusting the values that you are able to control.  But that's just a matter of what your data looks like.
  • Options
    Also, sorry, to correct -- the algorithm isn't actually using a sampling approach.  But the exact details of what's going on would be a whole separate question.  The overall concept is similar -- if you ask for more accuracy, this algorithm requires more resources; eventually this means that it will be slower.

    Let me know if this makes sense.
  • Options
    Hi Adam,

    thanks for your time. I think I get your points, but I still think something is going wrong (or could be better). I guess APPROXIMATE_COUNT_DISTINCT uses some kind of HyperLogLog algorithm, and I think I am familiar with its basic concept.

    I've written some simple test cases. The first 3 test cases give expected results (and demonstrate that count distinct can be significantly faster than approximate count distinct), but the 4th test case gives odd results and I think it might show a bug that creates a significant slowdown.

    The test is done on a single node (k-safe=0).
    --create table
    create table test_count_distinct as
    select
        round(RANDOM()*10,0) gb_col,
        round(RANDOM()*1000000,0) count_col
    from
        big_table
    limit 1000000000; --580sec

    --check existing projections
    select * from PROJECTIONS where projection_basename='test_count_distinct';
    select * from PROJECTION_COLUMNS where projection_name='test_count_distinct_super';

    --get source code of the only projection
    select EXPORT_OBJECTS('','test_count_distinct_super',0);

    --source:
    CREATE PROJECTION public.test_count_distinct_super /*+basename(test_count_distinct),createtype(A)*/
    (
     gb_col,
     count_col
    )
    AS
     SELECT test_count_distinct.gb_col,
            test_count_distinct.count_col
     FROM public.test_count_distinct
     ORDER BY test_count_distinct.gb_col,
              test_count_distinct.count_col
    SEGMENTED BY hash(test_count_distinct.gb_col, test_count_distinct.count_col) ALL NODES;
    --table is ordered, should perform well in group by queries

    --check number of rows
    select count(1) from test_count_distinct; --1000000000 rows

    --test case 1) count distinct without group by
    select count(distinct count_col) from test_count_distinct; --9sec
    select APPROXIMATE_COUNT_DISTINCT(count_col,5) from test_count_distinct; --4sec
    select APPROXIMATE_COUNT_DISTINCT(count_col,4) from test_count_distinct; --4sec
    select APPROXIMATE_COUNT_DISTINCT(count_col) from test_count_distinct; --4sec
    --results as expected

    --test case 2) count distinct with group by
    select gb_col, count(distinct count_col) from test_count_distinct group by gb_col; --3sec
    select gb_col, APPROXIMATE_COUNT_DISTINCT(count_col,5) from test_count_distinct group by gb_col; --49sec
    select gb_col, APPROXIMATE_COUNT_DISTINCT(count_col,4) from test_count_distinct group by gb_col; --55sec
    select gb_col, APPROXIMATE_COUNT_DISTINCT(count_col) from test_count_distinct group by gb_col; --55sec
    --results as expected, count(distinct expr) is much faster because of the ordered table
    --(well, is it ok that it's nearly 15 times faster? Doesn't this correlate with the cardinality of gb_col?)

    ------------------------------

    drop table test_count_distinct;

    --repeat the same test case but add a column that causes disorder of the other data
    create table test_count_distinct as
    select
        round(RANDOM()*10000000000,0) disorder_col,
        round(RANDOM()*10,0) gb_col,
        round(RANDOM()*1000000,0) count_col
    from
        big_table
    limit 1000000000; --1031sec

    --check projections again
    select * from PROJECTIONS where projection_basename='test_count_distinct';
    select * from PROJECTION_COLUMNS where projection_name='test_count_distinct_super';

    --get projection source code
    select EXPORT_OBJECTS('','test_count_distinct_super',0);

    --the projection source code:
    CREATE PROJECTION public.test_count_distinct_super /*+basename(test_count_distinct),createtype(A)*/
    (
     disorder_col,
     gb_col,
     count_col
    )
    AS
     SELECT test_count_distinct.disorder_col,
            test_count_distinct.gb_col,
            test_count_distinct.count_col
     FROM public.test_count_distinct
     ORDER BY test_count_distinct.disorder_col,
              test_count_distinct.gb_col,
              test_count_distinct.count_col
    SEGMENTED BY hash(test_count_distinct.disorder_col, test_count_distinct.gb_col, test_count_distinct.count_col) ALL NODES;

    --check number of rows
    select count(1) from test_count_distinct; --1000000000

    --test case 3) count distinct without group by, disordered table
    select count(distinct count_col) from test_count_distinct; --125sec
    select APPROXIMATE_COUNT_DISTINCT(count_col,5) from test_count_distinct; --7sec
    select APPROXIMATE_COUNT_DISTINCT(count_col,4) from test_count_distinct; --7sec
    select APPROXIMATE_COUNT_DISTINCT(count_col) from test_count_distinct; --7sec
    --expected results, estimate is much faster

    --test case 4) count distinct with group by, disordered table
    select gb_col, count(distinct count_col) from test_count_distinct group by gb_col; --468sec
    select gb_col, APPROXIMATE_COUNT_DISTINCT(count_col,5) from test_count_distinct group by gb_col; --14sec
    select gb_col, APPROXIMATE_COUNT_DISTINCT(count_col,4) from test_count_distinct group by gb_col; --1570sec !!!!
    select gb_col, APPROXIMATE_COUNT_DISTINCT(count_col) from test_count_distinct group by gb_col; --1569sec !!!!
    --observing problem

    drop table test_count_distinct;
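
    (Side note: a rough way to see how much the sort order helps compression here, assuming the v_monitor.storage_containers system table, is to compare the on-disk size of the table after each of the two CREATE TABLE statements above:)

    select projection_name, sum(used_bytes) as used_bytes, sum(total_row_count) as row_count
    from v_monitor.storage_containers
    where projection_name like 'test_count_distinct%'
    group by projection_name;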
  • Options
    Hi Antonin,

    For the plans that seem to get much slower -- how much memory is Vertica using to execute those?  And what does the query plan (with EXPLAIN) look like?

    You're making quite a lot of groups there.  And greater precision requires more resources per group.  I'm wondering if the query is deciding, based on your resource pool settings, that it doesn't have enough memory to keep all of the groupings in memory simultaneously.  Which is, of course, one way to cause a query to slow down due to lack of resources.
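
    If it helps, a rough way to pull those numbers (a sketch; the counter filter is deliberately loose) is to run the slow statement under PROFILE and then look at the engine counters:

    -- run the slow variant under the profiler; PROFILE prints a hint with
    -- the transaction_id/statement_id you can add to the filter below
    PROFILE select gb_col, APPROXIMATE_COUNT_DISTINCT(count_col,4)
    from test_count_distinct group by gb_col;

    -- sum the memory-related counters per operator
    select operator_name, counter_name, sum(counter_value) as total
    from v_monitor.execution_engine_profiles
    where counter_name ilike '%memory%'
    group by operator_name, counter_name
    order by total desc;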

    Adam
  • Options
    Hi Adam,

    I tried to assign all available memory to Vertica - resource usage shows 109GB assigned to the single query. The test results are the same, though.

    There are only 11 groups.

    Explained plans:
    ------------------------------
    QUERY PLAN DESCRIPTION:
    ------------------------------

    explain select gb_col, APPROXIMATE_COUNT_DISTINCT(count_col,5) from test_count_distinct group by gb_col;

    Access Path:
    +-GROUPBY HASH (LOCAL RESEGMENT GROUPS) [Cost: 2M, Rows: 10K (NO STATISTICS)] (PATH ID: 1)
    |  Aggregates: ApproxCountDistinct5PCT(CASE WHEN (test_count_distinct.count_col IS NULL) THEN (-1) ELSE murmurhash(test_count_distinct.count_col) END)
    |  Group By: test_count_distinct.gb_col
    | +---> STORAGE ACCESS for test_count_distinct [Cost: 2M, Rows: 1B (NO STATISTICS)] (PATH ID: 2)
    | |      Projection: public.test_count_distinct_super
    | |      Materialize: test_count_distinct.gb_col, test_count_distinct.count_col
    and 
    ------------------------------
    QUERY PLAN DESCRIPTION:
    ------------------------------

    explain select gb_col, APPROXIMATE_COUNT_DISTINCT(count_col,4) from test_count_distinct group by gb_col;

    Access Path:
    +-GROUPBY HASH (LOCAL RESEGMENT GROUPS) [Cost: 2M, Rows: 10K (NO STATISTICS)] (PATH ID: 1)
    |  Aggregates: ApproxCountDistinct(CASE WHEN (test_count_distinct.count_col IS NULL) THEN (-1) ELSE murmurhash(test_count_distinct.count_col) END)
    |  Group By: test_count_distinct.gb_col
    | +---> STORAGE ACCESS for test_count_distinct [Cost: 2M, Rows: 1B (NO STATISTICS)] (PATH ID: 2)
    | |      Projection: public.test_count_distinct_super
    | |      Materialize: test_count_distinct.gb_col, test_count_distinct.count_col
    I noticed the only difference is in the aggregate function name: ApproxCountDistinct5PCT vs. ApproxCountDistinct. There might be a different implementation...

    I tried to profile these queries. Disk reads are 134MB. ApproxCountDistinct5PCT uses 320MB of memory, ApproxCountDistinct uses 2.5GB.

    Have you been able to replicate the scenario?

    Antonin

