Please take this survey to help us learn more about how you use third party tools. Your input is greatly appreciated!

Query performance

I have 1 crore data

user_testing=> select count(*) from likes_test ;
  count   
----------
 10000001
(1 row)

Time: First fetch (1 row): 8.088 ms. All rows formatted: 8.116 ms

The data contains like this

 [
{"id":1,
"likes":[401,218,134,941,565,457,864,346,774,921,766,432,815,63,822,202,679,409,457,743,902,199,717,232,172,958,179,808,271,418,145,949,806,399,68,935,423,287,31,365,68,875,347,786,386,792,620,903,136,67,539,196,866,419,390,612,591,881,68,575,226,291,354,380,889,105,412,511,411,861,61,802,388,168,655,296,43,541,824,756,649,2,204,396,231,499,42,773,918,19,229,716,675,762,939,352,35,949,445,787,215,524,998,596,267,444,690,52,762]}
]

Count Query on likes, it takes 29 seconds

#

 user_testing=> select count(id) from likes_test where mapcontainsvalue(likes,'2')='t' or mapcontainsvalue(likes,'4');
  count  
---------
 1968654
(1 row)

Time: First fetch (1 row): 29661.954 ms. All rows formatted: 26661.985 ms

Then I decreased interval of AnalyzeRowCount service,Default value is 60

    user_testing=> ALTER DATABASE db_testing_0 SET AnalyzeRowCountInterval = 10;
        ALTER DATABASE
        Time: First fetch (0 rows): 6.891 ms. All rows formatted: 6.902 ms

Tried the same query, it takes 27 seconds

user_testing=> select count(id) from likes_test where mapcontainsvalue(likes,'2')='t' or mapcontainsvalue(likes,'4');
  count  
---------
 1968654
(1 row)

Time: First fetch (1 row): 27388.772 ms. All rows formatted: 27388.800 ms

Then I created projection and tried the same query it takes 26 seconds

#

 user_testing=> select count(id) from likes_test_p where mapcontainsvalue(likes,'2')='t' or mapcontainsvalue(likes,'4');
  count  
---------
 1968654
(1 row)

Time: First fetch (1 row): 26661.954 ms. All rows formatted: 26661.985 ms

What I have to do to get counts in less time, Provide solution :(
Thanks in Advance

Comments

  • edited January 2018

    Hi!

    Flex tables in Vertica aren't created for performance, but for unstructured/semi-structured data. Following to your example, your data can be represented as structured data, so do unpivot to your data and create a regular table, i.e.:

     id  |   likes
    -----+--------
        1| 201
        1| 315
     ...
    

    Segment your table + create good projection/s (ORDER BY id, likes).

    PS: Vertica is RDBMS and not Document Oriented DBMS.

  • Jim_KnicelyJim_Knicely Administrator
    edited January 2018

    AnalyzeRowCount is only used to update row count statistics. It has nothing to do with running a COUNT query.

    Is the "likes_test" table a FLEX table? These are not built for performance. You will get MUCH better performance if you materialize the columns in your WHERE clause predicates.

    Oops. Just saw the post by @sKwa! He is correct and I would go with his suggestion.

  • You can unroll the likes array with mapvalues(). Something like:
    CREATE FLEX LOCAL TEMP TABLE staging(id int) segmented by hash(id) all nodes;
    COPY staging FROM '...' PARSER fjsonparser();
    INSERT INTO likes SELECT id, mapvalues(likes) OVER (partition by id) FROM staging;

  • Jim_KnicelyJim_Knicely Administrator
    edited January 2018

    Great idea Ben_Vandiver

    Complete example (with one record):

    dbadmin=> CREATE FLEX LOCAL TEMP TABLE staging(id int) ON COMMIT PRESERVE ROWS SEGMENTED BY HASH(id) ALL NODES;
    CREATE TABLE
    
    dbadmin=> \! cat /home/dbadmin/likes.json
    [
    {"id":1,
    "likes":[401,218,134,941,565,457,864,346,774,921,766,432,815,63,822,202,679,409,457,743,902,199,717,232,172,958,179,808,271,418,145,949,806,399,68,935,423,287,31,365,68,875,347,786,386,792,620,903,136,67,539,196,866,419,390,612,591,881,68,575,226,291,354,380,889,105,412,511,411,861,61,802,388,168,655,296,43,541,824,756,649,2,204,396,231,499,42,773,918,19,229,716,675,762,939,352,35,949,445,787,215,524,998,596,267,444,690,52,762]}
    ]
    
    dbadmin=> COPY staging FROM '/home/dbadmin/likes.json' PARSER fjsonparser();
     Rows Loaded
    -------------
               1
    (1 row)
    
    dbadmin=> CREATE TABLE likes (id INT, values INT) ORDER BY id, values SEGMENTED BY HASH(id) ALL NODES;
    CREATE TABLE
    
    dbadmin=> INSERT INTO likes SELECT id, values::int FROM (SELECT id, mapvalues(likes) OVER (partition by id) FROM staging) foo;
     OUTPUT
    --------
        109
    
    (1 row)
    
    dbadmin=> SELECT COUNT(DISTINCT id) FROM likes WHERE values IN (2, 4);
     COUNT
    -------
         1
    (1 row)
    
  • I think that needs to be a COUNT(DISTINCT ID) to reflect the original intention.

  • Jim_KnicelyJim_Knicely Administrator

    I corrected the complete example :D

  • hello @Jim_Knicely @Ben_Vandiver
    I followed what you have said, this is my likes data
    subscriber_data=> select count(*) from likes_proj;commit;
    count
    ------------
    1382890288
    (1 row)

        Time: First fetch (1 row): 154.654 ms. All rows formatted: 154.717 ms
    

    And I need to join this likes_proj with another subscriber_8 table count
    subscriber_data=> select count(*) from subscriber_8;commit;
    count
    ----------
    57548349
    (1 row)

        Time: First fetch (1 row): 7245.067 ms. All rows formatted: 7245.099 ms
    

    I queried like this it takes 1 minute 35 seconds
    subscriber_data=> select count(distinct s1.id) from likes_proj s1,projection_base_subscriber_8 s2 where esp=1 and un=0 and s1.id=s2.id and values in (4,5,6,8,133,699,999);commit;
    count
    ---------
    1959111
    (1 row)

            Time: First fetch (1 row): 95715.810 ms. All rows formatted: 95715.841 ms
    

    Can anyone suggest me to get fast counts?

  • Jim_KnicelyJim_Knicely Administrator
    edited January 2018

    Hi,

    I wouldn't join projections together. You should join tables.

    Can you post the DDL for the tables? I'm sure the projections can be tuned for that query. In fact, why not run database designer on it? It's there to do the hard work for you!

    See:
    https://my.vertica.com/docs/9.0.x/HTML/index.htm#Authoring/AdministratorsGuide/ConfiguringTheDB/PhysicalSchema/DBD/DesignQueries.htm

    Make sure you've analyzed stats on the tables.

    See:
    https://my.vertica.com/docs/9.0.x/HTML/index.htm#Authoring/AdministratorsGuide/Statistics/CollectingDatabaseStatistics.htm

    Try to use the newer ANSI SQL-92 join syntax. I find that it performs much better in Vertica!

    For example:

    select count(distinct s1.id)
    from likes_table s1
    join projection_base_subscriber_8 s2
    on s1.id = s2.id
    where esp=1 
    and un=0
    and values in (4,5,6,8,133,699,999);
    

    See:
    https://my.vertica.com/docs/9.0.x/HTML/index.htm#Authoring/AnalyzingData/Queries/Joins/JoinSyntax.htm

Leave a Comment

BoldItalicStrikethroughOrdered listUnordered list
Emoji
Image
Align leftAlign centerAlign rightToggle HTML viewToggle full pageToggle lights
Drop image/file

Can't find what you're looking for? Search the Vertica Documentation, Knowledge Base, or Blog for more information.