statistics on topk projection

Hi,
I have a question regarding collecting of statistics.
It's possible to collect them on a topk projection? how?
I am using 8.0 version.

Tried with:
select analyze_histogram(projection_name,100);
select analyze_statistics(projection_name);

Even if the operations completed with 0, with success, the projections' columns appear having statistics_type = ROWCOUNT , and the projection having has_statistics = false.

My problem is related to the velocity of querying the projection. It seems querying the topk projection the response time is 10 times greater then querying the table itself. The topk projection is order by appropriated columns.
The explain plan shows a lower cost (1000 times lower) when the projection is used, but the reality is different.
Thank you,
Veronica

Comments

  • My experience is statistics are one of the keys to optimum query performance. This is the first time I've heard of issues around updating them for TopK. I'll run this by some technical people I'm friendly with and see what we can find out.

  • Thank you!

  • Jim_KnicelyJim_Knicely - Select Field - Administrator

    Fyi... From the docs:

    Restrictions (of ANALYZE_STATISTICS) - Vertica collects no statistics on live aggregate and Top-K projections that are anchored to the specified table.

    See:
    https://my.vertica.com/docs/8.0.x/HTML/index.htm#Authoring/SQLReferenceManual/Functions/VerticaFunctions/ANALYZE_STATISTICS.htm

  • yes, I read the documentation.
    anyway, no error is thrown if I run

    select analyze_histogram(projection_name,100);
    select analyze_statistics(projection_name);

    Question: how to collect statistics for a certain projection ?

  • You can only run ANALYZE_STATISTICS against the entire database or a particular schema, table or column; it is not run against a projection.

  • Jim_KnicelyJim_Knicely - Select Field - Administrator

    Hi,

    Here is an example in 8.1.1:

    dbadmin=> select version();
                  version
    ------------------------------------
     Vertica Analytic Database v8.1.1-5
    (1 row)
    
    dbadmin=> create table test (c1 int, c2 int) order by c1;
    CREATE TABLE
    
    dbadmin=> create projection test_jim as select * from test order by c2;
    CREATE PROJECTION
    
    dbadmin=> create projection test_ap as select c1, sum(c2) sum_c2 from test group by c1;
    WARNING 6852:  Live Aggregate Projection "test_ap" will be created for "test". Data in "test" will be neither updated nor deleted
    CREATE PROJECTION
    
    dbadmin=> insert into test select 1, 1;
     OUTPUT
    --------
          1
    (1 row)
    
    dbadmin=> commit;
    COMMIT
    
    dbadmin=> select projection_schema, projection_name, has_statistics from projections where anchor_table_name = 'test';
     projection_schema | projection_name | has_statistics
    -------------------+-----------------+----------------
     public            | test_super      | f
     public            | test_jim        | f
     public            | test_ap         | f
    (3 rows)
    
    dbadmin=> select analyze_statistics('public.test_jim');
     analyze_statistics
    --------------------
                      0
    (1 row)
    
    dbadmin=> select projection_schema, projection_name, has_statistics from projections where anchor_table_name = 'test';
     projection_schema | projection_name | has_statistics
    -------------------+-----------------+----------------
     public            | test_super      | t
     public            | test_jim        | t
     public            | test_ap         | f
    (3 rows)
    
    dbadmin=> select analyze_statistics('public.test_super');
     analyze_statistics
    --------------------
                      0
    (1 row)
    
    dbadmin=> select projection_schema, projection_name, has_statistics from projections where anchor_table_name = 'test';
     projection_schema | projection_name | has_statistics
    -------------------+-----------------+----------------
     public            | test_super      | t
     public            | test_jim        | t
     public            | test_ap         | f
    (3 rows)
    
    dbadmin=> select analyze_statistics('public.test_ap');
     analyze_statistics
    --------------------
                      0
    (1 row)
    
    dbadmin=> select projection_schema, projection_name, has_statistics from projections where anchor_table_name = 'test';
     projection_schema | projection_name | has_statistics
    -------------------+-----------------+----------------
     public            | test_super      | t
     public            | test_jim        | t
     public            | test_ap         | f
    (3 rows)
    

    The only projection I can not get stats on is the aggregate projection ... Just like the doc says.

  • Jim_KnicelyJim_Knicely - Select Field - Administrator
    edited September 2017

    Fyi ... You don't get an error when you run analyze statistics on a Live Aggregate Projection because it does gather a row count stat...

    From my example above:

    dbadmin=> select distinct projection_name, statistics_type, statistics_updated_timestamp from projection_columns where projection_name in ('test_super', 'test_jim', 'test_ap');
     projection_name | statistics_type | statistics_updated_timestamp
    -----------------+-----------------+-------------------------------
     test_jim        | FULL            | 2017-09-20 10:16:36.069871-04
     test_super      | FULL            | 2017-09-20 10:16:36.069871-04
     test_ap         | ROWCOUNT        | 2017-09-20 10:16:36.069871-04
    (3 rows)
    
  • Thank you very much,Jim!
    All of your explanations I discovered myself.

    So, your response to my question:
    It's possible to collect them on a topk projection? how?
    is : NO, isn't possible to have statistics for a topk projection . True?

    How fast does topk projections respond without statistics ?

    I will try to come back with example.

  • I have a table with 1300 columns that stores data for 1290 sensors every 10 minutes, 10 columns being used to store location and time for every measurement.
    From time to time a correction is received for a certain location/time. I need to consider only the last data/correction stored. For this reason I thought a topk projection can help.

    CREATE TABLE measure
    (
    obj_position_id int NOT NULL,
    obj_id int NOT NULL,
    date timestamp NOT NULL,
    year_nbr int NOT NULL,
    month_nbr int NOT NULL ,
    day_nbr int NOT NULL,
    hour_nbr int NOT NULL ,
    "offset" numeric(4,2) NOT NULL,
    deleted int NOT NULL,
    created timestamp NOT NULL,
    measure_1 numeric(8,5),
    ...............
    measure_1290 numeric(8,5)
    )
    order by obj_position_id, created;

    create projection measure_last (
    grouped(
    year_nbr,
    obj_id ,
    obj_position_id,
    created,
    date ,
    month_nbr ,
    day_nbr ,
    hour_nbr ,
    "offset" ,
    deleted
    ),
    measure_1 ,
    ...............
    measure_1290 ,
    )
    select
    year_nbr,
    obj_id ,
    obj_position_id,
    created,
    date ,
    month_nbr ,
    day_nbr ,
    hour_nbr ,
    "offset" ,
    deleted
    measure_1 ,
    ...............
    measure_1290 ,
    from measure
    order by year_nbr, obj_id, obj_position_id, created;

    create projection measure_topk (
    grouped(
    year_nbr,
    obj_id ,
    obj_position_id,
    created,
    date ,
    month_nbr ,
    day_nbr ,
    hour_nbr ,
    "offset" ,
    deleted
    ),
    measure_1 ,
    ...............
    measure_1290 ,
    )
    select
    year_nbr,
    obj_id ,
    obj_position_id,
    created,
    date ,
    month_nbr ,
    day_nbr ,
    hour_nbr ,
    "offset" ,
    deleted
    measure_1 ,
    ...............
    measure_1290 ,
    from measure
    limit 1 over (partition by year_nbr, obj_id, obj_position_id order by created desc);

    select measure.obj_id, measure.year_nbr, count(1) cnt
    from measure_last measure
    where
    measure.year_nbr = 2016
    and measure.obj_id = 2
    and measure.obj_position_id >= 220160101000000
    and measure.obj_position_id < 220170101000000
    group by measure.year_nbr, measure.obj_id;
    -----response_time = 1sec

    select measure.obj_id, measure.year_nbr, count(1) cnt
    from measure_topk measure
    where
    measure.year_nbr = 2016
    and measure.obj_id = 2
    and measure.obj_position_id >= 220160101000000
    and measure.obj_position_id < 220170101000000
    group by measure.year_nbr, measure.obj_id;
    -----response_time = 10sec

  • -- select on normal projection
    select measure_last .obj_id, measure_last .year_nbr, count(1) cnt
    from measure_last
    where
    measure_last .year_nbr = 2016
    and measure_last .obj_id = 2
    and measure_last .obj_position_id >= 220160101000000
    and measure_last .obj_position_id < 220170101000000
    group by measure_last .year_nbr, measure_last .obj_id;
    -- response_time 1 sec

    Access Path:
    +-GROUPBY PIPELINED [Cost: 617, Rows: 1] (PATH ID: 1)
    | Aggregates: count(1)
    | Group By: measure_last.year_nbr, measure_last.obj_id
    | +---> STORAGE ACCESS for measure_last [Cost: 616, Rows: 641] (PATH ID: 2)
    | | Projection: measure_last
    | | Materialize: measure_last.year_nbr, measure_last.obj_id
    | | Filter: (measure_last.year_nbr = 2016)
    | | Filter: (measure_last.obj_id= 2)
    | | Filter: ((measure_last.obj_position_id >= 220160101000000) AND (measure_last.obj_position_id < 220170101000000))

    --select on topk projection
    select measure.obj_id, measure.year_nbr, count(1) cnt
    from measure_topk measure
    where
    measure.year_nbr = 2016
    and measure.obj_id= 2
    and measure.obj_position_id >= 220160101000000
    and measure.obj_position_id < 220170101000000
    group by measure.year_nbr, measure.obj_id;
    --response time - 10 sec

    Access Path:
    +-GROUPBY PIPELINED [Cost: 423, Rows: 500 (NO STATISTICS)] (PATH ID: 1)
    | Aggregates: count(1)
    | Group By: measure.year_nbr, measure.obj_id
    | +---> STORAGE ACCESS for measure[Cost: 384, Rows: 98K (NO STATISTICS)] (PATH ID: 2)
    | | Projection: measure_topk
    | | Materialize: measure.year_nbr, measure.obj_id
    | | Filter: (measure.year_nbr = 2016)
    | | Filter: (measure.obj_id= 2)
    | | Filter: ((measure.obj_position_id>= 220160101000000) AND (measure.obj_position_id< 220170101000000))

  • select count(1) from measure_last;
    97831

    select count(1) from measure_topk;
    97831

  • Even if the cost for topk projection is lower the response time is greater.
    The only difference that I could see is
    STORAGE ACCESS for measure[Cost: 384, Rows: 98K (NO STATISTICS)] (PATH ID: 2)
    Question:

    • there is a way to collect statistics ?
    • if there isn't, what is the goal of their usage ?
  • v_monitor.query_requests/
    --------------------request_duration_ms
    select measure_topk - 7941
    select measure_last - 68

    v_monitor.resource_acquisitions
    -----------------------open_file_handle_count,thread_count,memory_inuse_kb,duration_ms
    select measure_topk - 6 ,4 ,9365 ,4
    select measure_last - 8 ,4 ,9239 ,11

  • Jim_KnicelyJim_Knicely - Select Field - Administrator
    edited September 2017

    Hi!

    I think there is something wrong with your data or your TOPK projection. The TOPK projection typically has less rows than a superprojection because of LIMIT clause. But you said that both the measure_topk and measure_last projections both have 97,831 records. Maybe in your data there is only one record for each partition (year_nbr, obj_id and obj_position_id)? Should the partition clause in the TOPK projection be just on the year_nbr and obj_id columns?

    Example of a typical use case:

    dbadmin=> select export_objects('','test');
                                                                                                                                                                                                                                       export_objects                                                                           
    ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
    
    CREATE TABLE public.test
    (
        c1 int,
        c2 int
    );
    
    CREATE PROJECTION public.test_super /*+basename(test),createtype(P)*/
    (
     c1,
     c2
    )
    AS
     SELECT test.c1,
            test.c2
     FROM public.test
     ORDER BY test.c1
    SEGMENTED BY hash(test.c1, test.c2) ALL NODES OFFSET 0;
    
    CREATE PROJECTION public.test_topk1
    (
     c1,
     c2
    )
    AS
     SELECT test.c1,
            test.c2
     FROM public.test
     LIMIT 1
     OVER (
     PARTITION BY  test.c1
     ORDER BY  test.c2)
    ;
    
    SELECT MARK_DESIGN_KSAFE(0);
    (1 row)
    
    dbadmin=> select count(*) from public.test_super;
       count
    -----------
     345292288
    (1 row)
    
    dbadmin=> select count(*) from public.test_topk1;
     count
    -------
         2
    (1 row)
    

    See the difference? 2 records vs 345,292,288!

    dbadmin=> \x
    Expanded display is on.
    
    dbadmin=> select c1, c2, count(1) from test_super group by c1, c2;
     c1 | c2 |  count
    ----+----+----------
      0 |  5 | 17262123
      1 |  2 | 17272385
      1 |  8 | 17260787
      0 |  0 | 17257688
      0 |  6 | 17265756
      1 |  3 | 17268274
      1 |  9 | 17270888
      0 |  1 | 17258749
      0 |  7 | 17263256
      1 |  4 | 17262658
      0 |  2 | 17268376
      0 |  8 | 17263304
      1 |  5 | 17259300
      0 |  3 | 17269025
      0 |  9 | 17264743
      1 |  0 | 17268420
      1 |  6 | 17265067
      0 |  4 | 17257533
      1 |  1 | 17268451
      1 |  7 | 17265505
    (20 rows)
    
    Time: First fetch (20 rows): 5218.275 ms. All rows formatted: 5218.354 ms
    
    dbadmin=> select c1, c2, count(1) from test_topk1 group by c1, c2;
     c1 | c2 | count
    ----+----+-------
      0 |  0 |     1
      1 |  0 |     1
    (2 rows)
    
    Time: First fetch (2 rows): 25.100 ms. All rows formatted: 25.149 ms
    

    See in my case, the TOPK is << 1 second...

  • hi,
    I understand your point of view, but is not my case.
    My table and topk projection have a similar number of records. Not the same, but similar.

    I have a table with 1300 columns that stores data for 1290 sensors every 10 minutes, 10 columns being used to store location and time for every measurement.
    From time to time a correction is received for a certain location/time. I need to consider only the last data/correction stored.

    For the same combination of columns : (year_nbr, obj_id, obj_position_id), I can have 1 or more records, with different 'created' values. Most of them have only 1 record. I am interested in seeing only the records with maximum 'created' value, the last inserted record for a certain (year_nbr, obj_id, obj_position_id). For this reason I thought of using topk projection.

    In a real situation :
    select count(1) from measure_last;
    50.000.000

    select count(1) from measure_topk;
    45.000.000

    Do you have another idea how to have only the last created records and statistics, also ?

    thank you!

  • Jim_KnicelyJim_Knicely - Select Field - Administrator

    Hi,

    How often do you need to query the data that is NOT the last created records and statistics? One option could be to have a measure_hist table where you stick all of the old values. Then in the measure table, you only keep the latest. So when you do an UPDATE, you'd simply insert the current data in measure into measure_hist, then update the record in measure.

    Or instead of a TOPK projection, maybe you can create a better projection for a LIMIT 1 query? Something like this?

    create projection measure_pr (
    year_nbr, 
    obj_id , 
    obj_position_id,
    created,
    date , 
    month_nbr ,
    day_nbr ,
    hour_nbr ,
    "offset" ,
    deleted ,
    measure_1 ,
    ...
    measure_1290
    )
    as
    select 
    year_nbr, 
    obj_id , 
    obj_position_id,
    created,
    date , 
    month_nbr ,
    day_nbr ,
    hour_nbr ,
    "offset" ,
    deleted ,
    measure_1 ,
    ...
    measure_1290
    from measure
    order by year_nbr, obj_id, obj_position_id, created
    segmented by hash(year_nbr, obj_id, obj_position_id) all nodes;
    

    So a query to get the last created records and statistics might look like this:

    select measure.obj_id, measure.year_nbr cnt, measure_1, measure_1290
    from measure
    where
    measure .year_nbr = 2016 
    and measure .obj_id = 2 
    and measure.obj_position_id >= 220160101000000 
    and measure.obj_position_id < 220170101000000
    limit 1 over (partition by year_nbr, obj_id, obj_position_id order by created desc);
    
  • it's a good solution. thank you!
    I intend to use a flex table, because it allows me to use more columns, the forecast is to have 5000 columns. Flex tables can't be updated.

    ERROR: Unsupported access to flex table: No UPDATE support

  • Hi. I've been following your posts, and are you able to comment on the use case for the sensor data? What is it gathering every 10 minutes and what do you do with it? Why 1,300 columns, projected to go to 5,000 columns?

  • Jim_KnicelyJim_Knicely - Select Field - Administrator
    edited October 2017

    Fyi ... all tables in Vertica are limited to 1600 columns. That includes FLEX tables.

    Example:

    dbadmin=> \! head -5 /home/dbadmin/jim.txt
    create FLEX table super_wide_flex_table (
     col_1  int,
     col_2  int,
     col_3  int,
     col_4  int,
    
    dbadmin=> \! tail -5 /home/dbadmin/jim.txt
     col_1596  int,
     col_1597  int,
     col_1598  int,
     col_1599  int,
     col_1600  int); -- too many columns!
    
    dbadmin=> \i /home/dbadmin/jim.txt
    vsql:/home/dbadmin/jim.txt:1601: ROLLBACK 2106:  A table/projection/view can only have up to 1600 columns -- this create statement has 1601
    

    And yes, you cannot UPDATE a flex table, although you can INSERT:

    See:
    https://my.vertica.com/docs/8.1.x/HTML/index.htm#Authoring/FlexTables/InsertingDataIntoFlexTables.htm

    Note you can also DELETE from a FLEX table...

    Example:

    dbadmin=> create flex table jim(c1 int);
    CREATE TABLE
    
    dbadmin=> insert into jim select 1;
     OUTPUT
    --------
          1
    (1 row)
    
    dbadmin=> update jim set c1 = 2;
    ERROR 6092:  Unsupported access to flex table: No UPDATE support
    
    dbadmin=> select c1 from jim;
     c1
    ----
      1
    (1 row)
    
    dbadmin=> delete from jim where c1 = 1;
     OUTPUT
    --------
          1
    (1 row)
    
    dbadmin=> select c1 from jim;
     c1
    ----
    (0 rows)
    
  • Hi Jim,
    Thank you for your response.
    I created a flex table with 10.000 columns, loaded data (10.000 rows) and queried it.
    All columns were virtual. Then, part of them were materialized. A topk projection could be created over it.
    I thought of only inserting data in this flex table and with a topk projection to have fast access to the last record for every (year_nbr, obj_id, obj_position_id).
    But, without statistics, the topk responded 6-20 times slower than the flex table, and 20-40 times slower than a columnar table (limited to 1300 columns).

  • Hi Tom,
    Vertica works fine with sensor data, due to its capability to have up to 1600 columns.
    After storage, data is used for different kind of reports/forecasts. We have so much columns because there is a wide variety of sensors.
    It works/responds very well if only few columns are retrieved. It is easy to install, upgrade. Cluster import/export works fine. Loading of data (COPY) is fast.
    What I would like Vertica to have:

    • to consume less memory,
    • projection with 'where' conditions. It is not enough to have selection at column level, some times row level selection is necessary,
    • statistics for any kind of projection,
    • backup compressed/binary,
    • export/import to/from s3 works only for small tables.
      Anyway, thank you for your question.
  • Jim_KnicelyJim_Knicely - Select Field - Administrator
    edited October 2017

    vcarusi - I was thinking about your issue again. Another option I was contemplating is to add a boolean column (maybe named display_this_row) and set it equal to TRUE for the rows you want to return in your query (i.e. last created records and statistics). That what you can add a simple predicate to your queries and only display those records...

    Like this:

    select 
    year_nbr, 
    obj_id , 
    obj_position_id,
    created,
    date , 
    month_nbr ,
    day_nbr ,
    hour_nbr ,
    "offset" ,
    deleted ,
    measure_1 ,
    ...
    measure_1290
    from measure
    where display_this_row;
    

    I would put the display_this_row column first in a projection's ORDER BY clause and would also partition the measure table by display_this_row. By putting that partition on there, Vertica won't even have to look at the storage container containing the data where display_this_row = FALSE! Although, I guess if you have the partition, there would be no need to put the display_this_row column in the ORDER BY clause of the projection :)

  • vcarusivcarusi
    edited October 2017

    Hi Jim,
    thank you for your involvement. In fact, this is our current approach.
    Anyway, the limitation to 1600 columns remains.
    And, it had been useful if Vertica could have had a projection with 'where' condition.
    Starting for our business is more useful to have partitions defined on (year_nbr , obj_id).
    Thank you, Jim!

Leave a Comment

BoldItalicStrikethroughOrdered listUnordered list
Emoji
Image
Align leftAlign centerAlign rightToggle HTML viewToggle full pageToggle lights
Drop image/file