statistics on topk projection

vcarusi · September 2017

Hi,
I have a question regarding collecting of statistics.
It's possible to collect them on a topk projection? how?
I am using 8.0 version.

Tried with:
select analyze_histogram(projection_name,100);
select analyze_statistics(projection_name);

Even if the operations completed with 0, with success, the projections' columns appear having statistics_type = ROWCOUNT , and the projection having has_statistics = false.

My problem is related to the velocity of querying the projection. It seems querying the topk projection the response time is 10 times greater then querying the table itself. The topk projection is order by appropriated columns.
The explain plan shows a lower cost (1000 times lower) when the projection is used, but the reality is different.
Thank you,
Veronica

TomM · September 2017

My experience is statistics are one of the keys to optimum query performance. This is the first time I've heard of issues around updating them for TopK. I'll run this by some technical people I'm friendly with and see what we can find out.

vcarusi · September 2017

Thank you!

Jim_Knicely · September 2017

Fyi... From the docs:

Restrictions (of ANALYZE_STATISTICS) - Vertica collects no statistics on live aggregate and Top-K projections that are anchored to the specified table.

See:
https://my.vertica.com/docs/8.0.x/HTML/index.htm#Authoring/SQLReferenceManual/Functions/VerticaFunctions/ANALYZE_STATISTICS.htm

vcarusi · September 2017

yes, I read the documentation.
anyway, no error is thrown if I run

select analyze_histogram(projection_name,100);
select analyze_statistics(projection_name);

Question: how to collect statistics for a certain projection ?

drlove · September 2017

You can only run ANALYZE_STATISTICS against the entire database or a particular schema, table or column; it is not run against a projection.

Jim_Knicely · September 2017

Hi,

Here is an example in 8.1.1:

dbadmin=> select version();
              version
------------------------------------
 Vertica Analytic Database v8.1.1-5
(1 row)

dbadmin=> create table test (c1 int, c2 int) order by c1;
CREATE TABLE

dbadmin=> create projection test_jim as select * from test order by c2;
CREATE PROJECTION

dbadmin=> create projection test_ap as select c1, sum(c2) sum_c2 from test group by c1;
WARNING 6852:  Live Aggregate Projection "test_ap" will be created for "test". Data in "test" will be neither updated nor deleted
CREATE PROJECTION

dbadmin=> insert into test select 1, 1;
 OUTPUT
--------
      1
(1 row)

dbadmin=> commit;
COMMIT

dbadmin=> select projection_schema, projection_name, has_statistics from projections where anchor_table_name = 'test';
 projection_schema | projection_name | has_statistics
-------------------+-----------------+----------------
 public            | test_super      | f
 public            | test_jim        | f
 public            | test_ap         | f
(3 rows)

dbadmin=> select analyze_statistics('public.test_jim');
 analyze_statistics
--------------------
                  0
(1 row)

dbadmin=> select projection_schema, projection_name, has_statistics from projections where anchor_table_name = 'test';
 projection_schema | projection_name | has_statistics
-------------------+-----------------+----------------
 public            | test_super      | t
 public            | test_jim        | t
 public            | test_ap         | f
(3 rows)

dbadmin=> select analyze_statistics('public.test_super');
 analyze_statistics
--------------------
                  0
(1 row)

dbadmin=> select projection_schema, projection_name, has_statistics from projections where anchor_table_name = 'test';
 projection_schema | projection_name | has_statistics
-------------------+-----------------+----------------
 public            | test_super      | t
 public            | test_jim        | t
 public            | test_ap         | f
(3 rows)

dbadmin=> select analyze_statistics('public.test_ap');
 analyze_statistics
--------------------
                  0
(1 row)

dbadmin=> select projection_schema, projection_name, has_statistics from projections where anchor_table_name = 'test';
 projection_schema | projection_name | has_statistics
-------------------+-----------------+----------------
 public            | test_super      | t
 public            | test_jim        | t
 public            | test_ap         | f
(3 rows)

The only projection I can not get stats on is the aggregate projection ... Just like the doc says.

Jim_Knicely · September 2017

Fyi ... You don't get an error when you run analyze statistics on a Live Aggregate Projection because it does gather a row count stat...

From my example above:

dbadmin=> select distinct projection_name, statistics_type, statistics_updated_timestamp from projection_columns where projection_name in ('test_super', 'test_jim', 'test_ap');
 projection_name | statistics_type | statistics_updated_timestamp
-----------------+-----------------+-------------------------------
 test_jim        | FULL            | 2017-09-20 10:16:36.069871-04
 test_super      | FULL            | 2017-09-20 10:16:36.069871-04
 test_ap         | ROWCOUNT        | 2017-09-20 10:16:36.069871-04
(3 rows)

vcarusi · September 2017

Thank you very much,Jim!
All of your explanations I discovered myself.

So, your response to my question:
It's possible to collect them on a topk projection? how?
is : NO, isn't possible to have statistics for a topk projection . True?

How fast does topk projections respond without statistics ?

I will try to come back with example.

vcarusi · September 2017

I have a table with 1300 columns that stores data for 1290 sensors every 10 minutes, 10 columns being used to store location and time for every measurement.
From time to time a correction is received for a certain location/time. I need to consider only the last data/correction stored. For this reason I thought a topk projection can help.

CREATE TABLE measure
(
obj_position_id int NOT NULL,
obj_id int NOT NULL,
date timestamp NOT NULL,
year_nbr int NOT NULL,
month_nbr int NOT NULL ,
day_nbr int NOT NULL,
hour_nbr int NOT NULL ,
"offset" numeric(4,2) NOT NULL,
deleted int NOT NULL,
created timestamp NOT NULL,
measure_1 numeric(8,5),
...............
measure_1290 numeric(8,5)
)
order by obj_position_id, created;

create projection measure_last (
grouped(
year_nbr,
obj_id ,
obj_position_id,
created,
date ,
month_nbr ,
day_nbr ,
hour_nbr ,
"offset" ,
deleted
),
measure_1 ,
...............
measure_1290 ,
)
select
year_nbr,
obj_id ,
obj_position_id,
created,
date ,
month_nbr ,
day_nbr ,
hour_nbr ,
"offset" ,
deleted
measure_1 ,
...............
measure_1290 ,
from measure
order by year_nbr, obj_id, obj_position_id, created;

create projection measure_topk (
grouped(
year_nbr,
obj_id ,
obj_position_id,
created,
date ,
month_nbr ,
day_nbr ,
hour_nbr ,
"offset" ,
deleted
),
measure_1 ,
...............
measure_1290 ,
)
select
year_nbr,
obj_id ,
obj_position_id,
created,
date ,
month_nbr ,
day_nbr ,
hour_nbr ,
"offset" ,
deleted
measure_1 ,
...............
measure_1290 ,
from measure
limit 1 over (partition by year_nbr, obj_id, obj_position_id order by created desc);

select measure.obj_id, measure.year_nbr, count(1) cnt
from measure_last measure
where
measure.year_nbr = 2016
and measure.obj_id = 2
and measure.obj_position_id >= 220160101000000
and measure.obj_position_id < 220170101000000
group by measure.year_nbr, measure.obj_id;
-----response_time = 1sec

select measure.obj_id, measure.year_nbr, count(1) cnt
from measure_topk measure
where
measure.year_nbr = 2016
and measure.obj_id = 2
and measure.obj_position_id >= 220160101000000
and measure.obj_position_id < 220170101000000
group by measure.year_nbr, measure.obj_id;
-----response_time = 10sec

vcarusi · September 2017

-- select on normal projection
select measure_last .obj_id, measure_last .year_nbr, count(1) cnt
from measure_last
where
measure_last .year_nbr = 2016
and measure_last .obj_id = 2
and measure_last .obj_position_id >= 220160101000000
and measure_last .obj_position_id < 220170101000000
group by measure_last .year_nbr, measure_last .obj_id;
-- response_time 1 sec

Access Path:
+-GROUPBY PIPELINED [Cost: 617, Rows: 1] (PATH ID: 1)
| Aggregates: count(1)
| Group By: measure_last.year_nbr, measure_last.obj_id
| +---> STORAGE ACCESS for measure_last [Cost: 616, Rows: 641] (PATH ID: 2)
| | Projection: measure_last
| | Materialize: measure_last.year_nbr, measure_last.obj_id
| | Filter: (measure_last.year_nbr = 2016)
| | Filter: (measure_last.obj_id= 2)
| | Filter: ((measure_last.obj_position_id >= 220160101000000) AND (measure_last.obj_position_id < 220170101000000))

--select on topk projection
select measure.obj_id, measure.year_nbr, count(1) cnt
from measure_topk measure
where
measure.year_nbr = 2016
and measure.obj_id= 2
and measure.obj_position_id >= 220160101000000
and measure.obj_position_id < 220170101000000
group by measure.year_nbr, measure.obj_id;
--response time - 10 sec

Access Path:
+-GROUPBY PIPELINED [Cost: 423, Rows: 500 (NO STATISTICS)] (PATH ID: 1)
| Aggregates: count(1)
| Group By: measure.year_nbr, measure.obj_id
| +---> STORAGE ACCESS for measure[Cost: 384, Rows: 98K (NO STATISTICS)] (PATH ID: 2)
| | Projection: measure_topk
| | Materialize: measure.year_nbr, measure.obj_id
| | Filter: (measure.year_nbr = 2016)
| | Filter: (measure.obj_id= 2)
| | Filter: ((measure.obj_position_id>= 220160101000000) AND (measure.obj_position_id< 220170101000000))

vcarusi · September 2017

select count(1) from measure_last;
97831

select count(1) from measure_topk;
97831

vcarusi · September 2017

Even if the cost for topk projection is lower the response time is greater.
The only difference that I could see is
STORAGE ACCESS for measure[Cost: 384, Rows: 98K (NO STATISTICS)] (PATH ID: 2)
Question:

there is a way to collect statistics ?
if there isn't, what is the goal of their usage ?

vcarusi · September 2017

v_monitor.query_requests/
--------------------request_duration_ms
select measure_topk - 7941
select measure_last - 68

v_monitor.resource_acquisitions
-----------------------open_file_handle_count,thread_count,memory_inuse_kb,duration_ms
select measure_topk - 6 ,4 ,9365 ,4
select measure_last - 8 ,4 ,9239 ,11

Jim_Knicely · September 2017

Hi!

I think there is something wrong with your data or your TOPK projection. The TOPK projection typically has less rows than a superprojection because of LIMIT clause. But you said that both the measure_topk and measure_last projections both have 97,831 records. Maybe in your data there is only one record for each partition (year_nbr, obj_id and obj_position_id)? Should the partition clause in the TOPK projection be just on the year_nbr and obj_id columns?

Example of a typical use case:

dbadmin=> select export_objects('','test');
                                                                                                                                                                                                                                   export_objects                                                                           
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

CREATE TABLE public.test
(
    c1 int,
    c2 int
);

CREATE PROJECTION public.test_super /*+basename(test),createtype(P)*/
(
 c1,
 c2
)
AS
 SELECT test.c1,
        test.c2
 FROM public.test
 ORDER BY test.c1
SEGMENTED BY hash(test.c1, test.c2) ALL NODES OFFSET 0;

CREATE PROJECTION public.test_topk1
(
 c1,
 c2
)
AS
 SELECT test.c1,
        test.c2
 FROM public.test
 LIMIT 1
 OVER (
 PARTITION BY  test.c1
 ORDER BY  test.c2)
;

SELECT MARK_DESIGN_KSAFE(0);
(1 row)

dbadmin=> select count(*) from public.test_super;
   count
-----------
 345292288
(1 row)

dbadmin=> select count(*) from public.test_topk1;
 count
-------
     2
(1 row)

See the difference? 2 records vs 345,292,288!

dbadmin=> \x
Expanded display is on.

dbadmin=> select c1, c2, count(1) from test_super group by c1, c2;
 c1 | c2 |  count
----+----+----------
  0 |  5 | 17262123
  1 |  2 | 17272385
  1 |  8 | 17260787
  0 |  0 | 17257688
  0 |  6 | 17265756
  1 |  3 | 17268274
  1 |  9 | 17270888
  0 |  1 | 17258749
  0 |  7 | 17263256
  1 |  4 | 17262658
  0 |  2 | 17268376
  0 |  8 | 17263304
  1 |  5 | 17259300
  0 |  3 | 17269025
  0 |  9 | 17264743
  1 |  0 | 17268420
  1 |  6 | 17265067
  0 |  4 | 17257533
  1 |  1 | 17268451
  1 |  7 | 17265505
(20 rows)

Time: First fetch (20 rows): 5218.275 ms. All rows formatted: 5218.354 ms

dbadmin=> select c1, c2, count(1) from test_topk1 group by c1, c2;
 c1 | c2 | count
----+----+-------
  0 |  0 |     1
  1 |  0 |     1
(2 rows)

Time: First fetch (2 rows): 25.100 ms. All rows formatted: 25.149 ms

See in my case, the TOPK is << 1 second...

vcarusi · October 2017

hi,
I understand your point of view, but is not my case.
My table and topk projection have a similar number of records. Not the same, but similar.

I have a table with 1300 columns that stores data for 1290 sensors every 10 minutes, 10 columns being used to store location and time for every measurement.
From time to time a correction is received for a certain location/time. I need to consider only the last data/correction stored.

For the same combination of columns : (year_nbr, obj_id, obj_position_id), I can have 1 or more records, with different 'created' values. Most of them have only 1 record. I am interested in seeing only the records with maximum 'created' value, the last inserted record for a certain (year_nbr, obj_id, obj_position_id). For this reason I thought of using topk projection.

In a real situation :
select count(1) from measure_last;
50.000.000

select count(1) from measure_topk;
45.000.000

Do you have another idea how to have only the last created records and statistics, also ?

thank you!

Jim_Knicely · October 2017

Hi,

How often do you need to query the data that is NOT the last created records and statistics? One option could be to have a measure_hist table where you stick all of the old values. Then in the measure table, you only keep the latest. So when you do an UPDATE, you'd simply insert the current data in measure into measure_hist, then update the record in measure.

Or instead of a TOPK projection, maybe you can create a better projection for a LIMIT 1 query? Something like this?

create projection measure_pr (
year_nbr, 
obj_id , 
obj_position_id,
created,
date , 
month_nbr ,
day_nbr ,
hour_nbr ,
"offset" ,
deleted ,
measure_1 ,
...
measure_1290
)
as
select 
year_nbr, 
obj_id , 
obj_position_id,
created,
date , 
month_nbr ,
day_nbr ,
hour_nbr ,
"offset" ,
deleted ,
measure_1 ,
...
measure_1290
from measure
order by year_nbr, obj_id, obj_position_id, created
segmented by hash(year_nbr, obj_id, obj_position_id) all nodes;

So a query to get the last created records and statistics might look like this:

select measure.obj_id, measure.year_nbr cnt, measure_1, measure_1290
from measure
where
measure .year_nbr = 2016 
and measure .obj_id = 2 
and measure.obj_position_id >= 220160101000000 
and measure.obj_position_id < 220170101000000
limit 1 over (partition by year_nbr, obj_id, obj_position_id order by created desc);

vcarusi · October 2017

it's a good solution. thank you!
I intend to use a flex table, because it allows me to use more columns, the forecast is to have 5000 columns. Flex tables can't be updated.

ERROR: Unsupported access to flex table: No UPDATE support

TomM · October 2017

Hi. I've been following your posts, and are you able to comment on the use case for the sensor data? What is it gathering every 10 minutes and what do you do with it? Why 1,300 columns, projected to go to 5,000 columns?

Jim_Knicely · October 2017

Fyi ... all tables in Vertica are limited to 1600 columns. That includes FLEX tables.

Example:

dbadmin=> \! head -5 /home/dbadmin/jim.txt
create FLEX table super_wide_flex_table (
 col_1  int,
 col_2  int,
 col_3  int,
 col_4  int,

dbadmin=> \! tail -5 /home/dbadmin/jim.txt
 col_1596  int,
 col_1597  int,
 col_1598  int,
 col_1599  int,
 col_1600  int); -- too many columns!

dbadmin=> \i /home/dbadmin/jim.txt
vsql:/home/dbadmin/jim.txt:1601: ROLLBACK 2106:  A table/projection/view can only have up to 1600 columns -- this create statement has 1601

And yes, you cannot UPDATE a flex table, although you can INSERT:

See:
https://my.vertica.com/docs/8.1.x/HTML/index.htm#Authoring/FlexTables/InsertingDataIntoFlexTables.htm

Note you can also DELETE from a FLEX table...

Example:

dbadmin=> create flex table jim(c1 int);
CREATE TABLE

dbadmin=> insert into jim select 1;
 OUTPUT
--------
      1
(1 row)

dbadmin=> update jim set c1 = 2;
ERROR 6092:  Unsupported access to flex table: No UPDATE support

dbadmin=> select c1 from jim;
 c1
----
  1
(1 row)

dbadmin=> delete from jim where c1 = 1;
 OUTPUT
--------
      1
(1 row)

dbadmin=> select c1 from jim;
 c1
----
(0 rows)

vcarusi · October 2017

Hi Jim,
Thank you for your response.
I created a flex table with 10.000 columns, loaded data (10.000 rows) and queried it.
All columns were virtual. Then, part of them were materialized. A topk projection could be created over it.
I thought of only inserting data in this flex table and with a topk projection to have fast access to the last record for every (year_nbr, obj_id, obj_position_id).
But, without statistics, the topk responded 6-20 times slower than the flex table, and 20-40 times slower than a columnar table (limited to 1300 columns).

vcarusi · October 2017

Hi Tom,
Vertica works fine with sensor data, due to its capability to have up to 1600 columns.
After storage, data is used for different kind of reports/forecasts. We have so much columns because there is a wide variety of sensors.
It works/responds very well if only few columns are retrieved. It is easy to install, upgrade. Cluster import/export works fine. Loading of data (COPY) is fast.
What I would like Vertica to have:

to consume less memory,
projection with 'where' conditions. It is not enough to have selection at column level, some times row level selection is necessary,
statistics for any kind of projection,
backup compressed/binary,
export/import to/from s3 works only for small tables.
Anyway, thank you for your question.

Jim_Knicely · October 2017

vcarusi - I was thinking about your issue again. Another option I was contemplating is to add a boolean column (maybe named display_this_row) and set it equal to TRUE for the rows you want to return in your query (i.e. last created records and statistics). That what you can add a simple predicate to your queries and only display those records...

Like this:

select 
year_nbr, 
obj_id , 
obj_position_id,
created,
date , 
month_nbr ,
day_nbr ,
hour_nbr ,
"offset" ,
deleted ,
measure_1 ,
...
measure_1290
from measure
where display_this_row;

I would put the display_this_row column first in a projection's ORDER BY clause and would also partition the measure table by display_this_row. By putting that partition on there, Vertica won't even have to look at the storage container containing the data where display_this_row = FALSE! Although, I guess if you have the partition, there would be no need to put the display_this_row column in the ORDER BY clause of the projection

vcarusi · October 2017

Hi Jim,
thank you for your involvement. In fact, this is our current approach.
Anyway, the limitation to 1600 columns remains.
And, it had been useful if Vertica could have had a projection with 'where' condition.
Starting for our business is more useful to have partitions defined on (year_nbr , obj_id).
Thank you, Jim!

We're Moving!

Create My New Community Account Now

statistics on topk projection

Comments

Leave a Comment