statistics on topk projection
Hi,
I have a question regarding collecting of statistics.
It's possible to collect them on a topk projection? how?
I am using 8.0 version.
Tried with:
select analyze_histogram(projection_name,100);
select analyze_statistics(projection_name);
Even if the operations completed with 0, with success, the projections' columns appear having statistics_type = ROWCOUNT , and the projection having has_statistics = false.
My problem is related to the velocity of querying the projection. It seems querying the topk projection the response time is 10 times greater then querying the table itself. The topk projection is order by appropriated columns.
The explain plan shows a lower cost (1000 times lower) when the projection is used, but the reality is different.
Thank you,
Veronica
Comments
My experience is statistics are one of the keys to optimum query performance. This is the first time I've heard of issues around updating them for TopK. I'll run this by some technical people I'm friendly with and see what we can find out.
Thank you!
Fyi... From the docs:
Restrictions (of ANALYZE_STATISTICS) - Vertica collects no statistics on live aggregate and Top-K projections that are anchored to the specified table.
See:
https://my.vertica.com/docs/8.0.x/HTML/index.htm#Authoring/SQLReferenceManual/Functions/VerticaFunctions/ANALYZE_STATISTICS.htm
yes, I read the documentation.
anyway, no error is thrown if I run
select analyze_histogram(projection_name,100);
select analyze_statistics(projection_name);
Question: how to collect statistics for a certain projection ?
You can only run ANALYZE_STATISTICS against the entire database or a particular schema, table or column; it is not run against a projection.
Hi,
Here is an example in 8.1.1:
The only projection I can not get stats on is the aggregate projection ... Just like the doc says.
Fyi ... You don't get an error when you run analyze statistics on a Live Aggregate Projection because it does gather a row count stat...
From my example above:
Thank you very much,Jim!
All of your explanations I discovered myself.
So, your response to my question:
It's possible to collect them on a topk projection? how?
is : NO, isn't possible to have statistics for a topk projection . True?
How fast does topk projections respond without statistics ?
I will try to come back with example.
I have a table with 1300 columns that stores data for 1290 sensors every 10 minutes, 10 columns being used to store location and time for every measurement.
From time to time a correction is received for a certain location/time. I need to consider only the last data/correction stored. For this reason I thought a topk projection can help.
CREATE TABLE measure
(
obj_position_id int NOT NULL,
obj_id int NOT NULL,
date timestamp NOT NULL,
year_nbr int NOT NULL,
month_nbr int NOT NULL ,
day_nbr int NOT NULL,
hour_nbr int NOT NULL ,
"offset" numeric(4,2) NOT NULL,
deleted int NOT NULL,
created timestamp NOT NULL,
measure_1 numeric(8,5),
...............
measure_1290 numeric(8,5)
)
order by obj_position_id, created;
create projection measure_last (
grouped(
year_nbr,
obj_id ,
obj_position_id,
created,
date ,
month_nbr ,
day_nbr ,
hour_nbr ,
"offset" ,
deleted
),
measure_1 ,
...............
measure_1290 ,
)
select
year_nbr,
obj_id ,
obj_position_id,
created,
date ,
month_nbr ,
day_nbr ,
hour_nbr ,
"offset" ,
deleted
measure_1 ,
...............
measure_1290 ,
from measure
order by year_nbr, obj_id, obj_position_id, created;
create projection measure_topk (
grouped(
year_nbr,
obj_id ,
obj_position_id,
created,
date ,
month_nbr ,
day_nbr ,
hour_nbr ,
"offset" ,
deleted
),
measure_1 ,
...............
measure_1290 ,
)
select
year_nbr,
obj_id ,
obj_position_id,
created,
date ,
month_nbr ,
day_nbr ,
hour_nbr ,
"offset" ,
deleted
measure_1 ,
...............
measure_1290 ,
from measure
limit 1 over (partition by year_nbr, obj_id, obj_position_id order by created desc);
select measure.obj_id, measure.year_nbr, count(1) cnt
from measure_last measure
where
measure.year_nbr = 2016
and measure.obj_id = 2
and measure.obj_position_id >= 220160101000000
and measure.obj_position_id < 220170101000000
group by measure.year_nbr, measure.obj_id;
-----response_time = 1sec
select measure.obj_id, measure.year_nbr, count(1) cnt
from measure_topk measure
where
measure.year_nbr = 2016
and measure.obj_id = 2
and measure.obj_position_id >= 220160101000000
and measure.obj_position_id < 220170101000000
group by measure.year_nbr, measure.obj_id;
-----response_time = 10sec
-- select on normal projection
select measure_last .obj_id, measure_last .year_nbr, count(1) cnt
from measure_last
where
measure_last .year_nbr = 2016
and measure_last .obj_id = 2
and measure_last .obj_position_id >= 220160101000000
and measure_last .obj_position_id < 220170101000000
group by measure_last .year_nbr, measure_last .obj_id;
-- response_time 1 sec
Access Path:
+-GROUPBY PIPELINED [Cost: 617, Rows: 1] (PATH ID: 1)
| Aggregates: count(1)
| Group By: measure_last.year_nbr, measure_last.obj_id
| +---> STORAGE ACCESS for measure_last [Cost: 616, Rows: 641] (PATH ID: 2)
| | Projection: measure_last
| | Materialize: measure_last.year_nbr, measure_last.obj_id
| | Filter: (measure_last.year_nbr = 2016)
| | Filter: (measure_last.obj_id= 2)
| | Filter: ((measure_last.obj_position_id >= 220160101000000) AND (measure_last.obj_position_id < 220170101000000))
--select on topk projection
select measure.obj_id, measure.year_nbr, count(1) cnt
from measure_topk measure
where
measure.year_nbr = 2016
and measure.obj_id= 2
and measure.obj_position_id >= 220160101000000
and measure.obj_position_id < 220170101000000
group by measure.year_nbr, measure.obj_id;
--response time - 10 sec
Access Path:
+-GROUPBY PIPELINED [Cost: 423, Rows: 500 (NO STATISTICS)] (PATH ID: 1)
| Aggregates: count(1)
| Group By: measure.year_nbr, measure.obj_id
| +---> STORAGE ACCESS for measure[Cost: 384, Rows: 98K (NO STATISTICS)] (PATH ID: 2)
| | Projection: measure_topk
| | Materialize: measure.year_nbr, measure.obj_id
| | Filter: (measure.year_nbr = 2016)
| | Filter: (measure.obj_id= 2)
| | Filter: ((measure.obj_position_id>= 220160101000000) AND (measure.obj_position_id< 220170101000000))
select count(1) from measure_last;
97831
select count(1) from measure_topk;
97831
Even if the cost for topk projection is lower the response time is greater.
The only difference that I could see is
STORAGE ACCESS for measure[Cost: 384, Rows: 98K (NO STATISTICS)] (PATH ID: 2)
Question:
v_monitor.query_requests/
--------------------request_duration_ms
select measure_topk - 7941
select measure_last - 68
v_monitor.resource_acquisitions
-----------------------open_file_handle_count,thread_count,memory_inuse_kb,duration_ms
select measure_topk - 6 ,4 ,9365 ,4
select measure_last - 8 ,4 ,9239 ,11
Hi!
I think there is something wrong with your data or your TOPK projection. The TOPK projection typically has less rows than a superprojection because of LIMIT clause. But you said that both the measure_topk and measure_last projections both have 97,831 records. Maybe in your data there is only one record for each partition (year_nbr, obj_id and obj_position_id)? Should the partition clause in the TOPK projection be just on the year_nbr and obj_id columns?
Example of a typical use case:
See the difference? 2 records vs 345,292,288!
See in my case, the TOPK is << 1 second...
hi,
I understand your point of view, but is not my case.
My table and topk projection have a similar number of records. Not the same, but similar.
I have a table with 1300 columns that stores data for 1290 sensors every 10 minutes, 10 columns being used to store location and time for every measurement.
From time to time a correction is received for a certain location/time. I need to consider only the last data/correction stored.
For the same combination of columns : (year_nbr, obj_id, obj_position_id), I can have 1 or more records, with different 'created' values. Most of them have only 1 record. I am interested in seeing only the records with maximum 'created' value, the last inserted record for a certain (year_nbr, obj_id, obj_position_id). For this reason I thought of using topk projection.
In a real situation :
select count(1) from measure_last;
50.000.000
select count(1) from measure_topk;
45.000.000
Do you have another idea how to have only the last created records and statistics, also ?
thank you!
Hi,
How often do you need to query the data that is NOT the last created records and statistics? One option could be to have a measure_hist table where you stick all of the old values. Then in the measure table, you only keep the latest. So when you do an UPDATE, you'd simply insert the current data in measure into measure_hist, then update the record in measure.
Or instead of a TOPK projection, maybe you can create a better projection for a LIMIT 1 query? Something like this?
So a query to get the last created records and statistics might look like this:
it's a good solution. thank you!
I intend to use a flex table, because it allows me to use more columns, the forecast is to have 5000 columns. Flex tables can't be updated.
ERROR: Unsupported access to flex table: No UPDATE support
Hi. I've been following your posts, and are you able to comment on the use case for the sensor data? What is it gathering every 10 minutes and what do you do with it? Why 1,300 columns, projected to go to 5,000 columns?
Fyi ... all tables in Vertica are limited to 1600 columns. That includes FLEX tables.
Example:
And yes, you cannot UPDATE a flex table, although you can INSERT:
See:
https://my.vertica.com/docs/8.1.x/HTML/index.htm#Authoring/FlexTables/InsertingDataIntoFlexTables.htm
Note you can also DELETE from a FLEX table...
Example:
Hi Jim,
Thank you for your response.
I created a flex table with 10.000 columns, loaded data (10.000 rows) and queried it.
All columns were virtual. Then, part of them were materialized. A topk projection could be created over it.
I thought of only inserting data in this flex table and with a topk projection to have fast access to the last record for every (year_nbr, obj_id, obj_position_id).
But, without statistics, the topk responded 6-20 times slower than the flex table, and 20-40 times slower than a columnar table (limited to 1300 columns).
Hi Tom,
Vertica works fine with sensor data, due to its capability to have up to 1600 columns.
After storage, data is used for different kind of reports/forecasts. We have so much columns because there is a wide variety of sensors.
It works/responds very well if only few columns are retrieved. It is easy to install, upgrade. Cluster import/export works fine. Loading of data (COPY) is fast.
What I would like Vertica to have:
Anyway, thank you for your question.
vcarusi - I was thinking about your issue again. Another option I was contemplating is to add a boolean column (maybe named display_this_row) and set it equal to TRUE for the rows you want to return in your query (i.e. last created records and statistics). That what you can add a simple predicate to your queries and only display those records...
Like this:
I would put the display_this_row column first in a projection's ORDER BY clause and would also partition the measure table by display_this_row. By putting that partition on there, Vertica won't even have to look at the storage container containing the data where display_this_row = FALSE! Although, I guess if you have the partition, there would be no need to put the display_this_row column in the ORDER BY clause of the projection
Hi Jim,
thank you for your involvement. In fact, this is our current approach.
Anyway, the limitation to 1600 columns remains.
And, it had been useful if Vertica could have had a projection with 'where' condition.
Starting for our business is more useful to have partitions defined on (year_nbr , obj_id).
Thank you, Jim!