How to optimize query
For the query below I am seeing a GROUP BY spill, which is causing very slow execution. How can I optimize it?
SELECT "TABLE1"."c1" AS "c1", "TABLE1"."c1_d" AS "c1_d", "TABLE1"."c2" AS "c2", "TABLE1"."c3" AS "c3",
       "TABLE1"."c4" AS "c4", "TABLE1"."b1" AS "b1", "TABLE1"."b1_t" AS "b1_t", "TABLE2"."c5" AS "c5",
       "TABLE1"."c6" AS "c6", "TABLE2"."c7" AS "c7", "TABLE1"."c8" AS "c8", "TABLE1"."c9" AS "c9",
       "TABLE1"."c10" AS "c10", "TABLE1"."c10_d" AS "c10_d", "TABLE1"."c11" AS "c11", "TABLE1"."c11_d" AS "c11_d",
       "TABLE2"."c12" AS "c2", "TABLE1"."c13" AS "c13", "TABLE2"."c14" AS "c14", "TABLE1"."c15" AS "c15",
       "TABLE2"."c16" AS "c16", "TABLE1"."id" AS "id", "TABLE1"."b2" AS "b2", "TABLE1"."b2_t" AS "b2_t",
       "TABLE1"."c17" AS "c17", "TABLE1"."b3" AS "b3", "TABLE1"."b3_t" AS "b3_t", "TABLE1"."last_b1" AS "last_b1",
       "TABLE1"."last_b1_t" AS "last_b1_t", "TABLE1"."c18" AS "c18", "TABLE2"."c19" AS "c19", "TABLE1"."c20" AS "c20",
       "TABLE2"."c21" AS "c21", "TABLE1"."c22" AS "c22", "TABLE2"."c23" AS "c23", "TABLE1"."b4" AS "b4",
       "TABLE1"."b4_t" AS "b4_t", "TABLE2"."c25" AS "c25", "TABLE1"."c26" AS "c26", "TABLE2"."c27" AS "c27",
       "TABLE2"."c28" AS "c28", "TABLE2"."c29" AS "c29", "TABLE2"."c30" AS "c30", "TABLE2"."c31" AS "c31",
       "TABLE1"."c32" AS "c32", avg("TABLE1"."price") AS "sum_price", "TABLE1"."c33" AS "c33"
FROM "SCHEMA"."TABLE1" "TABLE1"
LEFT JOIN "SCHEMA"."TABLE2" "TABLE2" ON ("TABLE1"."id" = "TABLE2"."id")
GROUP BY 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47
Be sure statistics are up to date - the explain plan will indicate this. After that, the quickest way is to copy this query to a file and run it through Database Designer (DBD) from admintools or Management Console. You can run an incremental design on just the relevant schema(s) and table(s) using the query file, then analyze statistics and build the design. For an incremental design, I usually don't deploy automatically; instead I download and review the proposed projections to verify they seem reasonable. In particular, you'll want TABLE1 and TABLE2 segmented by id and sorted by id (segmentation and sorting help ensure that the IDs to be joined are on the same node), and grouped by the keys listed here for best performance.
To start with:
If you have to GROUP BY 46 columns, I would really challenge your database design - or, at least, the way your query is formulated.
If, say, columns 5 through 46 depend directly on columns 1, 2, 3 and 4, then by all means create a subselect with columns 1, 2, 3, 4 and AVG(price), GROUPing by 1, 2, 3, 4, and then join that subselect back to the previous table(s) to retrieve columns 5 through 46.
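A sketch of that rewrite, with hypothetical grouping keys k1 through k4 and dependent columns c5, c6 (these names are invented; adapt them to the real schema, and note this only works if the dependent columns really are functionally determined by the keys):

```sql
-- Aggregate over the few driving keys only, then join the small result
-- back to the base table to pick up the dependent columns.
SELECT t.k1, t.k2, t.k3, t.k4, t.c5, t.c6, a.avg_price
FROM (
    SELECT k1, k2, k3, k4, AVG(price) AS avg_price
    FROM   "SCHEMA"."TABLE1"
    GROUP  BY k1, k2, k3, k4          -- 4 keys instead of 46 columns
) a
JOIN "SCHEMA"."TABLE1" t
  ON  t.k1 = a.k1 AND t.k2 = a.k2
  AND t.k3 = a.k3 AND t.k4 = a.k4;
```

The GROUP BY hash table now only has to hold the 4 key columns plus one aggregate, instead of 46 wide columns per group.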
And - as soon as you have long strings in the columns you GROUP BY, you risk GROUP BY spilling, too. Don't keep a VARCHAR(256) in your table definition if your longest string is only 32 bytes, for example.
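A quick way to check whether a column is over-declared (the table and column names here are hypothetical):

```sql
-- How wide is the data really?
SELECT MAX(OCTET_LENGTH(c13)) FROM "SCHEMA"."TABLE1";

-- If the answer is, say, 32, declare the column as VARCHAR(32)
-- instead of VARCHAR(256) in the DDL. Recent Vertica versions may
-- also let you narrow it in place with
--   ALTER TABLE ... ALTER COLUMN ... SET DATA TYPE VARCHAR(32);
-- check the documentation for your version before relying on that.
```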
Happy playing -
Marco
@marcothesane : Thanks for the info. Can you please tell me under what conditions we run into a GROUP BY spill? Earlier I thought that keeping many columns in the GROUP BY caused it, but it happens even with a single GROUP BY column. And how does a GROUP BY spill hurt query performance? What are the ways to avoid it?
# start_lecture #
Vertica projections are always sorted. If you specify nothing when creating the table, the sort order is the first 8 columns of the table in the newest Vertica versions (the first 32 columns in older versions), or the primary key if you have one. But you can sort by whatever column or group of columns you choose.
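For example, the sort order of the auto-created superprojection can be set directly in the CREATE TABLE statement (a sketch; the table and column names are invented):

```sql
-- Explicit sort order for the table's superprojection
CREATE TABLE sales_fact (
    prod_id INT NOT NULL,
    sale_ts TIMESTAMP,
    price   NUMERIC(12,2)
)
ORDER BY prod_id, sale_ts;
```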
In a query like this one:
If sales_fact is ordered by something other than prod_id, the scan of the table will proceed along this other sort order. You read prod ID 1, then another prod ID, and, until the very end, you must be prepared to read yet another prod ID of 1. So you need to keep intermediate sum results for each different prod ID encountered. The DBMS creates a hash table to accommodate a bucket for each different prod ID it encounters. If you have millions of different prod IDs, the hash table won't fit into memory and it will spill to disk, even if the key is just an integer. With columns that require much more than the 8 bytes of an integer, like VARCHARs, it gets much worse. Vertica can't tell in advance what the effective maximum length of the data in a VARCHAR(2000) will be. Even if the longest existing string is 32 bytes, Vertica has to pre-reserve 2000 bytes for each column entry in the hash table, because in theory a value could get that long. And while 32 bytes * 1 million rows might fit into memory, 2000 bytes * 1 million rows probably won't, and you will spill to disk. To avoid that, reduce the length definition of VARCHAR columns in the CREATE TABLE; or, better, GROUP BY the HASH() of a VARCHAR, and transform the HASH() value back to the string using a previously built lookup table.
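The HASH() trick could look roughly like this (a sketch with invented table and column names; HASH() is Vertica's built-in 64-bit hash function, and this ignores the very small risk of hash collisions):

```sql
-- One-off lookup table mapping hash values back to the original strings
CREATE TABLE city_lookup AS
SELECT DISTINCT HASH(city_name) AS city_hash, city_name
FROM   sales_fact;

-- Group by the fixed-width 8-byte hash instead of the wide VARCHAR,
-- then join the lookup table to translate the hash back to the string.
SELECT l.city_name, s.total
FROM (
    SELECT HASH(city_name) AS city_hash, SUM(price) AS total
    FROM   sales_fact
    GROUP  BY HASH(city_name)
) s
JOIN city_lookup l USING (city_hash);
```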
And you won't need any memory at all for a GROUP BY if you order a projection of sales_fact by prod_id first. In that case, when you read a bunch of rows with prod_id 1 and then encounter a prod_id of 2, you can be sure that you won't encounter any prod_id of 1 ever again in this query, and the sum you have now is final.
# stop_lecture #
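Such a projection could be declared like this (a sketch, assuming the sales_fact table from the example above):

```sql
-- Projection sorted (and segmented) by the GROUP BY key,
-- which enables GROUP BY PIPELINED instead of GROUP BY HASH
CREATE PROJECTION sales_fact_by_prod AS
SELECT prod_id, price
FROM   sales_fact
ORDER  BY prod_id
SEGMENTED BY HASH(prod_id) ALL NODES;
```

After creating it, refresh the projection (for example with SELECT REFRESH('sales_fact');) so it is populated and the optimizer can pick it up.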
Cheers -
I created table like this:
CREATE TABLE T
(
p_id varchar(5) NOT NULL,
t_id varchar(5) NOT NULL,
a_date date NOT NULL
)
and executing below query:
but I am still getting a GROUP BY spill (the table has only 3 rows).
Below is the digraph of the above query:
3[label = "GroupByHash(Spill): 1 keys\nAggs:\nUnc: Date(8)", color = "green", shape = "box"];
@michaeljack - A properly ordered projection can avoid the GroupByHash(Spill).
See:
https://www.vertica.com/docs/9.2.x/HTML/Content/Authoring/AnalyzingData/SQLAnalytics/AvoidingGROUPBYHASHWithProjectionDesign.htm
Example:
You only get that string:
GroupByHash(Spill)
in the explain phase. EXPLAIN shows what might happen, not what actually happened. You probably did not run
SELECT ANALYZE_STATISTICS('t');
, so the optimizer does not know whether it is a three-billion-row or a three-row table. It prepares for a GROUP BY spill; it does not necessarily perform one. Try this one here below - it does exactly what you are stating above.
In my example, it goes into GROUP BY PIPELINED.
Can you build a similar scenario for us in which you do run into GROUP BY spilled?
@marcothesane : If I create a projection with a sort order like
order by col1,col2,col3
what does that translate to from a storage perspective?
Does it mean that col1, col2 and col3 are each stored independently, sorted? Is there any co-location of rows based on this column ordering?
What do you mean in detail?
An additional projection, next to the one you already have, will of course consume disk space - in the same order of magnitude (a bit less, or a bit more) as the projection that already exists.
And Vertica is column oriented - it's the first real columnar database. So, yes, col1, col2 and col3 are stored independently - each column is basically a file by itself.
And co-location happens by segmentation, not by sorting, and is used for joins, not for grouping.
If you order a table's projection by the columns you use in the GROUP BY clause, you will never spill to disk:
If you GROUP BY customer_id, and the table is ORDER BY customer_id (and maybe other columns after that), all rows with customer_id 1 will be together. As soon as you reach a customer_id of 2, you can be sure that no customer_id of 1 will follow, and you can pass on the grouping result for customer_id 1 without having to store it. So you don't need any memory for the GROUP BY at all, and you cannot possibly spill to disk.
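One way to verify this (a sketch; the table and column names are invented) is to EXPLAIN the query and check whether the plan shows GROUPBY PIPELINED rather than GROUPBY HASH:

```sql
-- With a projection ORDER BY customer_id, the explain plan
-- should contain a GROUPBY PIPELINED step for this query
EXPLAIN
SELECT customer_id, SUM(amount)
FROM   orders
GROUP  BY customer_id;
```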
Does this answer your question?
I mean, let's say I have the below table.
If I do ORDER BY p_id, col1, then in a classic RDBMS the data will be sorted first by p_id and then by col1.
Now my question is: will the same thing happen in Vertica if I put these two fields in the projection's ORDER BY clause, or will it sort both columns independently and store them independently, like below?
If you order like so:
, then, if you fire a query:
with no ORDER BY in the query, the result will be:
It behaves just like a classic relational database. A row will never be split apart.
Internally, there is a system of file based pointers to enable reading the file that constitutes one column, and jump to the matching instances of the other column(s) and re-assemble the tuple/row.
@marcothesane: Thank you so much for your help. I learned so many things from you about Vertica. It helped me a lot to optimize my query.
You're welcome. You get most out of the database if you know how it works - just as you can drive a motorbike with a sidecar better if you know how it works. And the more you get from Vertica, the more you like it ... :-]]