Group by on High Cardinal dimension - Projection Tuning- Vertica

davidvilla · April 2016

I have a scenario where the group by is done on high cardinal dimension value field which has nearly 30Billion distinct values and the data type of this field is varchar. This query for a long time and I need to tune this query by creating projections in vertica. Could anyone provide me some ideas on it.

SruthiA · April 2016

Hi,

You can tune the query using database designer. Please find the below URL's for more information

https://my.vertica.com/docs/7.2.x/HTML/index.htm#Authoring/AnalyzingData/Optimizations/FirstStepsForImprovingQueryPerformance.htm?Highlight=optimize query

https://my.vertica.com/docs/7.2.x/HTML/index.htm#Authoring/GettingStartedGuide/UsingDatabaseDeisgner/RunningDBDwithAT.htm?Highlight=database designer

Sruthi

davidvilla · April 2016

I am sorry, I didnt mention that I dont have DBD to design my projection and I am creating custom projections to tune the query. I have those information that were providied as link above, but still it didnt help me.

HerbC · April 2016

If there is any commonality in the content of the distinct data, you could create what I would call a "synethic" lower cardinality column from the high cardinality data that will help to do an initial grouping.

For example, if the data is IP addresses for many servers, you could parse out the pieces of the IP address into separate columns, thereby "synthetically" creating lower cardinality columns from the high cardinality data.

Another example would be an office number. If the number is "12345678" and the first two digits are country code, the second are region code, etc. you could use those parts of the data in the high cardinality column to create two "synthetic" lower cardinality columns with which to order your projection.

You could even not change the table columns and use a live aggregate projection to create a new projection that has parses the data from the high cardinality column into lower cardinality columns by which you could sort the columns.

https://my.vertica.com/docs/7.2.x/HTML/index.htm#Authoring/AnalyzingData/AggregatedData/ExpressionsAggregateData.htm%3FTocPath%3DAnalyzing%2520Data%7CData%2520Aggregation%7CPre-Aggregating%2520Data%2520in%2520Projections%7CAggregating%2520Data%2520Through%2520Expressions%7C_____0

https://my.vertica.com/docs/7.2.x/HTML/index.htm#Authoring/AnalyzingData/AggregatedData/ExpressionsAggregateDataExample.htm%3FTocPath%3DAnalyzing%2520Data%7CData%2520Aggregation%7CPre-Aggregating%2520Data%2520in%2520Projections%7CAggregating%2520Data%2520Through%2520Expressions%7C_____1

Hopefully this gives you some type of an idea for an approach.

davidvilla wrote:
I have a scenario where the group by is done on high cardinal dimension value field which has nearly 30Billion distinct values and the data type of this field is varchar. This query for a long time and I need to tune this query by creating projections in vertica. Could anyone provide me some ideas on it.

davidvilla · April 2016

Hi Herb,

Thanks for the solution. I am using vertica 7.1v.

Also the high cardinal column doesnt have any particular pattern in it. It has numbers, characters, and alphanumeric values and the group by is done on this column.

example values

12

3736183

93836718393

NULL_STRING

pa.9d9jd8dddj3k39d9g9s9ec776

d7adf8jd75efc

Please suggest me a solution.

HerbC · April 2016

Are there any other lower cardinality columns in the table that have a correlation to the high cardinality data column? If so, it might be worth testing a column order with that high correlation column first in the sort order and put into your query. This might help to chunk the super high cardinality data in a way that makes it quicker to group and aggregate. Other than that, all I can think of is to create a user-defined resource pool with lots of reserved memory and run the query through that. With this super high cardinality in the group by column, it might just be a matter of brute force.

FiliN · May 2016

DId you try to create projection segmented by your high cardinality (hc) column. You should use mpp principle to run calculation on every node in your cluster.

Take the columns you need for that query only, segment projection by your hc column, apply encoding and order by hc column.

davidvilla · May 2016

Hi,

Unfortunately the value in this column cannot be correlated with any other and it is not derived field too. Thanks for your reply.

-Vinoth

davidvilla · May 2016

Hi,

I have tried segmentation based on HC column in my projection and made it in top of my order my clause, the result was the projection referesh happens for ever like it took more than 20 hours so I stopped it. Thanks

We're Moving!

Create My New Community Account Now

Group by on High Cardinal dimension - Projection Tuning- Vertica

Comments

Leave a Comment