
Primary Keys, Segmentation and Query Performance

Hi Team,

We are trying to understand whether defining keys on a table will improve query performance. We understand that Vertica automatically segments a table depending on the source table, table structure, or data.

We have two tables (A and B), each with 70 M records, and we are executing the query below against them.

SELECT * FROM A JOIN B
  ON A.ky = B.ky
WHERE A.col = 'xyz';

 

Approach 1: Created tables without keys.

1. Query completed in 53 seconds.

Approach 2: Created tables with primary keys on both tables (PK on table A / PK on table B).

1. Without ANALYZE_STATISTICS, query completed in 3 min.
2. After running ANALYZE_STATISTICS (invoked as shown below), query completed in 55 seconds.

Approach 3: Created tables with a primary key - foreign key relationship (PK-FK on table A / PK on table B).

1. Without ANALYZE_STATISTICS, query completed in 1 min 20 s.
2. After running ANALYZE_STATISTICS, query completed in 54 seconds.
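For reference, this is how we gather statistics (the syntax is standard Vertica; the table names are from our schema above):

SELECT ANALYZE_STATISTICS('A');
SELECT ANALYZE_STATISTICS('B');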

We didn't find any difference in query performance across the above approaches. Can you please help us understand how segmentation improves query performance and which is the best path to choose? Also, is there a system table that shows how projections are segmented?

Comments

  • Hi Veena,
    It is good to know and understand how projections are optimized (sort order, encoding, segmentation, and so on), but the standing recommendation is to use the DBD (Database Designer). Have you tried the DBD?

    Segmentation helps when you have big tables, by spreading the query execution workload across multiple nodes.
    Regarding primary keys: when segmentation is in use, you need a high-cardinality PK (a high number of unique values) for the hash segmentation to be used optimally, or you can use a virtual column with unique values as your segmentation key.
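    A minimal sketch of what hash segmentation on a high-cardinality key looks like in DDL (the table and column names here are hypothetical, not from this thread):

    -- sale_id is assumed unique (high cardinality), so HASH(sale_id) spreads rows evenly
    CREATE TABLE fact_sales (
        sale_id  INT NOT NULL,
        cust_id  INT,
        amount   NUMERIC(12,2)
    )
    SEGMENTED BY HASH(sale_id) ALL NODES;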
      
  • Hi Adrian,

    We have a star schema model where the primary keys are simply unique keys. But our dimension table also holds a huge amount of data. In this case, we didn't find any difference in query performance between defining a primary key on the table and having the table without any constraints.

    Can you please help us understand the best data modeling approach for handling huge data?

    Thanks and Regards,
    Veena
  • Do you have any previous experience with other RDBMS systems?

    Your question is quite hard to answer in a few lines.
    I recommend you read the online Vertica docs: https://my.vertica.com/docs/7.0.x/HTML/index.htm#Authoring/SQLReferenceManual/Statements/CREATEPROJE...
    - This will give you a better view of how Vertica uses other mechanisms (encoding, sorting, segmentation, MPP) to get fast query results.
    - Indexes and keys are not a must-have for performance in Vertica, unlike row-based database engines, which show poor performance when no such objects exist.
    Regarding your star schema: first make sure you create a comprehensive or query-specific design using the DBD tool, run ANALYZE_STATISTICS, check your histograms, and learn how to use PROFILE and EXPLAIN so that you can actively manage your queries. The Workload Analyzer is a must, and make sure you understand how deletes are treated in Vertica; many skip this and complain later about poor performance.
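    For example, plan inspection is just a prefix on the query itself (reusing the query from the original post):

    EXPLAIN SELECT * FROM A JOIN B ON A.ky = B.ky WHERE A.col = 'xyz';
    -- PROFILE actually executes the query and records per-operator execution data
    PROFILE SELECT * FROM A JOIN B ON A.ky = B.ky WHERE A.col = 'xyz';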
  • Hi Adrian,

    I know it might seem a little weird when I say "star schema and no keys". :) We have key columns in the fact and dimension tables which we use to join them, but we are not explicitly defining them as primary and foreign keys.

    I understand that Vertica's query performance is much better than Oracle's, and we also understand that Vertica doesn't need indexes since it is a columnar database. But my question is about constraints: do we need them or not? As we know, primary keys are unique identifiers of a row, and foreign keys are needed to maintain data integrity in an RDBMS. But Vertica does not check primary key / foreign key constraints on insert, so we are handling data integrity at the ETL level. I would like to know whether adding a primary key - foreign key constraint on a table will help query performance in any way.

    The Vertica manual states that when a primary key is defined on a table, the default projection created by Vertica will be sorted and segmented on the primary key. If so, will that help query performance? We did some initial analysis, which I mentioned in my first post, but we could not conclude anything from the results.
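    To illustrate the "no check on insert" point, a minimal hedged example (the table and values are hypothetical; this reflects behavior as of Vertica 7.x, where key constraints are declared but not enforced at load time):

    CREATE TABLE t (id INT PRIMARY KEY, val VARCHAR(10));
    INSERT INTO t VALUES (1, 'a');
    INSERT INTO t VALUES (1, 'b');       -- duplicate key, but the insert succeeds
    SELECT ANALYZE_CONSTRAINTS('t');     -- this is what reports the violation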

  • Once I ran a test and found that PK/FK constraints have their own impact:
    1. LOAD performance is reduced due to constraint validation (only about ~5%).
    2. Query performance improves by about ~6% (PK/FK can help the optimizer understand table dependencies when it has no query history to learn from, but actual queries help it more).

    So I don't see any major reason to define PK/FK constraints (also because you can define CORRELATIONS), but I always define PK/FK anyway, because it helps humans understand the schema (the dependencies between tables).
    https://my.vertica.com/docs/7.0.x/HTML/index.htm#Authoring/NewFeatures/7.0/7.0.0/AnalyzingCorrelated...

    PS
    I did a comparison: LOAD without PK/FK, then ALTER TABLE ... to define PK/FK afterwards. It doesn't give any improvement. There are no "golden rules" here; use whichever fits your ETL requirements: do you need constraint validation or not? It's all in your hands.
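    A sketch of the two orderings compared above (the names are hypothetical; the ALTER TABLE syntax is standard Vertica):

    -- Option A: declare keys up front, paying the validation cost during load
    CREATE TABLE dim_customer (cust_id INT PRIMARY KEY, name VARCHAR(100));

    -- Option B: load first, then declare the keys afterwards
    CREATE TABLE dim_customer2 (cust_id INT, name VARCHAR(100));
    -- ... COPY / INSERT the data here ...
    ALTER TABLE dim_customer2 ADD CONSTRAINT pk_cust PRIMARY KEY (cust_id);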


  • Navin_C (Vertica Customer)
    Hello Veena,

    If you only want to use primary keys for performance purposes, then I would say you can go ahead and put primary keys on your fact tables.
    As the manual says, this will ensure proper segmentation of the data, and thus proper retrieval of data by queries.
    But faster retrieval also depends on the query predicates, so if your query has a predicate on your primary key, that is an added advantage.
    Ideally it is recommended to segment on a column with high cardinality; if you are sure that the primary key column will have high cardinality and few duplicates, then you can put primary keys on your tables.
    One disadvantage:
    When you query tables that have a PK-FK relationship, the constraints are checked by Vertica, and the query will fail if the primary key has duplicates or there is a referential integrity issue between the tables.
  • I think Daniel and Navin gave you an idea about PK and FK use for performance.
    When you use a high-cardinality PK for segmented tables, Vertica will use that PK to distribute the data across all nodes in an optimal way, avoiding high data skew.
    When you don't have a PK that fits this description, it is wise to create your own "virtual column" PK; this way Vertica will achieve a better data distribution across all nodes.
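    On the earlier question about a system table showing segmentation: one place to look, assuming the 7.x catalog layout, is v_catalog.projections (the column names below are my recollection and worth verifying against your version):

    SELECT projection_name, is_segmented, segment_expression
    FROM v_catalog.projections
    WHERE anchor_table_name = 'A';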
  • Thanks for the responses. We will do some more analysis and will post here if we reach any conclusions.
