how the data is distributed to other nodes in a cluster?
when I load a large file into vertica in a node, how the data is distributed to the other nodes? Does it distribute to other nodes during the loading or in the commit, or in move out action. any document talk about the detail mechanism?
0
Comments
The data is distributed when you are loading data from a large file.
It all depends on your table segmentation clause.
Basically vertica uses hash segmentation by default,
Step in Loading mechanism
This complete mechanism takes place at time of loading.
Each hash id is unique for a value and this is how vertica ensure equal distribution of your data.
Hope this helps
now another question jump out.
For example, I have three nodes in a vertica cluster. There is a table T1 with three segments(S1/S2/S3). S1 locate in Node1 , S2 locate in Node2 and S3 locate in Node3. now I want to run the sql:
insert into T2 select * from T1;
case1: if the T2 has the same segment clause with T1, does it means all records in S1 will be inserted into T2 and still locate in Node1. I mean the data will not distributed to other nodes. and so on for S2and S3.
case2: if the T2 has different segment clause with T2(but cross all three nodes), so the records of S1 will be distributed to all three nodes. and so on for S2 and S3. Am I right?
I'm not sure if the case1 will have better performance than case2.
However in the above scenario we will be referring them as segments.
for Query All the data is first collected and then inserted into table.
Lets consider there are two tables table T1 and table T2
Table A has 5 columns which includes one primary key
Table B has 8 column which include one primary key
Both the Primary keys in table A and table B are same.
For Case 1
case1: if the T2 has the same segment clause with T1, does it means all records in S1 will be inserted into T2 and still locate in Node1. I mean the data will not distributed to other nodes. and so on for S2 and S3.
If T2 has same segmentation clause as T1 then the data will be distributedin same manner as it is for T1.
As both the tables are segmented on same primary key column, their hash ID are unique and same for similar values in both tables so they will segmented identically.
For Case 2
case2: if the T2 has different segment clause with T2(but cross all three nodes), so the records of S1 will be distributed to all three nodes. and so on for S2 and S3. Am I right?
I'm not sure if the case1 will have better performance than case2.
Yes if segmentation clause is different for T2 then the data from s1/s2/s3 will be again distributed according to their hash ID generated.
For performance:
If you want a good loading performance with INSERT INTO SELECT clause, I would say go with case 2.
In any case, there is always a replica of your projection for tables which identically segmented. Those projections are called ISP (identically segmented projections), but they are on offset nodes.
Hope this helps