spark vertica connector architecture problem

Ben_Fishman · April 2016

hey

i noticed for something bad in how the connector works.

just for mention

i have big table on vertica , segmented by key on all nodes....

as your documentation said , spark rdd query data from vertica as how the table segmented and spark partition

is build on top of that.

but as i said, when i print each parition data after quering segmented table, i alaways get 1 partition with all the data from a random server and the rest contain 0 rows !

its looks like vertica connector merge all the data to random node instead of leave it disterbuted as the servers as documentation said.....

its really important for me to know if im doing something worng

waiting for respone

thanks

Ben Fishman.

[Deleted User] · April 2016

Hi Ben,

I've passed your message over to someone on the HPE team working on our Spark integration. Hopefully they will reach out soon.

Thanks,
Mark

HPE Vertica Product

[Deleted User] · April 2016

duplicate

EdwardM · April 2016

Hi Ben,

Which version of the connector are you using? Are you able to print out the queries issued by the connector?

Thank you,

Ed

Ben_Fishman · April 2016

Hey

Thanks for your quick reply

Im using vertica 7.2.1 and spark connector 0.2.2

Lets says i have table tbl (a varchar,b varchar,c varchar) segmented by c all nodes.

When i query select a,b,c from spark

On vertica logs written:

select a,b,c from tbl where( (0x00000000ffffffff & hash(c) >= 2863311532 and (0x00000000ffffffff & hash(c)) <= 3101920825 )

Each query written on vertica.log alot each time with diffrent hash range...

Waiting for respone

Ben.

EdwardM · April 2016

Hey Ben,

Thanks for reporting this.

I may know some possible causes. Can you try two things?

(1) Verify that you don't have skewed data. If you have all of your table data segmented by c and all of the c data are similar values, you can end up with all of the data being in one range. Therefore, you can try to run

select count(*) from (select a,b,c from tbl where( (0x00000000ffffffff & hash(c) >= 2863311532 and (0x00000000ffffffff & hash(c)) <= 3101920825 ) as foo;

for each of the ranges to see how many naturally fall into each hash range.

(2) Another issue (that we're aware of) is that the table name (when passed to the connector) can be case-sensitive. For example, if your table is named "Test", and you use "test" as your table name option, it can result in this behavior (no crash, but all data go to one partition). We will fix this is the next release. In the meantime, can you (if this is the case for you), provide the table name with an exact upper and lower case match?

Thanks,

Edward

Ben_Fishman · April 2016

Hey

(1) The data in the table is segmented almost equally so from that point , everything is good

(2) the names of the tables that i had queried in my code is in lower case. (schema too)

One thing that i can say is that i query the data from a view that i had created on vertica

(select * from where c in (select * from tmp)

* tmp table is overwrite each iteration with new keys

and the view query only relevant keys

I must say that before i used the view in my code. i had only query simple table and still vertica behavior was the same (1 partition with all the data)

Thanks

Ben

Jeff-L · April 2016

Hi Ben,

Edward and I discussed a bit, and cannot duplicate this error so we would like to ask you to please give a quick test if possible. Basically we create a very small table then confirm the distribution on Vertica. Then run the connector and check the same distribution on Spark partitions. Depending on your number of vertica nodes, please enter at least that many unique values.

// lets say you have 4 Vertica nodes...

Create table create table t (a int);

insert into t values ('1');

insert into t values ('2');

insert into t values ('3');

insert into t values ('4');

commit;

// verify the distibution of rows on each Vertica node

select s.node_name,s.schema_name,s.projection_name,s.storage_type,s.total_row_count from v_monitor.storage_containers s where s.projection_name='t_b0';

Please execute the connector with 4 partitions (again assuming you have 4 Vertica nodes) and see if the distribution is the same, which it should be.

Thank you,

Jeff

We're Moving!

Create My New Community Account Now

spark vertica connector architecture problem

Comments

Leave a Comment