spark vertica connector architecture problem

hey 

 

i noticed for  something bad in how the connector works.

 

just for mention

i have big table on vertica , segmented by key on all nodes....

 

as your documentation said , spark rdd query data from vertica as how the table segmented and spark partition

is build on top of that.

 

but as i said, when i print each parition data after quering segmented table, i alaways get 1 partition with all the data from a random server and the rest contain 0 rows !

 

its looks like vertica connector merge all the data to random node instead of leave it disterbuted as the servers as documentation said.....

 

its really important for me to know if im doing something worng

 

waiting for respone

thanks

 

Ben Fishman.

Comments

  • [Deleted User][Deleted User] Administrator

    Hi Ben,

     

    I've passed your message over to someone on the HPE team working on our Spark integration. Hopefully they will reach out soon.

     

    Thanks,
    Mark

     

    HPE Vertica Product

  • [Deleted User][Deleted User] Administrator

    duplicate

  • Hi Ben,

     

    Which version of the connector are you using? Are you able to print out the queries issued by the connector? 

     

    Thank you,

    Ed

  • Hey 

     

    Thanks for your quick reply

     

    Im using vertica 7.2.1 and spark connector 0.2.2

     

     

    Lets says i have table tbl (a varchar,b varchar,c varchar) segmented by c all nodes.

    When i query select a,b,c from spark

     

    On vertica logs written:

     

    select a,b,c from tbl where( (0x00000000ffffffff & hash(c) >= 2863311532 and (0x00000000ffffffff & hash(c)) <= 3101920825 )

     

    Each query written on vertica.log alot each time with diffrent hash range...

     

     

    Waiting for respone

     

    Ben.

     

     

     

  • Hey Ben,

     

    Thanks for reporting this.

     

    I may know some possible causes. Can you try two things?

     

    (1) Verify that you don't have skewed data. If you have all of your table data segmented by c and all of the c data are similar values, you can end up with all of the data being in one range. Therefore, you can try to run

     

    select count(*) from (select a,b,c from tbl where( (0x00000000ffffffff & hash(c) >= 2863311532 and (0x00000000ffffffff & hash(c)) <= 3101920825 ) as foo;

     

    for each of the ranges to see how many naturally fall into each hash range.

     

    (2) Another issue (that we're aware of) is that the table name (when passed to the connector) can be case-sensitive. For example, if your table is named "Test", and you use "test" as your table name option, it can result in this behavior (no crash, but all data go to one partition). We will fix this is the next release. In the meantime, can you (if this is the case for you), provide the table name with an exact upper and lower case match?

     

    Thanks,

    Edward

  • Hey 

     

    (1) The data in the table is segmented almost equally so from that point , everything is good

     

    (2) the names of the tables that i had queried in my code is in lower case.  (schema too)

     

    One thing that i can say is that i query the data from a view that i had created on vertica

    (select * from where c in (select * from tmp)

     

    * tmp table is overwrite each iteration with new keys

      and the view query only relevant keys 

     

    I must say that before i used the view in my code. i had only query simple table and still vertica behavior was the same (1 partition with all the data)

     

    Thanks

     

    Ben

     

     

  • Hi Ben,

     

    Edward and I discussed a bit, and cannot duplicate this error so we would like to ask you to please give a quick test if possible.    Basically we create a very small table then confirm the distribution on Vertica.  Then run the connector and check the same distribution on Spark partitions.   Depending on your number of vertica nodes, please enter at least that many unique values.

     

    // lets say you have 4 Vertica nodes...

    Create table create table t (a int);

    insert into t values ('1');

    insert into t values ('2');

    insert into t values ('3');

    insert into t values ('4');

    commit;

     

    // verify the distibution of rows on each Vertica node

    select  s.node_name,s.schema_name,s.projection_name,s.storage_type,s.total_row_count from v_monitor.storage_containers s where s.projection_name='t_b0';

     

    Please execute the connector with 4 partitions (again assuming you have 4 Vertica nodes) and see if the distribution is the same, which it should be.  

     

    Thank you,

    Jeff

     

     

     

Leave a Comment

BoldItalicStrikethroughOrdered listUnordered list
Emoji
Image
Align leftAlign centerAlign rightToggle HTML viewToggle full pageToggle lights
Drop image/file