Query performance: Using Hash in the WHERE clause

I've been reading the TalendHPVerticaTipsAndTechniques pdf, and it mentions the following (see attached image):

 

 {

If you are using the Talend Community Edition, you should add a WHERE clause to the original query to chunk the data. This example will result in 4 chunks.

original_sql + " and hash(" + primaryKey + ") % " + noOfThreads + " = " + i

(with i running from 0 to noOfThreads - 1, since the modulo yields remainders 0 through noOfThreads - 1)

Example:

select if.* from inventory_fact if, warehouse_dimension wd where if.warehouse_key = wd.warehouse_key

The query above results in the 4 queries below:

select if.* from inventory_fact if, warehouse_dimension wd where if.warehouse_key = wd.warehouse_key and hash(product_key, date_key) % 4 = 0;

select if.* from inventory_fact if, warehouse_dimension wd where if.warehouse_key = wd.warehouse_key and hash(product_key, date_key) % 4 = 1;

select if.* from inventory_fact if, warehouse_dimension wd where if.warehouse_key = wd.warehouse_key and hash(product_key, date_key) % 4 = 2;

select if.* from inventory_fact if, warehouse_dimension wd where if.warehouse_key = wd.warehouse_key and hash(product_key, date_key) % 4 = 3;

 }
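
To make the chunking rule concrete, here is a minimal Java sketch of what a Talend job would build from the template above. The `chunkQueries` helper and its parameters are hypothetical, for illustration only. Note that `hash(...) % n` yields remainders 0 through n - 1, so the chunk index must start at 0, or one slice of the data is never read:

```java
import java.util.ArrayList;
import java.util.List;

public class ChunkQueries {
    // Hypothetical helper mirroring the Talend template:
    // original_sql + " and hash(" + primaryKey + ") % " + noOfThreads + " = " + i
    static List<String> chunkQueries(String originalSql, String primaryKey, int noOfThreads) {
        List<String> queries = new ArrayList<>();
        // hash(...) % n yields remainders 0 .. n-1, so i must start at 0;
        // chunks numbered 1 .. n would skip the remainder-0 rows and make
        // the "= n" chunk always empty.
        for (int i = 0; i < noOfThreads; i++) {
            queries.add(originalSql + " and hash(" + primaryKey + ") % " + noOfThreads + " = " + i);
        }
        return queries;
    }

    public static void main(String[] args) {
        String sql = "select if.* from inventory_fact if, warehouse_dimension wd"
                   + " where if.warehouse_key = wd.warehouse_key";
        // Prints the four chunked queries from the example above.
        for (String q : chunkQueries(sql, "product_key, date_key", 4)) {
            System.out.println(q + ";");
        }
    }
}
```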

 

I am new to Vertica, and based on what I've read so far, I understand that Vertica creates a superprojection that can be segmented across all nodes. What I'd like to understand is how the hash function in the WHERE clause helps improve performance. Will it create four buckets of column values and read each bucket's data from different nodes in parallel? How is this different from segmenting the projection? And if the projection is already segmented, how does this query work?

 

Any inputs on how the Hash function in the WHERE clause improves performance will be really helpful.

 

Thanks.

 

 

Comments

  •  The data in the projections you are about to use needs to be segmented on the same join key (your hash key).

    Why? 

     - When you segment the data across your cluster, it means that no single node's projection will hold all the data in that table. 

     

    Why is it faster? 

    - Projections segmented on the same key on the same node won't have to look for data in other projections on other nodes (no resegmentation is needed, and that is faster).

    - You want to avoid resegmentation, but this is not always possible. 

     

    - In your case:

    SELECT
    if.*
    FROM
    inventory_fact if,
    warehouse_dimension wd
    WHERE
    if.warehouse_key = wd.warehouse_key
    AND hash(product_key, date_key) % 4 = 1;

     

    - The projections used by both inventory_fact and warehouse_dimension have to be segmented on the same hash(product_key, date_key), and warehouse_key should be in the ORDER BY of the projections so no extra sorting is done.

      

    Any experts out there, please feel free to correct me if I am wrong.
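
    The resegmentation point above can be sketched with a toy model of segmentation. The `nodeFor` router below is a made-up hash (not Vertica's `hash()` function) and only illustrates why segmenting both projections on the join key keeps a join node-local, while segmenting the fact table on some other key scatters a join key's rows across nodes:

    ```java
    import java.util.Set;
    import java.util.TreeSet;

    public class Segmentation {
        // Toy node router; Vertica's real hash() differs -- any deterministic
        // function is enough to illustrate segmentation.
        static int nodeFor(int segKey, int nodeCount) {
            return Math.floorMod(segKey * 31 + 7, nodeCount);
        }

        public static void main(String[] args) {
            int nodes = 3;
            // fact rows as (product_key, warehouse_key) pairs
            int[][] factRows = {{1, 101}, {2, 101}, {3, 101}, {4, 102}};

            // Case 1: fact segmented on the JOIN key (warehouse_key) -> all
            // rows for warehouse 101 sit on one node, next to the matching
            // dimension row, so the join is node-local.
            Set<Integer> nodesByJoinKey = new TreeSet<>();
            for (int[] row : factRows)
                if (row[1] == 101) nodesByJoinKey.add(nodeFor(row[1], nodes));
            System.out.println("segmented on join key, warehouse 101 spans nodes: " + nodesByJoinKey);

            // Case 2: fact segmented on product_key -> rows for warehouse 101
            // scatter across nodes, so joining on warehouse_key forces
            // resegmentation (cross-node data movement) at query time.
            Set<Integer> nodesByOtherKey = new TreeSet<>();
            for (int[] row : factRows)
                if (row[1] == 101) nodesByOtherKey.add(nodeFor(row[0], nodes));
            System.out.println("segmented on product_key, warehouse 101 spans nodes: " + nodesByOtherKey);
        }
    }
    ```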


  • Adrian_Oprea_1 wrote:

     The data in the projections you are about to use needs to be segmented on the same join key (your hash key). […]


    Thanks Adrian.


    So that means using hash() will only improve performance if the projections for both tables have been segmented on these specific columns? Did I get that right? And if so, with only the superprojection on the table, using hash() in the WHERE clause wouldn't provide any improvement, right? Or does Vertica automatically store hash values for all columns, and that is why using hash() in the WHERE clause could help?

  • The suggestion from Talend is a method for creating multiple parallel load streams. It's really independent of the segmentation of the underlying projections for the tables. They are basically saying that if you want four parallel streams, then use hash() % 4 = 0, = 1, etc. to get four separate, distinct slices of the data; if you want 8, then use hash() % 8 = 0, = 1, etc. Talend would create 4 or 8 separate loads into Vertica, so you need to consider the amount of resources you want to devote to the parallel streams and how your resource pools are configured. As long as the disk, network, and CPU resources aren't maxed out, this will let you complete the load more quickly.

     

    When each of the loads executes, each of them will redistribute the incoming data across all of the nodes in the cluster - that is where the segmentation of the underlying projections comes into play.

     

      --Sharon
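
    The parallel-stream idea above can be sketched as follows. `chunk` stands in for one stream's chunked query (with the key itself as a toy hash in place of Vertica's `hash()`), and the checks in `main` confirm that the streams' slices are disjoint and together cover every row exactly once:

    ```java
    import java.util.*;
    import java.util.concurrent.*;
    import java.util.stream.*;

    public class ParallelStreams {
        // Returns the slice of keys one stream would read, mimicking
        // "hash(pk) % noOfThreads = i" (toy hash: the key itself).
        static Set<Integer> chunk(List<Integer> keys, int noOfThreads, int i) {
            return keys.stream()
                       .filter(k -> Math.floorMod(k, noOfThreads) == i)
                       .collect(Collectors.toSet());
        }

        public static void main(String[] args) throws Exception {
            List<Integer> keys = IntStream.range(0, 100).boxed().collect(Collectors.toList());
            int streams = 4;
            ExecutorService pool = Executors.newFixedThreadPool(streams);
            List<Future<Set<Integer>>> futures = new ArrayList<>();
            for (int i = 0; i < streams; i++) {
                final int slice = i;
                futures.add(pool.submit(() -> chunk(keys, streams, slice)));
            }
            Set<Integer> loaded = new HashSet<>();
            for (Future<Set<Integer>> f : futures) {
                for (int k : f.get()) {
                    // slices are disjoint: no key appears in two streams
                    boolean fresh = loaded.add(k);
                    assert fresh;
                }
            }
            pool.shutdown();
            // together the streams cover every row exactly once
            System.out.println(loaded.size());  // prints 100
        }
    }
    ```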

     
