Hash function is not reproducible

Running hash twice on two tables gives different result

Comments

  • Jim_KnicelyJim_Knicely - Select Field - Administrator
    edited July 2019

    That seems highly unlikely.

    dbadmin=> CREATE TABLE a (user_id INT);
    CREATE TABLE
    
    dbadmin=> CREATE TABLE b (user_id INT);
    CREATE TABLE
    
    dbadmin=> INSERT INTO a SELECT 26712;
     OUTPUT
    --------
          1
    (1 row)
    
    dbadmin=> INSERT INTO b SELECT 26712;
     OUTPUT
    --------
          1
    (1 row)
    
    dbadmin=> SELECT user_id, HASH(user_id), HASH(user_id, 'a') FROM a
    dbadmin-> UNION ALL
    dbadmin-> SELECT user_id, HASH(user_id), HASH(user_id, 'a') FROM b;
     user_id |        HASH        |        HASH
    ---------+--------------------+---------------------
       26712 | 734013455889912858 | 6622807832315002947
       26712 | 734013455889912858 | 6622807832315002947
    (2 rows)
    

    What client tool are you using? Try running your query in vsql.

  • Bryan_HBryan_H Vertica Employee Administrator

    Is user_id the same data type in both tables? I get one result as integer, and a different result as varchar:

    dbadmin=> select hash(26712, 'a');

    hash

    6622807832315002947
    (1 row)

    dbadmin=> select hash('26712', 'a');

    hash

    3488463230331277936
    (1 row)

  • Jim_KnicelyJim_Knicely - Select Field - Administrator
    edited July 2019

    I though of that :) But you cannot UNION ALL a VARCHAR with an INT.

    dbadmin=> \d a
                                    List of Fields by Tables
     Schema | Table | Column  | Type | Size | Default | Not Null | Primary Key | Foreign Key
    --------+-------+---------+------+------+---------+----------+-------------+-------------
     public | a     | user_id | int  |    8 |         | f        | f           |
    (1 row)
    
    dbadmin=> \d b
                                        List of Fields by Tables
     Schema | Table | Column  |    Type     | Size | Default | Not Null | Primary Key | Foreign Key
    --------+-------+---------+-------------+------+---------+----------+-------------+-------------
     public | b     | user_id | varchar(10) |   10 |         | f        | f           |
    (1 row)
    
    dbadmin=> SELECT * FROM a;
     user_id
    ---------
       26712
    (1 row)
    
    dbadmin=> SELECT * FROM b;
     user_id
    ---------
     26712
    (1 row)
    
    dbadmin=> SELECT user_id, hash(user_id, 'a') FROM a;
     user_id |        hash
    ---------+---------------------
       26712 | 6622807832315002947
    (1 row)
    
    dbadmin=> SELECT user_id, hash(user_id, 'a') FROM b;
     user_id |        hash
    ---------+---------------------
     26712   | 3488463230331277936
    (1 row)
    
    dbadmin=> SELECT user_id, hash(user_id, 'a') FROM a
    dbadmin-> UNION ALL
    dbadmin-> SELECT user_id, hash(user_id, 'a') FROM b;
    ERROR 3429:  For 'UNION', types int and varchar are inconsistent
    
    dbadmin=> SELECT user_id, hash(user_id, 'a') FROM b
    dbadmin-> UNION ALL
    dbadmin-> SELECT user_id, hash(user_id, 'a') FROM a;
    ERROR 3429:  For 'UNION', types varchar and int are inconsistent
    DETAIL:  Columns: user_id and user_id
    

    Maybe the client tool is doing the conversion?

    dbadmin=> SELECT user_id::VARCHAR, hash(user_id, 'a') FROM a
    dbadmin-> UNION ALL
    dbadmin-> SELECT user_id, hash(user_id, 'a') FROM b;
     user_id |        hash
    ---------+---------------------
     26712   | 6622807832315002947
     26712   | 3488463230331277936
    
  • marcothesanemarcothesane - Select Field - Administrator

    To illustrate what Jim and Bryan were mentioning:
    Here's a test I just ran:

    \x
    SELECT
      HASH(26712                ) AS hash_INT
    , HASH(26712::NUMERIC(18)   ) AS hash_NUMERIC_18   
    , HASH(26712::NUMERIC(32)   ) AS hash_NUMERIC_32   
    , HASH(26712::NUMERIC(18, 4)) AS hash_NUMERIC_18_4
    , HASH(26712::FLOAT         ) AS hash_FLOAT         
    , HASH(26712::VARCHAR(256)  ) AS hash_VARCHAR_256  
    , HASH(26712::VARCHAR(64)   ) AS hash_VARCHAR_64
    , HASH(26712::VARCHAR(5)    ) AS hash_VARCHAR_5 
    , HASH(26712::CHAR(256)     ) AS hash_CHAR_256    
    , HASH(26712::CHAR(5)       ) AS hash_CHAR_5      
    ;
    -- out Expanded display is on.
    -- out -[ RECORD 1 ]-----+--------------------
    -- out hash_INT          | 734013455889912858
    -- out hash_NUMERIC_18   | 734013455889912858
    -- out hash_NUMERIC_32   | 4839844074090391799
    -- out hash_NUMERIC_18_4 | 6298167399894197585
    -- out hash_FLOAT        | 7382987690975279314
    -- out hash_VARCHAR_256  | 438488188464369242
    -- out hash_VARCHAR_64   | 438488188464369242
    -- out hash_VARCHAR_5    | 438488188464369242
    -- out hash_CHAR_256     | 438488188464369242
    -- out hash_CHAR_5       | 438488188464369242
    -- out 
    -- out Time: First fetch (1 row): 7.563 ms. All rows formatted: 7.594 ms
    

    So it's not only the value , but also its type, and sometimes also the size and precision, that you're hashing, that makes the difference....

    @hanans: Can you share with us the exact type of the columns in the table definition of the columns that you're hashing?

  • Jim_KnicelyJim_Knicely - Select Field - Administrator
    edited July 2019

    Heh heh. Must be a mix of data types that Vertica is auto coercing allowing the UNION to succeed :)

    dbadmin=> SELECT hash(26712::FLOAT, 'a')
    dbadmin-> UNION ALL
    dbadmin-> SELECT hash(26712::INT, 'a');
            hash
    ---------------------
      304723032549234641
     6622807832315002947
    (2 rows)
    
    dbadmin=> SELECT hash(26712::FLOAT, 'a')
    dbadmin-> UNION ALL
    dbadmin-> SELECT hash(26712::VARCHAR, 'a') ;
            hash
    ---------------------
      304723032549234641
     3488463230331277936
    (2 rows)
    
    dbadmin=> SELECT hash(26712::NUMERIC, 'a')
    dbadmin-> UNION ALL
    dbadmin-> SELECT hash(26712::FLOAT, 'a');
            hash
    ---------------------
     6622807832315002947
      304723032549234641
    (2 rows)
    
  • marcothesanemarcothesane - Select Field - Administrator

    I went through it all again, hashing with user_id and 'a':

    \x
    SELECT
      HASH(26712                , 'a') AS hash_with_INT
    , HASH(26712::NUMERIC(18)   , 'a') AS hash_with_NUMERIC_18   
    , HASH(26712::NUMERIC(32)   , 'a') AS hash_with_NUMERIC_32   
    , HASH(26712::NUMERIC(18, 4), 'a') AS hash_with_NUMERIC_18_4
    , HASH(26712::FLOAT         , 'a') AS hash_with_FLOAT         
    , HASH(26712::VARCHAR(256)  , 'a') AS hash_with_VARCHAR_256  
    , HASH(26712::VARCHAR(64)   , 'a') AS hash_with_VARCHAR_64
    , HASH(26712::VARCHAR(5)    , 'a') AS hash_with_VARCHAR_5 
    , HASH(26712::CHAR(256)     , 'a') AS hash_with_CHAR_256    
    , HASH(26712::CHAR(5)       , 'a') AS hash_with_CHAR_5      
    ;
    -- out Expanded display is on.
    -- out -[ RECORD 1 ]----------+--------------------
    -- out hash_with_INT          | 6622807832315002947
    -- out hash_with_NUMERIC_18   | 6622807832315002947
    -- out hash_with_NUMERIC_32   | 2877367361208514378
    -- out hash_with_NUMERIC_18_4 | 8063475701858845121
    -- out hash_with_FLOAT        | 304723032549234641
    -- out hash_with_VARCHAR_256  | 3488463230331277936
    -- out hash_with_VARCHAR_64   | 3488463230331277936
    -- out hash_with_VARCHAR_5    | 3488463230331277936
    -- out hash_with_CHAR_256     | 3488463230331277936
    -- out hash_with_CHAR_5       | 3488463230331277936
    

    If we look at hanans's output, his 1st and 2nd output row in the report, with 2877367361208514378 as the hash value must come from dwh.sm_user_profile, where user_id must be between a NUMERIC(19,n) to a NUMERIC(37,n), and the 3rd output row must come from agg.agg_sm_daily_users_stats_by_platform, where user_id is either an INT or between a NUMERIC(1) and a NUMERIC(18).

  • wow!you guys are awesome!

Leave a Comment

BoldItalicStrikethroughOrdered listUnordered list
Emoji
Image
Align leftAlign centerAlign rightToggle HTML viewToggle full pageToggle lights
Drop image/file