Options

Prevent specific column value-rows in a Vertica table

Hello, is there a way to prevent loading rows with specific column values into a Vertica table? For example I have a table that has data for multiple companies and if the column CompanyID=1 I want that data copied, if it is another value, it should not be.

I know it would be ideal to exclude CompanyID<>1 as part of the batch load, but we don't have that option. On relational databases I would create a trigger that would prevent the unwanted rows, how can it be done in Vertica? CompanyID is not the PK.

Answers

  • Options
    marcothesanemarcothesane - Select Field - Administrator

    I can imagine two ways - both not too effective, but there you are.

    This CSV file:

    dbadmin@gessnerm-HP-ZBook-15-G3:~/data$ cat foo.csv
    id,org,nam
    1,1,arthur
    1,2,arthur
    1,3,ford
    1,1,trillian
    1,1,zaphod
    1,1,slartibartfast
    

    pumped through my d2l:

    dbadmin@gessnerm-HP-ZBook-15-G3:~/data$ d2l -coldelcomma foo.csv
    CREATE TABLE foo (
      id  SMALLINT    NOT NULL
    , org SMALLINT    NOT NULL
    , nam VARCHAR(14) NOT NULL
    );
    

    modified and run with vsql:

    dbadmin@gessnerm-HP-ZBook-15-G3:~/data$ vsql -af ../sql/crtb.sql
    CREATE TABLE foo (
      id  SMALLINT    NOT NULL
    , org SMALLINT    NOT NULL
    , nam VARCHAR(14) NOT NULL
    , CONSTRAINT foo_orgid_1 CHECK (org=1) ENABLED
    );
    CREATE TABLE
    

    Load:

    COPY foo
    FROM LOCAL '/home/dbadmin/data/foo.csv'
    SKIP 1
    DELIMITER ','
    EXCEPTIONS '/dev/tty'
    ;
    vsql:../sql/ltb.sql:7: ERROR 7230:  Check constraint 'dbadmin.foo.foo_orgid_1' (foo.org = 1) violation: 'org=2'
    

    Well, finds the error. But rejects everything:

    dbadmin@gessnerm-HP-ZBook-15-G3:~/data$ vsql -c "select * from foo"
     id | org | nam 
    ----+-----+-----
    (0 rows)
    

    Next try, taking advantage of the fact that we can partition a table and drop its partitions very quickly:

    dbadmin@gessnerm-HP-ZBook-15-G3:~/data$ vsql -af ../sql/crtb.sql
    DROP TABLE IF EXISTS foo;
    DROP TABLE
    CREATE TABLE foo (
      id  SMALLINT    NOT NULL
    , org SMALLINT    NOT NULL
    , nam VARCHAR(14) NOT NULL
    )
    PARTITION BY org;
    CREATE TABLE
    

    Loading this table:

    dbadmin@gessnerm-HP-ZBook-15-G3:~/data$ vsql -af ../sql/ltb.sql
    COPY foo
    FROM LOCAL '/home/dbadmin/data/foo.csv'
    SKIP 1
    DELIMITER ','
    EXCEPTIONS '/dev/tty'
    ;
    
    Rows Loaded
    
           6
    (1 row)
    

    Loads everything, as expected.
    But now:

    dbadmin@gessnerm-HP-ZBook-15-G3:~/data$ vsql -c "SELECT DISTINCT org FROM foo"
    
    org
    
    1
    2
    3
    (3 rows)
    

    We don't want org to contain anything but 1. So we quickly get rid of all other partitions:

    dbadmin@gessnerm-HP-ZBook-15-G3:~/data$ vsql -c "SELECT DROP_PARTITIONS('foo',2,3)"; vsql -c "SELECT * FROM foo"
    
    DROP_PARTITIONS
    
    Partition dropped
    (1 row)
    
    id | org | nam
    ----+-----+----------------
    1 | 1 | arthur
    1 | 1 | slartibartfast
    1 | 1 | trillian
    1 | 1 | zaphod
    (4 rows)
    

    Yes, we load too much. But then, our parser is most efficient and will determine the values for org quicker than other methods. And you get rid of the unwanted stuff very quickly ...

  • Options
    VValdarVValdar Vertica Employee Employee

    In addition to the second solution of @marcothesane, you can provide your users a view on top of the table with the filter org=1 then do the drop partition whenever you want,

  • Options
    moshegmosheg Vertica Employee Administrator
    edited June 2023

    In addition, if the "org" field has a wide range of possible values, consider the following example:

    -- This script is responsible for loading data into the target table only if the "org" field input falls within the desired range.
    -- The desired "org" value in this exampe is 1.
    -- For optimal performance, Vertica recommends keeping the number of ungrouped partitions between 10 and 20, without exceeding 50.
    -- If the "org" field has a wide range of possible values, we will not partition the table by that field.
    -- To determine if a row falls within the desired field range during the load process, we introduce a column called 'is_loadable'.
    -- If the input falls within the range, the value of 'is_loadable' will be set to true; otherwise, it will be set to false.
    -- We use a CASE statement during the load to assign the value of 'is_loadable' based on the field range we want to load.
    -- Please adjust the conditions and values to match your desired range.
    -- Additional conditions can be added if necessary.
    -- After loading the data, we can drop the partition that have false values in the 'is_loadable' column.
    -- This operation will remove the rows that fall outside the desired field range.
    \set FILE_PATH `pwd`/foo.csv
    \set FILE_PATH '''':FILE_PATH''''
    CREATE TABLE foo (
      id  SMALLINT    NOT NULL,
      org SMALLINT    NOT NULL,
      name VARCHAR(14) NOT NULL,
      is_loadable BOOLEAN NOT NULL
    )
    PARTITION BY is_loadable;
    
    COPY foo
    ( id, org, name,
      is_loadable as case when org = 1 then TRUE else FALSE end
    )
    FROM LOCAL :FILE_PATH
    SKIP 1
    DELIMITER ',' ;
    
    SELECT DROP_PARTITIONS('foo',FALSE,FALSE);
    
    \echo 'Input file content:'
    \! cat foo.csv
    \echo '__________________________________'
    \echo 'Rows loaded, with the desired org field range:'
    SELECT * FROM foo ORDER BY id;
    

    Output:

    Input file content:
    id,org,name
    1,1,arthur
    2,2,arthur
    3,3,ford
    4,1,trillian
    5,1,zaphod
    6,1,slartibartfast
    
    __________________________________
    Rows loaded, with the desired org field range:
    SELECT * FROM foo ORDER BY id;
     id | org |      name      | is_loadable
    ----+-----+----------------+-------------
      1 |   1 | arthur         | t
      4 |   1 | trillian       | t
      5 |   1 | zaphod         | t
      6 |   1 | slartibartfast | t
    (4 rows)
    
  • Options
    marcothesanemarcothesane - Select Field - Administrator

    Just because we all like playing, and @mosheg has introduced the derived value in a COPY statement, as an AS clause in the column list:
    You can also formulate a load script to give org the same "wrong" value if the input is not equal to 1 - thus avoiding an additional column that is just needed to separate what you want from what you don't want.

    The content of the input file foo.csv is still:

    Input file content:
    id,org,name
    1,1,arthur
    2,2,arthur
    3,3,ford
    4,1,trillian
    5,1,zaphod
    6,1,slartibartfast
    

    Create table:

    CREATE TABLE IF NOT EXISTS foo (
      id   SMALLINT    NOT NULL
    , org  SMALLINT    NOT NULL
    , name VARCHAR(14) NOT NULL
    )
    PARTITION BY org;
    -- out CREATE TABLE
    -- out Time: First fetch (0 rows): 210.850 ms. All rows formatted: 210.895 ms
    

    Copy command:

    COPY foo (
      id
      , org_in FILLER INTEGER
      , org AS CASE org_in WHEN 1 THEN org_in ELSE -1 END
      , name
    )
    FROM LOCAL 'foo.csv'
    DELIMITER ',' SKIP 1
    REJECTED DATA 'foo.bad' EXCEPTIONS '/dev/tty'
    ;
    -- out  Rows Loaded 
    -- out -------------
    -- out            6
    -- out (1 row)
    -- out 
    -- out Time: First fetch (1 row): 636.264 ms. All rows formatted: 636.370 ms                                                                                 
    

    Control query:

    SELECT * FROM foo ORDER BY id; 
    -- out  id | org |      name      
    -- out ----+-----+----------------
    -- out   1 |   1 | arthur
    -- out   2 |  -1 | arthur
    -- out   3 |  -1 | ford
    -- out   4 |   1 | trillian
    -- out   5 |   1 | zaphod
    -- out   6 |   1 | slartibartfast
    -- out (6 rows)
    -- out 
    -- out Time: First fetch (6 rows): 215.823 ms. All rows formatted: 215.963 ms
    

    drop the one partition:

    SELECT DROP_PARTITIONS('foo',-1,-1);
    -- out   DROP_PARTITIONS  
    -- out -------------------
    -- out  Partition dropped
    -- out (1 row)
    -- out 
    -- out Time: First fetch (1 row): 214.895 ms. All rows formatted: 215.000 ms
    

    Final control query:

    SELECT * FROM foo;
    -- out  id | org |      name      
    -- out ----+-----+----------------
    -- out   1 |   1 | arthur
    -- out   4 |   1 | trillian
    -- out   5 |   1 | zaphod
    -- out   6 |   1 | slartibartfast
    -- out (4 rows)
    -- out 
    -- out Time: First fetch (4 rows): 215.297 ms. All rows formatted: 215.426 ms
    

Leave a Comment

BoldItalicStrikethroughOrdered listUnordered list
Emoji
Image
Align leftAlign centerAlign rightToggle HTML viewToggle full pageToggle lights
Drop image/file