Prevent specific column value-rows in a Vertica table

milanpraha · June 2023

Hello, is there a way to prevent loading rows with specific column values into a Vertica table? For example I have a table that has data for multiple companies and if the column CompanyID=1 I want that data copied, if it is another value, it should not be.

I know it would be ideal to exclude CompanyID<>1 as part of the batch load, but we don't have that option. On relational databases I would create a trigger that would prevent the unwanted rows, how can it be done in Vertica? CompanyID is not the PK.

marcothesane · June 2023

I can imagine two ways - both not too effective, but there you are.

This CSV file:

dbadmin@gessnerm-HP-ZBook-15-G3:~/data$ cat foo.csv
id,org,nam
1,1,arthur
1,2,arthur
1,3,ford
1,1,trillian
1,1,zaphod
1,1,slartibartfast

pumped through my d2l:

dbadmin@gessnerm-HP-ZBook-15-G3:~/data$ d2l -coldelcomma foo.csv
CREATE TABLE foo (
  id  SMALLINT    NOT NULL
, org SMALLINT    NOT NULL
, nam VARCHAR(14) NOT NULL
);

modified and run with vsql:

dbadmin@gessnerm-HP-ZBook-15-G3:~/data$ vsql -af ../sql/crtb.sql
CREATE TABLE foo (
  id  SMALLINT    NOT NULL
, org SMALLINT    NOT NULL
, nam VARCHAR(14) NOT NULL
, CONSTRAINT foo_orgid_1 CHECK (org=1) ENABLED
);
CREATE TABLE

Load:

COPY foo
FROM LOCAL '/home/dbadmin/data/foo.csv'
SKIP 1
DELIMITER ','
EXCEPTIONS '/dev/tty'
;
vsql:../sql/ltb.sql:7: ERROR 7230:  Check constraint 'dbadmin.foo.foo_orgid_1' (foo.org = 1) violation: 'org=2'

Well, finds the error. But rejects everything:

dbadmin@gessnerm-HP-ZBook-15-G3:~/data$ vsql -c "select * from foo"
 id | org | nam 
----+-----+-----
(0 rows)

Next try, taking advantage of the fact that we can partition a table and drop its partitions very quickly:

dbadmin@gessnerm-HP-ZBook-15-G3:~/data$ vsql -af ../sql/crtb.sql
DROP TABLE IF EXISTS foo;
DROP TABLE
CREATE TABLE foo (
  id  SMALLINT    NOT NULL
, org SMALLINT    NOT NULL
, nam VARCHAR(14) NOT NULL
)
PARTITION BY org;
CREATE TABLE

Loading this table:

dbadmin@gessnerm-HP-ZBook-15-G3:~/data$ vsql -af ../sql/ltb.sql
COPY foo
FROM LOCAL '/home/dbadmin/data/foo.csv'
SKIP 1
DELIMITER ','
EXCEPTIONS '/dev/tty'
;

Rows Loaded

       6
(1 row)

Loads everything, as expected.
But now:

dbadmin@gessnerm-HP-ZBook-15-G3:~/data$ vsql -c "SELECT DISTINCT org FROM foo"

org

1
2
3
(3 rows)

We don't want org to contain anything but 1. So we quickly get rid of all other partitions:

dbadmin@gessnerm-HP-ZBook-15-G3:~/data$ vsql -c "SELECT DROP_PARTITIONS('foo',2,3)"; vsql -c "SELECT * FROM foo"

DROP_PARTITIONS

Partition dropped
(1 row)

id | org | nam
----+-----+----------------
1 | 1 | arthur
1 | 1 | slartibartfast
1 | 1 | trillian
1 | 1 | zaphod
(4 rows)

Yes, we load too much. But then, our parser is most efficient and will determine the values for org quicker than other methods. And you get rid of the unwanted stuff very quickly ...

VValdar · June 2023

In addition to the second solution of @marcothesane, you can provide your users a view on top of the table with the filter org=1 then do the drop partition whenever you want,

mosheg · June 2023

In addition, if the "org" field has a wide range of possible values, consider the following example:

-- This script is responsible for loading data into the target table only if the "org" field input falls within the desired range.
-- The desired "org" value in this exampe is 1.
-- For optimal performance, Vertica recommends keeping the number of ungrouped partitions between 10 and 20, without exceeding 50.
-- If the "org" field has a wide range of possible values, we will not partition the table by that field.
-- To determine if a row falls within the desired field range during the load process, we introduce a column called 'is_loadable'.
-- If the input falls within the range, the value of 'is_loadable' will be set to true; otherwise, it will be set to false.
-- We use a CASE statement during the load to assign the value of 'is_loadable' based on the field range we want to load.
-- Please adjust the conditions and values to match your desired range.
-- Additional conditions can be added if necessary.
-- After loading the data, we can drop the partition that have false values in the 'is_loadable' column.
-- This operation will remove the rows that fall outside the desired field range.
\set FILE_PATH `pwd`/foo.csv
\set FILE_PATH '''':FILE_PATH''''
CREATE TABLE foo (
  id  SMALLINT    NOT NULL,
  org SMALLINT    NOT NULL,
  name VARCHAR(14) NOT NULL,
  is_loadable BOOLEAN NOT NULL
)
PARTITION BY is_loadable;

COPY foo
( id, org, name,
  is_loadable as case when org = 1 then TRUE else FALSE end
)
FROM LOCAL :FILE_PATH
SKIP 1
DELIMITER ',' ;

SELECT DROP_PARTITIONS('foo',FALSE,FALSE);

\echo 'Input file content:'
\! cat foo.csv
\echo '__________________________________'
\echo 'Rows loaded, with the desired org field range:'
SELECT * FROM foo ORDER BY id;

Output:

Input file content:
id,org,name
1,1,arthur
2,2,arthur
3,3,ford
4,1,trillian
5,1,zaphod
6,1,slartibartfast

__________________________________
Rows loaded, with the desired org field range:
SELECT * FROM foo ORDER BY id;
 id | org |      name      | is_loadable
----+-----+----------------+-------------
  1 |   1 | arthur         | t
  4 |   1 | trillian       | t
  5 |   1 | zaphod         | t
  6 |   1 | slartibartfast | t
(4 rows)

marcothesane · June 2023

Just because we all like playing, and @mosheg has introduced the derived value in a COPY statement, as an AS clause in the column list:
You can also formulate a load script to give org the same "wrong" value if the input is not equal to 1 - thus avoiding an additional column that is just needed to separate what you want from what you don't want.

The content of the input file foo.csv is still:

Input file content:
id,org,name
1,1,arthur
2,2,arthur
3,3,ford
4,1,trillian
5,1,zaphod
6,1,slartibartfast

Create table:

CREATE TABLE IF NOT EXISTS foo (
  id   SMALLINT    NOT NULL
, org  SMALLINT    NOT NULL
, name VARCHAR(14) NOT NULL
)
PARTITION BY org;
-- out CREATE TABLE
-- out Time: First fetch (0 rows): 210.850 ms. All rows formatted: 210.895 ms

Copy command:

COPY foo (
  id
  , org_in FILLER INTEGER
  , org AS CASE org_in WHEN 1 THEN org_in ELSE -1 END
  , name
)
FROM LOCAL 'foo.csv'
DELIMITER ',' SKIP 1
REJECTED DATA 'foo.bad' EXCEPTIONS '/dev/tty'
;
-- out  Rows Loaded 
-- out -------------
-- out            6
-- out (1 row)
-- out 
-- out Time: First fetch (1 row): 636.264 ms. All rows formatted: 636.370 ms

Control query:

SELECT * FROM foo ORDER BY id; 
-- out  id | org |      name      
-- out ----+-----+----------------
-- out   1 |   1 | arthur
-- out   2 |  -1 | arthur
-- out   3 |  -1 | ford
-- out   4 |   1 | trillian
-- out   5 |   1 | zaphod
-- out   6 |   1 | slartibartfast
-- out (6 rows)
-- out 
-- out Time: First fetch (6 rows): 215.823 ms. All rows formatted: 215.963 ms

drop the one partition:

SELECT DROP_PARTITIONS('foo',-1,-1);
-- out   DROP_PARTITIONS  
-- out -------------------
-- out  Partition dropped
-- out (1 row)
-- out 
-- out Time: First fetch (1 row): 214.895 ms. All rows formatted: 215.000 ms

Final control query:

SELECT * FROM foo;
-- out  id | org |      name      
-- out ----+-----+----------------
-- out   1 |   1 | arthur
-- out   4 |   1 | trillian
-- out   5 |   1 | zaphod
-- out   6 |   1 | slartibartfast
-- out (4 rows)
-- out 
-- out Time: First fetch (4 rows): 215.297 ms. All rows formatted: 215.426 ms

Prevent specific column value-rows in a Vertica table

Answers

Leave a Comment