Storing source file names when loading data

Bryan_HBryan_H Vertica Employee Administrator

You can manipulate fields when loading data with the COPY command, but did you know you can also store the source file name? Let's see how:

Suppose I have hourly data files in the pattern hourlyXX.csv where XX is the hour from 00-23. I want to load them into a single staging table.

Let's define the staging table:

dbadmin=> create table hourly (filename VARCHAR, symbol VARCHAR, tm TIME, q1 int, q2 int, q3 int);

COPY data with "filename" set to the source file name or CURRENT_LOAD_SOURCE() and use a wildcard to match all "hourly" files:

dbadmin=> COPY hourly (filename AS CURRENT_LOAD_SOURCE(), symbol, tm, q1, q2, q3) FROM '/tmp/hourly*.csv';

Now let's see what we loaded:

dbadmin=> select * from hourly ;
filename | symbol | tm | q1 | q2 | q3
-------------------+--------+----------+----+----+----
/tmp/hourly00.csv | AAA | 00:01:02 | 1 | 2 | 3
/tmp/hourly00.csv | BBB | 00:02:04 | 1 | 2 | 3
/tmp/hourly01.csv | AAA | 01:01:02 | 1 | 2 | 3
/tmp/hourly01.csv | BBB | 01:02:04 | 1 | 2 | 3
/tmp/hourly02.csv | AAA | 02:01:02 | 1 | 2 | 3
/tmp/hourly02.csv | BBB | 02:02:04 | 1 | 2 | 3

This is another way Vertica can help process data on load and avoid multiple staging steps. Let us know how it works for you!

Reference:
https://www.vertica.com/docs/10.1.x/HTML/Content/Authoring/SQLReferenceManual/Functions/CurrentLoad/CURRENT_LOAD_SOURCE.htm

Tagged:
Sign In or Register to comment.