Filtering Data While Loading into Vertica
Suppose you have a CSV file and you want to copy some, but not all, of its contents into a Vertica table. There are two ways you can do this:
• Use the SKIP keyword with COPY.
• Use the head
or tail
Linux command.
Let's see how this works.
The Data
Here's a small example data set, a CSV file named students.csv, that contains the following information for 20 students:
Smith, John, 1977, 1981, Bowers Hall Brown, Tim, 1978, 1982, Blue Hall Howe, Mary, 1976, 1980, White Hall Kelly, Sarah, 1977, 1981, Blue Hall Jones, William, 1977, 1981, Johnson Hall Brady, Mark, 1976, 1980, Madison Hall Black, Howard, 1975, 1979, Bowers Hall King, Martha, 1976, 1980, White Hall McCoy, Keith, 1978, 1982, Madison Hall McDonald, Susan, 1977, 1981, Johnson Hall Williams, Joe, 1975, 1979, Bowers Hall Johnson, Julie, 1978, 1982, White Hall Barry, Polly, 1976, 1980, Blue Hall Wilson, Martin, 1975, 1979, Johnson Hall Adams, Harry, 1976, 1980, White Hall Ball, Terry, 1977, 1981, Bowers Hall Chan, Ty, 1978, 1982, Johnson Hall David, Donald, 1975, 1979, Madison Hall Emory, Fletcher, 1976, 1980, White Hall Flaherty, Finbarr, 1977 1981, Blue Hall
To load this data into Vertica, define a table as follows:
=> CREATE TABLE students ( Last VARCHAR(64), First VARCHAR(64), EnrollYear INT, GradYear INT, Dorm VARCHAR(64) );
Now you can load the data from the CSV file into the students table:
=> COPY students FROM '/home/dbadmin/students.csv' DELIMITER ','; Rows loaded ------------- 20 (1 row) => SELECT * FROM students; Last | First | EnrollYear | GradYear | Dorm ----------+-----------+------------+----------+--------------- Howe | Mary | 1976 | 1980 | White Hall King | Martha | 1976 | 1980 | White Hall McCoy | Keith | 1978 | 1982 | Madison Hall Williams | Joe | 1975 | 1979 | Bowers Hall Barry | Polly | 1976 | 1980 | Blue Hall Emory | Fletcher | 1976 | 1980 | White Hall Flaherty | Finbarr | 1977 | 1981 | Blue Hall Jones | William | 1977 | 1981 | Johnson Hall Brady | Mark | 1976 | 1980 | Madison Hall Black | Howard | 1975 | 1979 | Bowers Hall Wilson | Martin | 1975 | 1979 | Johnson Hall David | Donald | 1975 | 1979 | Madison Hall Smith | John | 1977 | 1981 | Bowers Hall Brown | Tim | 1978 | 1982 | Blue Hall Kelly | Sarah | 1977 | 1981 | Blue Hall McDonald | Susan | 1977 | 1981 | Johnson Hall Johnson | Julie | 1978 | 1982 | White Hall Adams | Harry | 1976 | 1980 | White Hall Ball | Terry | 1977 | 1981 | Bowers Hall Chan | Ty | 1978 | 1982 | Johnson Hall (20 rows)
Using COPY to Filter Rows
Let's create a new table and copy the same table except for the first 10 entries. To omit the first 10 entries in the CSV file, use the SKIP keyword. We'll name the new table students_last_ten:
=> CREATE TABLE students_last_ten ( Last VARCHAR(64), First VARCHAR(64), EnrollYear INT, GradYear INT, Dorm VARCHAR(64) ); => COPY students_last_ten FROM '/home/dbadmin/students.csv' DELIMITER ',' SKIP 10; Rows Loaded ------------- 10 (1 row) => SELECT * FROM students_last_ten; Last | First | EnrollYear | GradYear | Dorm ----------+-----------+------------+----------+------------- Williams | Joe | 1975 | 1979 | Bowers Hall Barry | Polly | 1976 | 1980 | Blue Hall Emory | Fletcher | 1976 | 1980 | White Hall Flaherty | Finbarr | 1977 | 1981 | Blue Hall Johnson | Julie | 1978 | 1982 | White Hall Adams | Harry | 1976 | 1980 | White Hall Ball | Terry | 1977 | 1981 | Bowers Hall Chan | Ty | 1978 | 1982 | Johnson Hall Wilson | Martin | 1975 | 1979 | Johnson Hall David | Donald | 1975 | 1979 | Madison Hall (10 rows)
Using Linux to Filter Rows
You can also use the Linux head
and tail
commands to write certain rows from the top or the bottom of a file to a Vertica table. For this next example, we'll load the first 10 rows from the student data.
=> CREATE TABLE students_last_ten ( Last VARCHAR(64), First VARCHAR(64), EnrollYear INT, GradYear INT, Dorm VARCHAR(64) ); CREATE TABLE => \q
Let's run the tail
command to verify that it gives us the last 10 rows. Note that running the head
command would give us the first 10 rows.
$ tail -10 /home/dbadmin/students.csv Williams, Joe, 1975, 1979, Bowers Hall Johnson, Julie, 1978, 1982, White Hall Barry, Polly, 1976, 1980, Blue Hall Wilson, Martin, 1975, 1979, Johnson Hall Adams, Harry, 1976, 1980, White Hall Ball, Terry, 1977, 1981, Bowers Hall Chan, Ty, 1978, 1982, Johnson Hall David, Donald, 1975, 1979, Madison Hall Emory, Fletcher, 1976, 1980, White Hall Flaherty, Finbarr, 1977, 1981, Blue Hall
Now we'll rerun the tail
command and pipe that output into the Vertica table we created.
$ head -10 /home/dbadmin/students.csv | vsql -d vmart -U dbadmin -c "COPY students_last_ten FROM LOCAL STDIN DELIMITER ',' ;" Rows Loaded ------------- 10 (1 row)
Let's make sure the data was actually loaded into the student_last_ten table.
$ vsql => SELECT * FROM students_last_ten; Last | First | EnrollYear | GradYear | Dorm ----------+----------+------------+----------+--------------- Black | Howard | 1975 | 1979 | Bowers Hall Brady | Mark | 1976 | 1980 | Madison Hall Jones | William | 1977 | 1981 | Johnson Hall Brown | Tim | 1978 | 1982 | Blue Hall Kelly | Sarah | 1977 | 1981 | Blue Hall McDonald | Susan | 1977 | 1981 | Johnson Hall Smith | John | 1977 | 1981 | Bowers Hall Howe | Mary | 1976 | 1980 | White Hall King | Martha | 1976 | 1980 | White Hall McCoy | Keith | 1978 | 1982 | Madison Hall (10 rows)
For More Information
• COPY Parameters in the Vertica Documentation
• head
Man Page
• tail
Man Page