Load data from Amazon S3 bucket to Vertica

Hi,

Requirement: Load data from an Amazon S3 bucket into a Vertica database.

Current setup:
- Three-node cluster hosted on AWS
- Version: Vertica 6.1.1
- OS: Linux CentOS 6
- Database and tables created; data loaded through the COPY command from the vsql client

Some of the data files reside on S3, and these have to be loaded into the database. I installed 'fuse' and mounted the S3 bucket on one of the cluster nodes (a Linux server) on which Vertica is hosted. Using the s3cmd command I am able to view all my files residing in the bucket.

Issue: When I run the command below to load the data from S3, it says the file is not available. It seems Vertica is not recognizing the S3 filesystem.

Command used to load data from S3:

COPY SAM_REP_SCHEMA.UN_MOB_MM_IOS_DATA FROM 's3://samevalbucket/mediadata/UN_MOB_MM_IOS_DATA _25Apr13.CSV' DELIMITER '|' ENCLOSED BY '"' NO ESCAPE NULL AS '\N' EXCEPTIONS '/tmp/UN_MOB_MM_IOS_DATA_25Apr13_Excep.txt' REJECTED DATA '/tmp/UN_MOB_MM_IOS_DATA_25Apr13_Rejct.txt' ON v_samdb_node0001;

Error:

ERROR 2886: Could not open file [s3n://samevalbucket/mediadata/UN_MOB_MM_IOS_DATA _25Apr13.CSV] for reading; No such file or directory

Questions:
- Is there any variable I need to set before loading from S3?
- Is this the right command to load data from S3 into Vertica?

Please suggest...

Comments

  • Hi. Two suggestions. Either you need to fetch the file from S3 (using s3cmd) and pipe it via STDIN into your COPY command, something like:

    s3cmd get s3://inputbucket/inputfile.csz | copy into tablename from STDIN

    Or you need to refer to your file in S3 using its URL rather than its S3 name, something like this:

    COPY into mytable FROM 'https://s3.amazonaws.com/inputbucket/inputfile'
  • Hi all. Peter, thanks for the suggestions. Piping into stdin is a great simple solution; a specific command line might be:

    $ s3cmd get s3://inputbucket/inputfile.csz | vsql -c "copy tablename from stdin;"

    Vertica does not currently support HTTP as a builtin. However, we have an SDK for extending COPY, and we ship a number of examples with Vertica, including one that can download over HTTP. If you look in the "/opt/vertica/sdk/examples/" directory on any machine in your cluster, the README.txt file in that directory has more details.

    Regarding the filesystem, though, I'm not sure if you're familiar with FUSE in general? At the FUSE and Linux levels in particular, there's no such thing as an "S3 filesystem"; that's not how FUSE works. What FUSE does is allow you to mount remote volumes as paths in your *local* filesystem. I'm not certain which S3 FUSE implementation you're using, but the process should generally look something like this: https://code.google.com/p/s3fs/wiki/FuseOverAmazon

    In particular, from that page, note the command:

    /usr/bin/s3fs mybucket /mnt

    I have not used FUSE and S3 together myself, but based on reading that page, I would expect the file to then be available at '/mnt/mediadata/UN_MOB_MM_IOS_DATA _25Apr13.CSV', or some similar path. If you have further questions about FUSE or the Amazon FUSE connector, you will probably get more of a response in one of their forums.

    As a rule of thumb: log into the server as dbadmin. If you can't see the file using the standard command-line "ls" and "cat" utilities, then Vertica can't see it either.
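    A minimal sketch of that rule of thumb, assuming the bucket from the question and a hypothetical mount point /mnt (adjust both to your setup):

    # mount the bucket locally via s3fs, per the page linked above
    /usr/bin/s3fs samevalbucket /mnt
    # run these as dbadmin; if they fail, a COPY from this path will fail too
    ls /mnt/mediadata/
    head -1 /mnt/mediadata/UN_MOB_MM_IOS_DATA_25Apr13.CSV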
  • Hi, thanks for the reply. I tried the command as below, but the load is not happening:

    s3cmd get 's3://samevalbucket/mediadata/UN_MOB_MM_IOS_DATA_25Apr13.CSV' | vsql SAMDB -U -w -c "COPY SAM_REP_SCHEMA.UN_MOB_MM_IOS_DATA_TMP FROM STDIN DELIMITER '|' NO ESCAPE NULL AS '\N' EXCEPTIONS '/tmp/iOS_Except.txt' REJECTED DATA '/tmp/iOS_Reject.txt' DIRECT;"

    All the rows are rejected, and below is the error captured in the exceptions file:

    COPY: Input record 1 has been rejected (Invalid integer format 'File s3://samevalbucket/mediadata/UN_MOB_MM_IOS_DATA_25Apr13.CSV saved as './UN_MOB_MM_IOS_DATA_25Apr13.CSV' (106252427 bytes in 2.8 seconds, 35.80 MB/s)' for column 1 (ID)). Please see /tmp/iOS_Reject.txt, record 1 for the rejected record.

    It seems to be taking the terminal output of the 's3cmd get' command. I even tried s3cmd -q (quiet mode), but it still does not load, and in this case even the rejection and exception files are not created. Please let me know if I am missing anything.

    And yes, I referred to the link https://code.google.com/p/s3fs/wiki/FuseOverAmazon and followed http://qugstart.com/blog/linux/how-to-mount-an-amazon-s3-bucket-as-virtual-drive-on-centos-5-2 to mount S3 on the Vertica Linux box.

    P.S.: I am able to get/copy the file from S3 to a local directory with "s3cmd get 's3://samevalbucket/mediadata/UN_MOB_MM_IOS_DATA_25Apr13.CSV'", but am not able to process it in Vertica. - Manju
  • Hi Manju, you're correct: in the situation above, vsql is reading from its stdin / s3cmd's stdout. My apologies, I'm not as familiar with the S3 tools; I assumed that "s3cmd get" outputs the file data to stdout by default. Looking around online, it looks like you can get "s3cmd get" to output to stdout by appending a hyphen to the command:

    s3cmd get 's3://samevalbucket/mediadata/UN_MOB_MM_IOS_DATA_25Apr13.CSV' - | vsql SAMDB -U -w -c "COPY SAM_REP_SCHEMA.UN_MOB_MM_IOS_DATA_TMP FROM STDIN DELIMITER '|' NO ESCAPE NULL AS '\N' EXCEPTIONS '/tmp/iOS_Except.txt' REJECTED DATA '/tmp/iOS_Reject.txt' DIRECT;"

    If that doesn't work, perhaps others here know how to get "s3cmd get" to output to stdout?

    Also, if you did mount S3 locally, you should be able to use that as well. But you will have to continue to follow the instructions at those URLs in order to figure out the correct path to give to COPY. Your previous example used a path that started with 's3://', which is correct for s3cmd, but which isn't the local filesystem path that FUSE generates for Vertica.
  • s3cmd get s3://bucketname/filename temp.txt
    cat temp.txt | vsql -c "copy tablename from STDIN"
    rm temp.txt
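    (An equivalent sketch, feeding the temporary file to vsql by shell redirection instead of cat; same hypothetical bucket and table names as above:)

    s3cmd get s3://bucketname/filename temp.txt
    vsql -c "copy tablename from STDIN" < temp.txt
    rm temp.txt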
  • What options are there to upload data to Vertica, and what are their benefits/limitations? Is S3 the only method? As I understand it, S3 also costs something. Is there some good way to export data from an EC2 PostgreSQL instance to Vertica on EC2 that would be quick and not very costly?
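    One possible approach (a sketch, not from this thread; the host, database, and table names are hypothetical) is to stream straight from PostgreSQL into Vertica without staging the data in S3 at all, which also avoids S3 storage costs:

    # psql writes the table to stdout in PostgreSQL's tab-delimited text format;
    # vsql feeds it to COPY FROM STDIN. Verify delimiter/NULL handling against your data.
    psql -h pg-host -d pgdb -c "\copy public.mytable TO STDOUT" | vsql -h vertica-host -d vdb -c "COPY myschema.mytable FROM STDIN DELIMITER E'\t' NULL AS '\N' DIRECT;"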

  • Why not just mount the S3 bucket? See this example:

    - create mount point /s301 and mount:

    [root@ip-172-31-12-40 /]# /usr/bin/s3fs cf-templates-aghm0qs90bp9-sa-east-1 /s301 -o allow_other -o use_cache=/tmp

    - load / create a csv file on the S3 bucket:
    [dbadmin@ip-172-31-12-40 ~]$ echo "1,adrian" > /s301/data.csv
    [dbadmin@ip-172-31-12-40 ~]$ cat /s301/data.csv
    1,adrian
    [dbadmin@ip-172-31-12-40 ~]$ df -h /s301/
    Filesystem      Size  Used Avail Use% Mounted on
    s3fs            256T     0  256T   0% /s301

    - create table and load data from S3 bucket.

    [dbadmin@ip-172-31-12-40 ~]$ vsql
    Password:
    Welcome to vsql, the Vertica Analytic Database interactive terminal.

    Type: \h or \? for help with vsql commands
    \g or terminate with semicolon to execute query
    \q to quit

    dbadmin=> copy test from '/s301/data.csv' delimiter ',' direct;
    Rows Loaded
    -------------
    1
    (1 row)

    - the rest is up to you! It is important that you have your S3 bucket in the same region as your AWS instances.

  • Loading data into Vertica from an S3 bucket has become very simple. You no longer need to have the data on a node, so there is no need for any of the s3getfile / s3cmd / s3fs workarounds. Starting with version 7.2.2, Vertica ships with the Vertica Library for Amazon Web Services. This library allows you to use the URL of your data file directly in the COPY command.

    I recommend revisiting the architecture of your loader. I am sure you will be able to simplify your current environment significantly, and your COPY commands will be much simpler too.

    I blogged about it recently. Here is the link to a post which will provide you with all the details you need: http://www.dbjungle.com/loading-data-from-aws-s3-bucket-to-vertica-made-easy/
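    For reference, a rough sketch of what that looks like (function and parameter names as described in the 7.2.x AWS library documentation; the credentials, schema, table, and bucket names here are placeholders):

    -- set AWS credentials for the session
    SELECT AWS_SET_CONFIG('aws_id', '<access-key-id>');
    SELECT AWS_SET_CONFIG('aws_secret', '<secret-access-key>');

    -- load straight from the S3 URL via the library's S3 source
    COPY myschema.mytable SOURCE S3(url='s3://mybucket/path/data.csv') DELIMITER '|' DIRECT;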
