Reading GZIP files from S3 into Vertica
[Deleted User]
Administrator
Vertica supports reading GZIP files directly from S3. Here are the commands to make that happen:
$ cd /opt/vertica/sdk/examples/
$ make
$ vsql -f FilterFunctions.sql   (Optional)

/* example provided by Vertica */
dbadmin=> CREATE LIBRARY GZipLib AS '/opt/vertica/sdk/examples/build/GZipLib.so';
dbadmin=> CREATE FILTER GZip AS LANGUAGE 'C++' NAME 'GZipUnpackerFactory' LIBRARY GZipLib;
dbadmin=> COPY DATA SOURCE S3(bucket='s3://external-data/201007*', delimiter='|') FILTER GZip() SKIP 1;
When you install Vertica, the sample file FilterFunctions.sql (also available on GitHub) is stored in the /opt/vertica/sdk/examples/ directory. This file provides a step-by-step guide to using the User-Defined Filter functions with sample data. Optionally, you can use it to test whether the filters installed as part of the C++ SDK are working.
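As a quick sanity check after creating the filter, you can run it against a local gzipped file before pointing it at S3. A minimal sketch, assuming a throwaway table and file (both names are placeholders):

dbadmin=> CREATE TABLE sample_data (id INT, name VARCHAR(50));
dbadmin=> COPY sample_data FROM '/tmp/sample_data.csv.gz' FILTER GZip() DELIMITER ',' SKIP 1;

If the row count comes back as expected, the GZip filter is installed correctly, and the S3 variant should differ only in the source clause.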
Comments
Could you please elaborate on it?
Hi Nitin - The commands above demonstrate how to load gzip-compressed files from S3 directly into Vertica. In /opt/vertica/sdk/examples/FilterFunctions, we provide the BZip and GZip filters, which, once installed, give you the capability to use this feature.
Hope this helps!
This is very useful. I was able to successfully load one gzipped CSV file from a folder in a bucket on S3. However, when trying to load multiple files, it seems to pick up only one of them. Is there something needed to accept multiple gzipped files?
Also sometimes I get this error:
[VP001][5861] [Vertica][VJDBC](5861) ERROR: Error calling plan() in User Function S3SourceFactory at [src/S3.cpp:265], error code: 0, message: The request signature we calculated does not match the signature you provided. Check your key and signing method. java.lang.RuntimeException: com.vertica.support.exceptions.NonTransientException: [Vertica][VJDBC](5861) ERROR: Error calling plan() in User Function S3SourceFactory at [src/S3.cpp:265], error code: 0, message: The request signature we calculated does not match the signature you provided. Check your key and signing method.
Can't seem to figure out what causes this, as it happens randomly for the same query.
@TrovadorSF - Can you post the SQL you used to try to load more than one file?
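For reference, with the example S3 source the way to pick up more than one object is a glob in the bucket parameter, as in the '201007*' pattern at the top of this thread. A sketch with placeholder bucket and table names:

dbadmin=> COPY my_table SOURCE S3(bucket='s3://my-bucket/my-folder/*.gz', delimiter='|') FILTER GZip() SKIP 1;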
This error is total crap. Sometimes I am able to load all my .gz files, and sometimes it stops working just five minutes later. What the heck is this error for?
Error calling plan() in User Function S3SourceFactory at [src/S3.cpp:383], error code: 0, message: Access Denied
S3 UDSource was deprecated in Vertica 9.1 and removed in 9.1.1. Mainly 'cause it sucked.
See:
https://www.vertica.com/docs/9.2.x/HTML/Content/Authoring/AdministratorsGuide/BulkLoadCOPY/AWSLibrary/importAWSlibrary.htm
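For reference, a minimal sketch of loading a gzipped file straight from S3 after the deprecation, assuming session-level credentials and placeholder bucket/table names; the built-in GZIP keyword is used here for decompression instead of the example GZip() filter:

dbadmin=> ALTER SESSION SET AWSAuth='AWS_ACCESS_KEY_ID:AWS_SECRET_ACCESS_KEY';
dbadmin=> COPY my_schema.my_table FROM 's3://my-bucket/path/file.gz' GZIP DELIMITER '|';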
Hey @Jim_Knicely, that's not working either. I tried it.
@sarojrout,
You have to check a few things when you use the new COPY from S3 syntax (a consolidated sketch follows the steps):
1. Set the AWSAuth parameter as shown below:
ALTER SESSION SET AWSAuth='AWS_ACCESS_KEY_ID:AWS_SECRET_ACCESS_KEY';
2. Make sure you have the right region set, as below (change the region name accordingly):
ALTER SESSION SET UDPARAMETER FOR awslib aws_region='us-west-2';
3. Make sure your bucket name is defined in a Vertica storage location:
select * from storage_locations where location_path ilike '%your_bucket_name%';
--If not defined, create the location and grant it to the Vertica user that runs the COPY:
CREATE LOCATION 's3://xxxx...' SHARED USAGE 'USER' LABEL 's3user';
--Grant the location to the Vertica user:
GRANT READ on LOCATION 's3://xxx...' to sys_vertica_xsbg;
4. Make sure you are able to list the files under the bucket from the OS command prompt, as shown below:
[vertica-2a-datanode1 ~]$ aws s3 ls s3://xxx...
You should see the files under your S3 bucket.
5. Now you can try the COPY statement:
COPY SCHEMA_NAME.TABLE_NAME FROM 's3://xxx.../file_name.gz' FILTER GZip() DELIMITER E'\t' NULL AS '\N' REJECTMAX 100 skip 1 DIRECT COMMIT;
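Putting the steps together, a consolidated session might look like the sketch below; every bucket, region, schema, table, and user name is a placeholder to replace with your own values:

dbadmin=> ALTER SESSION SET AWSAuth='AWS_ACCESS_KEY_ID:AWS_SECRET_ACCESS_KEY';
dbadmin=> ALTER SESSION SET UDPARAMETER FOR awslib aws_region='us-west-2';
dbadmin=> CREATE LOCATION 's3://my-bucket/' SHARED USAGE 'USER' LABEL 's3user';
dbadmin=> GRANT READ ON LOCATION 's3://my-bucket/' TO my_vertica_user;
dbadmin=> COPY my_schema.my_table FROM 's3://my-bucket/data/file_name.gz' FILTER GZip() DELIMITER E'\t' NULL AS '\N' REJECTMAX 100 SKIP 1 DIRECT COMMIT;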
Hope this helps!!!
Thanks
Dilip Rachamalla
@sarojrout - Were you able to follow @DilipKumar recommendations?