Vertica Integration with Hadoop
Hi Experts,
Currently I am facing an issue with Vertica integration with Hadoop.
We have a requirement to get data from Hadoop (HDFS).
Is there any way to export data from there?
Can you provide simple steps to configure this in Vertica 9.1 Enterprise?
Also, we have to keep the data in sync with HDFS: whenever data is updated in HDFS, the same changes should be reflected in Vertica as well.
Thanks in advance !!
Mujeef
Best Answers
LenoyJ - Employee
Referring you to our documentation that goes through all this in much detail: https://www.vertica.com/docs/9.2.x/HTML/Content/Authoring/HadoopIntegrationGuide/HadoopIntegrationGuide.htm
Basically:
- Locate your HDFS config files: core-site.xml and hdfs-site.xml.
- Copy them over to every Vertica node you have, at the same path. Say, /home/dbadmin.
- Run vsql and point the database to that directory:
  ALTER DATABASE dbname SET HadoopConfDir = '/home/dbadmin';
- If you are on 8.1 or later, verify that the configuration is okay with:
  SELECT VERIFY_HADOOP_CONF_DIR();
- If you are on 9.2 or later, a better verification function is available:
  SELECT HDFS_CLUSTER_CONFIG_CHECK();
- Then use the hdfs:/// scheme with COPY to get data from HDFS. See: https://www.vertica.com/docs/9.2.x/HTML/Content/Authoring/HadoopIntegrationGuide/libhdfs/HdfsURL.htm
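For example, once HadoopConfDir is set, a minimal COPY sketch could look like the following (the table name staging_data and the HDFS path are placeholders, not from this thread):
-- Sketch: load a comma-delimited file from HDFS into an existing table.
-- Replace staging_data and the hdfs:/// path with your own table and file.
COPY staging_data FROM 'hdfs:///data/sample.csv' DELIMITER ',';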
Mujeef ✭
Hi Lenoy,
Thank you very much for your response.
I followed all of the above steps successfully.
While running the below statement, it failed:
CREATE EXTERNAL TABLE public.test_data (col1 varchar(200),col2 varchar(200),col3 varchar(200),col4 varchar(200))
AS COPY FROM 'hdfs:///test/.txt' DELIMITER ',';
Error: [Vertica][VJDBC] ERROR: Failed to glob [hdfs:///mujju/.txt] because of error: hdfs:///test/: listStat failed; error: End of file
Can you suggest?
Thanks,
Mujeef
SruthiA Administrator
@Mujeef, what you are seeing with regard to EXPORT TO PARQUET is expected. From the documentation:
"The destination directory for the Parquet files. The directory must not exist, and the current user must have permission to write it. The destination can be on HDFS, S3, or an NFS mount point on the local file system."
https://www.vertica.com/docs/9.3.x/HTML/Content/Authoring/SQLReferenceManual/Statements/EXPORTTOPARQUET.htm
Please find the below link, which describes how to export data from Vertica to S3, the local file system, etc.:
https://www.vertica.com/docs/9.3.x/HTML/Content/Authoring/HadoopIntegrationGuide/NativeFormats/ExportingData.htm
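Since the destination directory must not already exist, one common approach is to export each run into a new, uniquely named subdirectory. A rough sketch (the path and table name below are placeholders):
-- Sketch: write each export to a directory that does not exist yet,
-- for example one named per run or per date. Path and table are placeholders.
EXPORT TO PARQUET(directory = 'webhdfs://namenode:9870/data/export_run_01')
AS SELECT * FROM my_schema.my_table;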
Answers
Could you please run a sample curl test to see if we can access the Hadoop cluster? Please replace the filename, path, and NameNode value accordingly.
curl -i -L http://hadoopNameNode:50070/webhdfs/v1/tmp/test.txt?op=OPEN
Hi Sruthi,
I am unable to open the link you provided.
Do I need to set up a passwordless connection between the Vertica nodes and the Hadoop cluster?
It would be helpful if you could provide steps to configure Vertica with Hadoop.
Thanks,
Mujeef
Hi Mujeef,
It is not a link. Please replace the NameNode, path, and file name in the below curl command, run it on a Vertica node, and share the output with me.
curl -i -L http://hadoopNameNode:50070/webhdfs/v1/tmp/test.txt?op=OPEN
Hi Sruthi,
We have CDH 6.1.1. Can we integrate it with Vertica 9.1 or 9.2?
Thanks,
Mujeef
Hi Mujeef,
The latest version we support and have officially tested is CDH 6.0 on Vertica 9.2. You can integrate with 6.1.1 and it should work in general; if you face any issues, please open a support case.
https://www.vertica.com/docs/9.2.x/HTML/Content/Authoring/SupportedPlatforms/IntegrationsForHadoop.htm
Sruthi
Hi Sruthi,
Thanks for your quick response.
Just one question here:
I need real-time synchronization from Hadoop to Vertica.
I mean, whatever changes are done on HDFS should be updated in the Vertica database as well.
Is there a way we can achieve this?
Thanks,
Mujeef Shaikh
Hello Mujeef,
You can use similar steps to those described above to expose the HDFS data to Vertica, i.e.:
CREATE EXTERNAL TABLE test_data (col1 varchar(200), col2 varchar(200), col3 varchar(200), col4 varchar(200)) AS COPY FROM 'hdfs:///test/*.txt' DELIMITER ',';
This means that data in any "txt" file in the "/test" HDFS folder will be automatically visible via the table "test_data". Also, if another txt file is added to that folder, or existing files are dropped or amended, those changes will be reflected in the table.
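As a quick check (just a sketch using the table defined above), the same query can be run before and after files change under /test, and the results will move with HDFS:
-- The external table reads the HDFS files at query time, so the count
-- changes as .txt files are added, amended, or removed under /test.
SELECT COUNT(*) FROM test_data;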
Hi mflower,
Thanks for your valuable guidance.
So what happens when we run an incremental load on the same tables from HDFS?
How will it perform?
Will it be a full load every time?
Can you please give more guidance on this?
Thanks,
Mujeef
Hello Mujeef,
The external table provides a link to the external data source.
Here's a link to more information: https://my.vertica.com/docs/latest/HTML/Content/Authoring/AdministratorsGuide/Tables/ExternalTables/CreatingExternalTables.htm
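If re-reading every file on each query becomes too slow, one option (a sketch, not something prescribed in this thread; the native table name is hypothetical) is to materialize the external table's data into a regular Vertica table and refresh it on a schedule:
-- Hypothetical native table with the same layout as the external table.
CREATE TABLE test_data_native (
    col1 varchar(200),
    col2 varchar(200),
    col3 varchar(200),
    col4 varchar(200)
);
-- Periodic refresh: a full reload is shown here; adding a WHERE clause on a
-- date or batch column would make the load incremental.
TRUNCATE TABLE test_data_native;
INSERT INTO test_data_native SELECT * FROM test_data;
COMMIT;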
Hello Mflower,
I have made a successful connection between CDH 6.1.1 and Vertica 9.2.1.
I am also able to access data from CDH in Vertica.
I am exporting data from Vertica to Hadoop using the EXPORT TO PARQUET statement.
Is there any other way to export data from Vertica to Hadoop?
Because every time I perform the export, it creates a new directory for the exported objects.
I am using the below statement for reference:
EXPORT TO PARQUET(directory = 'webhdfs://x.x.x.x:9870/data/',
    fileMode = '432', dirMode = 'rwxrw-r-x')
AS SELECT * FROM hadoop.test_data9;
Can you please suggest a better way to do this?
Thanks in advance.
Mujeef Shaikh