Importing from PARQUET files into Vertica from several nodes
Hello
I found some posts about this topic but as they are quite old, I prefer opening a new one.
Our Vertica version is v8.1.
We export data to PARQUET files, but we are not able to import them back with the COPY commands described in the documentation to handle several nodes
1) with all nodes between parenthesis
COPY staging.tbname FROM '/data/DEVTools/Archive/fileexport_20180115_130128/.parquet' ON (db_node0001,db_node0002, db_node0003) PARQUET;
--> error : [Code: 6764, SQL State: 58V01] [Vertica]VJDBC ERROR: Error opening file [/data/DEVTools/Archive/fileexport_20180115_130128Zhh8C5NT/15a5612e-db_node0003-139971123980032.parquet] for read: No such file or directory
2) with ON ANY NODE
COPY staging.tbname FROM '/data/DEVTools/Archive/fileexport_20180115_130128/.parquet' ON any node PARQUET;
**--> error : [Code: 6764, SQL State: 58V01] [Vertica]VJDBC ERROR: Error opening file [/data/DEVTools/Archive/CR14307891_CRM.fileexport_20180115_130128Zhh8C5NT/15a5612e-db_node0003-139971123980032.parquet] for read: No such file or directory
3) If we issue the 3 COPY commands, it works
COPY staging.tbname FROM '/data/DEVTools/Archive/fileexport_20180115_130128/.parquet' ON db_node0001 PARQUET;
COPY staging.tbname FROM '/data/DEVTools/Archive/fileexport_20180115_130128/.parquet' ON db_node0002 PARQUET;
COPY staging.tbname FROM '/data/DEVTools/Archive/fileexport_20180115_130128/.parquet' ON db_node0003 PARQUET;
Does someone have a clue about what could be wrong with the 2 first commands ?
All are issued with a user who is entitled to read all the files on all the nodes
Thanks
Isa
Comments
I don't know why, the '*'have disappeared from the file names in my above post
Pls read them as :
Can you post the EXPORT TO PARQUET command you used?
Here it is :
EXPORT TO PARQUET(directory = '/data/DEVTools/Archive/fileexport_20180115_130128') AS SELECT * FROM staging.tbname)
On the node where the export is executed, the directory is created with the given name, but on other nodes, a suffix is added by the Export process .
This is why we put a "*" (star) in the directory name, to be sure we don't miss this suffix at import time
Try not telling Vertica a node...
COPY staging.tbname FROM '/data/DEVTools/Archive/fileexport_20180115_130128*/*.parquet' PARQUET;
The "ON ANY NODE" parameter specifies that the source file to load is available on all of the nodes. In your case, each individual parquet file is only on one node.
Yes, this imports only data from one node (the current one).
The table tbname contains 24171655 rows.
When executing the 3 COPY commands we get :
[COPY - 8052971 rows, 10.325 secs] Command processed
[COPY - 8058623 rows, 11.775 secs] Command processed
[COPY - 8060061 rows, 17.522 secs] Command processed
select count(*) from staging.tbname returns 24171656 rows
When executing the command you suggest, we get
11:44:16 [COPY - 8058623 rows, 12.570 secs]
this is confirmed with a select count(*)
Hi,
I think you have no choice but to execute three copy commands because the files are not the same on each node, thus the ON ANY NODE option is out. Plus the ERROR TOLERANCE feature is not available for PARQUET files.
You could create an NFS mount accessible to each node and export to parquet there, then the ON ANY NODE parameter will work.
Assuming there are files on each node, I believe the following command will work for you as a single COPY:
Just make sure all of those directory paths are correct..
Perfect ! It works fine and quicker than the 3 commands !
Thanks a lot
... but in any case we need to know all the nodes...it is not a very generic command... we'll live with it.