Importing from PARQUET files into Vertica from several nodes

Isabelle_Vela · January 2018

Hello
I found some posts about this topic but as they are quite old, I prefer opening a new one.
Our Vertica version is v8.1.
We export data to PARQUET files, but we are not able to import them back with the COPY commands described in the documentation to handle several nodes

1) with all nodes between parenthesis
COPY staging.tbname FROM '/data/DEVTools/Archive/fileexport_20180115_130128/.parquet' ON (db_node0001,db_node0002, db_node0003) PARQUET;

--> error : [Code: 6764, SQL State: 58V01] [Vertica]VJDBC ERROR: Error opening file [/data/DEVTools/Archive/fileexport_20180115_130128Zhh8C5NT/15a5612e-db_node0003-139971123980032.parquet] for read: No such file or directory

2) with ON ANY NODE
COPY staging.tbname FROM '/data/DEVTools/Archive/fileexport_20180115_130128/.parquet' ON any node PARQUET;

**--> error : [Code: 6764, SQL State: 58V01] [Vertica]VJDBC ERROR: Error opening file [/data/DEVTools/Archive/CR14307891_CRM.fileexport_20180115_130128Zhh8C5NT/15a5612e-db_node0003-139971123980032.parquet] for read: No such file or directory

3) If we issue the 3 COPY commands, it works
COPY staging.tbname FROM '/data/DEVTools/Archive/fileexport_20180115_130128/.parquet' ON db_node0001 PARQUET;
COPY staging.tbname FROM '/data/DEVTools/Archive/fileexport_20180115_130128/.parquet' ON db_node0002 PARQUET;
COPY staging.tbname FROM '/data/DEVTools/Archive/fileexport_20180115_130128/.parquet' ON db_node0003 PARQUET;

Does someone have a clue about what could be wrong with the 2 first commands ?
All are issued with a user who is entitled to read all the files on all the nodes

Thanks
Isa

Isabelle_Vela · January 2018

I don't know why, the '*'have disappeared from the file names in my above post
Pls read them as :

'/data/DEVTools/Archive/fileexport_20180115_130128* /.parquet'

Jim_Knicely · January 2018

Can you post the EXPORT TO PARQUET command you used?

Isabelle_Vela · January 2018

Here it is :
EXPORT TO PARQUET(directory = '/data/DEVTools/Archive/fileexport_20180115_130128') AS SELECT * FROM staging.tbname)

On the node where the export is executed, the directory is created with the given name, but on other nodes, a suffix is added by the Export process .
This is why we put a "*" (star) in the directory name, to be sure we don't miss this suffix at import time

Jim_Knicely · January 2018

Try not telling Vertica a node...

COPY staging.tbname FROM '/data/DEVTools/Archive/fileexport_20180115_130128*/*.parquet' PARQUET;

The "ON ANY NODE" parameter specifies that the source file to load is available on all of the nodes. In your case, each individual parquet file is only on one node.

Isabelle_Vela · January 2018

Yes, this imports only data from one node (the current one).

The table tbname contains 24171655 rows.
When executing the 3 COPY commands we get :

[COPY - 8052971 rows, 10.325 secs] Command processed
[COPY - 8058623 rows, 11.775 secs] Command processed
[COPY - 8060061 rows, 17.522 secs] Command processed
select count(*) from staging.tbname returns 24171656 rows

When executing the command you suggest, we get
11:44:16 [COPY - 8058623 rows, 12.570 secs]
this is confirmed with a select count(*)

Jim_Knicely · January 2018

Hi,

I think you have no choice but to execute three copy commands because the files are not the same on each node, thus the ON ANY NODE option is out. Plus the ERROR TOLERANCE feature is not available for PARQUET files.

You could create an NFS mount accessible to each node and export to parquet there, then the ON ANY NODE parameter will work.

Jim_Knicely · January 2018

Assuming there are files on each node, I believe the following command will work for you as a single COPY:

COPY staging.tbname FROM '/data/DEVTools/Archive/fileexport_20180115_130128*/*.parquet' ON db_node0001, '/data/DEVTools/Archive/fileexport_20180115_130128*/*.parquet' ON db_node0002, '/data/DEVTools/Archive/fileexport_20180115_130128*/*.parquet' ON db_node0003 PARQUET;

Just make sure all of those directory paths are correct..

Isabelle_Vela · January 2018

Perfect ! It works fine and quicker than the 3 commands !

Thanks a lot

Isabelle_Vela · January 2018

... but in any case we need to know all the nodes...it is not a very generic command... we'll live with it.

We're Moving!

Create My New Community Account Now

Importing from PARQUET files into Vertica from several nodes

Comments

Leave a Comment