Importing from PARQUET files into Vertica from several nodes

Hello
I found some posts about this topic but as they are quite old, I prefer opening a new one.
Our Vertica version is v8.1.
We export data to PARQUET files, but we are not able to import them back with the COPY commands described in the documentation to handle several nodes

1) with all nodes between parenthesis
COPY staging.tbname FROM '/data/DEVTools/Archive/fileexport_20180115_130128/.parquet' ON (db_node0001,db_node0002, db_node0003) PARQUET;

--> error : [Code: 6764, SQL State: 58V01] [Vertica]VJDBC ERROR: Error opening file [/data/DEVTools/Archive/fileexport_20180115_130128Zhh8C5NT/15a5612e-db_node0003-139971123980032.parquet] for read: No such file or directory

2) with ON ANY NODE
COPY staging.tbname FROM '/data/DEVTools/Archive/fileexport_20180115_130128/.parquet' ON any node PARQUET;

**--> error : [Code: 6764, SQL State: 58V01] [Vertica]VJDBC ERROR: Error opening file [/data/DEVTools/Archive/CR14307891_CRM.fileexport_20180115_130128Zhh8C5NT/15a5612e-db_node0003-139971123980032.parquet] for read: No such file or directory

3) If we issue the 3 COPY commands, it works
COPY staging.tbname FROM '/data/DEVTools/Archive/fileexport_20180115_130128/.parquet' ON db_node0001 PARQUET;
COPY staging.tbname FROM '/data/DEVTools/Archive/fileexport_20180115_130128/.parquet' ON db_node0002 PARQUET;
COPY staging.tbname FROM '/data/DEVTools/Archive/fileexport_20180115_130128/.parquet' ON db_node0003 PARQUET;

Does someone have a clue about what could be wrong with the 2 first commands ?
All are issued with a user who is entitled to read all the files on all the nodes

Thanks
Isa

Comments

  • I don't know why, the '*'have disappeared from the file names in my above post :/
    Pls read them as :

    '/data/DEVTools/Archive/fileexport_20180115_130128* /.parquet'

  • Jim_KnicelyJim_Knicely Administrator

    Can you post the EXPORT TO PARQUET command you used?

  • Here it is :
    EXPORT TO PARQUET(directory = '/data/DEVTools/Archive/fileexport_20180115_130128') AS SELECT * FROM staging.tbname)

    On the node where the export is executed, the directory is created with the given name, but on other nodes, a suffix is added by the Export process .
    This is why we put a "*" (star) in the directory name, to be sure we don't miss this suffix at import time

  • Jim_KnicelyJim_Knicely Administrator
    edited January 2018

    Try not telling Vertica a node...

    COPY staging.tbname FROM '/data/DEVTools/Archive/fileexport_20180115_130128*/*.parquet' PARQUET;

    The "ON ANY NODE" parameter specifies that the source file to load is available on all of the nodes. In your case, each individual parquet file is only on one node.

  • Yes, this imports only data from one node (the current one).

    The table tbname contains 24171655 rows.
    When executing the 3 COPY commands we get :

    [COPY - 8052971 rows, 10.325 secs] Command processed
    [COPY - 8058623 rows, 11.775 secs] Command processed
    [COPY - 8060061 rows, 17.522 secs] Command processed
    select count(*) from staging.tbname returns 24171656 rows

    When executing the command you suggest, we get
    11:44:16 [COPY - 8058623 rows, 12.570 secs]
    this is confirmed with a select count(*)

  • Jim_KnicelyJim_Knicely Administrator
    edited January 2018

    Hi,

    I think you have no choice but to execute three copy commands because the files are not the same on each node, thus the ON ANY NODE option is out. Plus the ERROR TOLERANCE feature is not available for PARQUET files.

    You could create an NFS mount accessible to each node and export to parquet there, then the ON ANY NODE parameter will work.

  • Jim_KnicelyJim_Knicely Administrator
    edited January 2018

    Assuming there are files on each node, I believe the following command will work for you as a single COPY:

    COPY staging.tbname FROM '/data/DEVTools/Archive/fileexport_20180115_130128*/*.parquet' ON db_node0001, '/data/DEVTools/Archive/fileexport_20180115_130128*/*.parquet' ON db_node0002, '/data/DEVTools/Archive/fileexport_20180115_130128*/*.parquet' ON db_node0003 PARQUET;
    

    Just make sure all of those directory paths are correct..

  • Perfect ! It works fine and quicker than the 3 commands !

    Thanks a lot

  • ... but in any case we need to know all the nodes...it is not a very generic command... we'll live with it.

Leave a Comment

BoldItalicStrikethroughOrdered listUnordered list
Emoji
Image
Align leftAlign centerAlign rightToggle HTML viewToggle full pageToggle lights
Drop image/file