Error loading data using PloadDelimitedSource

vertica=> select version();
              version
------------------------------------
 Vertica Analytic Database v6.1.2-0
(1 row)

vertica=> COPY tableaa WITH SOURCE PloadDelimitedSource(file='/tmp/hitdata.tsv', chunk_size=16) DELIMITER E'\t' DIRECT EXCEPTIONS '/tmp/hitdata.exception' REJECTED DATA '/tmp/hitdata.reject' ;
ERROR:  Error calling prepare() in User Function VPFileSourceFactory at [src/PloadDelimitedFileLib.cpp:282], error code: 0, message: Error in stat() for file [/tmp/hitdata.tsv]
vertica=>

Has anyone encountered this issue?

Comments

  • Hi Srivatsan,

    It sounds to me like the dbadmin user (the Linux user that Vertica runs as) doesn't have the necessary file permissions to read that file.  Have you verified that that user has access to the file you're trying to load?

    The file could also be unreadable for other reasons: SELinux per-program restrictions, disk corruption, etc.

    Adam
  • Hi Adam,
    I'm using RHEL and running this command as "vertica".

    Here are the permission details:

    $ whoami
    vertica

    $ ls -lrt /tmp/hitdata.tsv
    -rwxrwxrwx 1 vertica verticadba 90687495311 Mar  6 01:27 /tmp/hitdata.tsv

    $ file /tmp/hitdata.tsv
    /tmp/hit_data.tsv: ASCII text, with very long lines

    vertica=> COPY tableaa WITH SOURCE PloadDelimitedSource(file='/tmp/hitdata.tsv', chunk_size=16) DELIMITER E'\t' DIRECT EXCEPTIONS '/tmp/hitdata.exception' REJECTED DATA '/tmp/hitdata.reject' ;
    ERROR:  Error calling prepare() in User Function VPFileSourceFactory at [src/PloadDelimitedFileLib.cpp:282], error code: 0, message: Error in stat() for file [/tmp/hitdata.tsv]
    vertica=>

  • Hi Srivatsan,

    Just for debugging purposes, use the regular COPY (not PloadDelimitedSource) to load the same file and see if it works (a sketch of the equivalent command is at the end of this comment).

    Please post the results as well. 

    Thanks,

    Satish
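
    For reference, a plain COPY equivalent to the failing Pload command would look something like this (same table, paths, and options as in the original post):

    COPY tableaa FROM '/tmp/hitdata.tsv'
        DELIMITER E'\t' DIRECT
        EXCEPTIONS '/tmp/hitdata.exception'
        REJECTED DATA '/tmp/hitdata.reject';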
  • Yes, that worked with no issues. Here is the result.

     Rows Loaded
    -------------
        53626020
    (1 row)

    Time: First fetch (1 row): 6901596.759 ms. All rows formatted: 6901596.872 ms
  • OK, sounds good. One more quick test: place a simple data file (maybe two integer columns) in the same directory, create an equivalent table in Vertica, and load it using PloadDelimitedSource, just to make sure the simple case works with a small file (a minimal sketch follows this comment).

    Thanks, 

    Satish
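
    A minimal version of that test could look like the following sketch (the table definition is an assumption; the poster's actual test with omniture_new.kv appears in the next comment):

    CREATE TABLE kv (k INT, v VARCHAR(20));
    COPY kv WITH SOURCE PloadDelimitedSource(file='/tmp/kv.dat')
        DELIMITER E'\t' DIRECT;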


  • Hi Satish,
    No change, same error.

    $ cat /tmp/kv.dat
    1       one
    2       two
    3       three
    4       four
    5       five

    vertica=> COPY omniture_new.kv WITH SOURCE PloadDelimitedSource(file='/tmp/kv.dat') DELIMITER E'\t' DIRECT EXCEPTIONS '/tmp/kv.exception' REJECTED DATA '/tmp/kv' ;
    ERROR:  Error calling prepare() in User Function VPFileSourceFactory at [src/PloadDelimitedFileLib.cpp:282], error code: 0, message: Error in stat() for file [/tmp/kv.dat]
    vertica=>


    Note: Since this was a tiny file, I removed the chunk_size=16 parameter from this run.


  • OK. On the assumption that the input file is local to the server, can you try the following? The node name can be found from "select * from nodes;" (see the example below).

    COPY omniture_new.kv WITH SOURCE PloadDelimitedSource(file='/tmp/kv.dat' , nodes='NODENAME') DELIMITER E'\t' DIRECT EXCEPTIONS '/tmp/kv.exception' REJECTED DATA '/tmp/kv' ;
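
    To look up the node name for the nodes parameter, query the nodes system table, for example:

    SELECT node_name, node_state FROM nodes;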


  • Hi Satish,
    My apologies, I should have mentioned this earlier: the files are on a shared NAS, not directly on the DB node. I simply changed the paths to "/tmp" for this post.

    Does that make a difference? Does the file have to be on the DB node? For what it's worth, the NAS mount is accessible from the DB nodes.
  • OK, no problem. If it is shared across all the DB nodes via NAS, then we can use the 'ALL NODES' option for the nodes parameter.

    The command I posted above should work with that change; can you try it?


  • vertica=> COPY omniture_new.kv WITH SOURCE PloadDelimitedSource(file='/ora_mnt/kv.dat' , nodes='ALL NODES') DELIMITER E'\t' DIRECT EXCEPTIONS '/tmp/kv.exception' REJECTED DATA '/tmp/kv.rejects' ;
    ERROR:  Error calling prepare() in User Function VPFileSourceFactory at [src/PloadDelimitedFileLib.cpp:282], error code: 0, message: Error in stat() for file [/ora_mnt/kv.dat]

    vertica=>

  • Hmm, just to make sure: when you tried the regular COPY, was the file still in the NAS location?

    Can you also try this?

    COPY omniture_new.kv WITH SOURCE PloadDelimitedSource(file='/tmp/kv.dat' , nodes='CURRENT NODENAME') DELIMITER E'\t' DIRECT EXCEPTIONS '/tmp/kv.exception' REJECTED DATA '/tmp/kv' ;
  • Satish,
    You were right: the NAS had been disconnected from the Vertica hosts, hence the problem. I tried your command with the file on the local node and it works!

    vertica=> COPY omniture_new.kv WITH SOURCE PloadDelimitedSource(file='/home/PORTAL/sramanujam/data/kv.dat' , nodes='v_webmd_node0001') DELIMITER E'\t' DIRECT EXCEPTIONS '/tmp/kv.exception' REJECTED DATA '/tmp/kv.rejects' ;
     Rows Loaded
    -------------
               5
    (1 row)

    Time: First fetch (1 row): 116.687 ms. All rows formatted: 116.785 ms
    vertica=>

    I have a question about larger loads, though: how does chunk_size affect load performance, and how do I set it for a large file (let's say 10GB)?
  • Cool.

    The chunk_size parameter is measured in MB (so chunk_size=32 is 32MB) and accepts values from 32MB up to 32GB. It defines how big a chunk of the main file each thread gets. You don't want to set chunk_size any higher than file_size / thread_count.

    The number of threads is usually the number of cores.

    At the same time, you don't want to set the chunk size much higher than that, or performance suffers. The right value is relative and depends on your system configuration (memory, number of cores, file size, etc.).
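
    Applying that rule of thumb to the 10GB file asked about above: on nodes with 16 cores (an assumed figure), chunk_size should be at most 10240 / 16 = 640, giving something like this sketch (table name and path are placeholders):

    COPY myschema.big_table WITH SOURCE
        PloadDelimitedSource(file='/path/to/10gb_file.tsv', chunk_size=640, nodes='ALL NODES')
        DELIMITER E'\t' DIRECT;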


  • Thank you Satish, you've been very patient and helpful!
  • For some reason, COPY with Pload seems to be much slower than the "glob" version of COPY. Here is my command to load an 80GB file (~55M rows):

    COPY myschema.table WITH SOURCE PloadDelimitedSource(file='/path_to_data/data.tsv', chunk_size=1024, nodes='ALL NODES') DELIMITER E'\t';

    The above command took 5 hours to complete.

    When I used chunk_size=100 and nodes='ANY NODES', it took 27 hours to complete.

    I am running this on a non-Vertica DB server; the data file resides on a NAS that is accessible/visible from the DB hosts.

    Any pointers/suggestions to speed this up? Kindly advise.
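
    For reference, the "glob" form of COPY mentioned above is the built-in COPY with a wildcard path; one common pattern (assuming the large file is first split into pieces, and with placeholder names) is:

    COPY myschema.mytable FROM '/path_to_data/chunks/*.tsv' ON ANY NODE
        DELIMITER E'\t' DIRECT;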
