Error loading data using PloadDelimitedSource

Srivatsan · March 2014

vertica=> select version(); version
------------------------------------
Vertica Analytic Database v6.1.2-0
(1 row)

vertica=> COPY tableaa WITH SOURCE PloadDelimitedSource(file='/tmp/hitdata.tsv', chunk_size=16) DELIMITER E'\t' DIRECT EXCEPTIONS '/tmp/hitdata.exception' REJECTED DATA '/tmp/hitdata.reject' ;ERROR: Error calling prepare() in User Function VPFileSourceFactory at [src/PloadDelimitedFileLib.cpp:282], error code: 0, message: Error in stat() for file [/tmp/hitdata.tsv]
vertica=>

Has anyone encountered this issue?

[Deleted User] · March 2014

Hi Srivatsan,

It sounds to me like the dbadmin user (the Linux user that Vertica runs as) doesn't have the necessary file permissions to read that file. Have you verified that that user has access to the file you're trying to load?

The file could also not be readable for other reasons -- SELinux per-program restrictions, disk corruption, etc.

Adam

Srivatsan · March 2014

Hi Adam-
I'm using RHEL and running this command as "vertica"

Here are the permission details:-
$whoami
vertica

$ ls -lrt /tmp/hitdata.tsv
-rwxrwxrwx 1 vertica verticadba 90687495311 Mar 6 01:27 /tmp/hitdata.tsv

$ file /tmp/hitdata.tsv
/tmp/hit_data.tsv: ASCII text, with very long lines

vertica=> COPY tableaa WITH SOURCE PloadDelimitedSource(file='/tmp/hitdata.tsv', chunk_size=16) DELIMITER E'\t' DIRECT EXCEPTIONS '/tmp/hitdata.exception' REJECTED DATA '/tmp/hitdata.reject' ;ERROR: Error calling prepare() in User Function VPFileSourceFactory at [src/PloadDelimitedFileLib.cpp:282], error code: 0, message: Error in stat() for file [/tmp/hitdata.tsv]
vertica=>

[Deleted User] · March 2014

Hi Srivatsan,

Just for debugging purposes, use the regular copy (not the PloadDelimitedSource) to load the same file . See if it works.

Please post the results as well.

Thanks,

Satish

Srivatsan · March 2014

Yes, that worked with no issues. Here is the result.

Rows Loaded
-------------
53626020
(1 row)

Time: First fetch (1 row): 6901596.759 ms. All rows formatted: 6901596.872 ms

[Deleted User] · March 2014

Ok. sounds good. another quick test. place simple data file ( may be 2 integer columns) in the same directory , create an equivalent table in vertica and load using PloadDelimitedSource . Just to make sure the simple case works ( not a huge file ).

Thanks,

Satish

Srivatsan · March 2014

Hi Satish-
No changes, same errors.

$ cat /tmp/kv.dat
1 one
2 two
3 three
4 four
5 five

vertica=> COPY omniture_new.kv WITH SOURCE PloadDelimitedSource(file='/tmp/kv.dat') DELIMITER E'\t' DIRECT EXCEPTIONS '/tmp/kv.exception' REJECTED DATA '/tmp/kv' ;ERROR: Error calling prepare() in User Function VPFileSourceFactory at [src/PloadDelimitedFileLib.cpp:282], error code: 0, message: Error in stat() for file [/tmp/kv.dat]
vertica=>

Note: Since this was a tiny file, I removed the chunk_size=16 parameter from this run.

[Deleted User] · March 2014

Ok. With my assumption that input file is local to the server , can you try the following. node name can found from "select * from nodes;" .

COPY omniture_new.kv WITH SOURCE PloadDelimitedSource(file='/tmp/kv.dat' , nodes='NODENAME') DELIMITER E'\t' DIRECT EXCEPTIONS '/tmp/kv.exception' REJECTED DATA '/tmp/kv' ;

Srivatsan · March 2014

Hi Satish-
My apologies - i should have said that earlier. They are on shared NAS and not directly on the DB node. I simply changed the paths to "/tmp" for this post.

Does that make a difference? Does the file have to be on the DB node? By the way, the NAS mounted is accessible by the DB nodes.

[Deleted User] · March 2014

Ok , no problem. If it is shared across all the DB nodes as NAS , then we can use 'ALL NODES' option for the nodes parameter.

The above command I posted should work, can you try ?

Srivatsan · March 2014

vertica=> COPY omniture_new.kv WITH SOURCE PloadDelimitedSource(file='/ora_mnt/kv.dat' , nodes='ALL NODES') DELIMITER E'\t' DIRECT EXCEPTIONS '/tmp/kv.exception' REJECTED DATA '/tmp/kv.rejects' ;ERROR: Error calling prepare() in User Function VPFileSourceFactory at [src/PloadDelimitedFileLib.cpp:282], error code: 0, message: Error in stat() for file [/ora_mnt/kv.dat]

vertica=>

[Deleted User] · March 2014

hmm. just to make sure. when you tried the regular copy the file was still on the NAS location ?

can you also try this ?

COPY omniture_new.kv WITH SOURCE PloadDelimitedSource(file='/tmp/kv.dat' , nodes='CURRENT NODENAME') DELIMITER E'\t' DIRECT EXCEPTIONS '/tmp/kv.exception' REJECTED DATA '/tmp/kv' ;

Srivatsan · March 2014

Satish-
You were right. NAS was plugged out of the vertica hosts and hence this problem. I tried your command on the local node and it works!

vertica=>vertica=> COPY omniture_new.kv WITH SOURCE PloadDelimitedSource(file='/home/PORTAL/sramanujam/data/kv.dat' , nodes='v_webmd_node0001') DELIMITER E'\t' DIRECT EXCEPTIONS '/tmp/kv.exception' REJECTED DATA '/tmp/kv.rejects' ;
Rows Loaded
-------------
5
(1 row)

Time: First fetch (1 row): 116.687 ms. All rows formatted: 116.785 ms
vertica=>

I have a question for larger loads though - how does chunk_size help(?) in the load performance? How do i set it for a large file (lets say 10G)

[Deleted User] · March 2014

cool .

The chunk size parameter is measured in MB (so chunk_size=32 is 32MB) and accepts values from 32MB up to 32GB. It defines how big of a chunk of the main file each thread gets. You don't want to set your chunk_size any higher than file_size / thread_count

no of threads is usually no of cores .

at the same time you don't want to set the chunk size to a much higher value for performance issues. it is relative and depends on your system configuration ( mem,no of cores ,file size etc.,)

Srivatsan · March 2014

Thank you Satish, you've been very patient and helpful!

Srivatsan · April 2014

For some reason COPY with Pload seems to be much slower than the "glob" version of COPY. Here is my command to load a 80GB file (with ~55M rows)

COPY myschema.table WITH SOURCE PloadDelimitedSource(file='/path_to_data/data.tsv', chunk_size=1024, nodes='ALL NODES') DELIMITER E'\t';

The above command took 5 hours to complete.

When i used chunk_size=100 and nodes='ANY NODES' it took 27 hours to complete.

I am running this on a non-vertica DB server where the data file resides on NAS that is accessible/visible from the DB hosts.

Any pointers/suggestions to speed this up? Kindly advise.

We're Moving!

Create My New Community Account Now

Error loading data using PloadDelimitedSource

Comments

Leave a Comment