Error loading data using PloadDelimitedSource
vertica=> select version(); version
------------------------------------
Vertica Analytic Database v6.1.2-0
(1 row)
vertica=> COPY tableaa WITH SOURCE PloadDelimitedSource(file='/tmp/hitdata.tsv', chunk_size=16) DELIMITER E'\t' DIRECT EXCEPTIONS '/tmp/hitdata.exception' REJECTED DATA '/tmp/hitdata.reject' ;
ERROR: Error calling prepare() in User Function VPFileSourceFactory at [src/PloadDelimitedFileLib.cpp:282], error code: 0, message: Error in stat() for file [/tmp/hitdata.tsv]
vertica=>
Has anyone encountered this issue?
Comments
It sounds to me like the dbadmin user (the Linux user that Vertica runs as) doesn't have the necessary file permissions to read that file. Have you verified that that user has access to the file you're trying to load?
The file could also not be readable for other reasons -- SELinux per-program restrictions, disk corruption, etc.
Adam
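The loader's prepare() error is literally a failed stat() call, so the same checks can be reproduced by hand. A minimal sketch, using a hypothetical scratch path (point FILE at the path from the failing COPY instead):

```shell
# Reproduce the two checks prepare() makes: stat() on the path,
# and read access for the OS user Vertica runs as.
FILE=/tmp/pload_perm_check.tsv
printf '1\tone\n' > "$FILE"   # scratch file for the demo only

stat "$FILE" >/dev/null && echo "stat ok"          # the step that fails in the error above
test -r "$FILE" && echo "readable by $(whoami)"    # read permission for this user
rm -f "$FILE"
```

Run this as the same Linux user the database runs as; if stat itself fails, the path simply isn't visible to that user on that host.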
I'm using RHEL and running this command as "vertica"
Here are the permission details:
$whoami
vertica
$ ls -lrt /tmp/hitdata.tsv
-rwxrwxrwx 1 vertica verticadba 90687495311 Mar 6 01:27 /tmp/hitdata.tsv
$ file /tmp/hitdata.tsv
/tmp/hitdata.tsv: ASCII text, with very long lines
vertica=> COPY tableaa WITH SOURCE PloadDelimitedSource(file='/tmp/hitdata.tsv', chunk_size=16) DELIMITER E'\t' DIRECT EXCEPTIONS '/tmp/hitdata.exception' REJECTED DATA '/tmp/hitdata.reject' ;
ERROR: Error calling prepare() in User Function VPFileSourceFactory at [src/PloadDelimitedFileLib.cpp:282], error code: 0, message: Error in stat() for file [/tmp/hitdata.tsv]
vertica=>
Just for debugging purposes, use a regular COPY (not PloadDelimitedSource) to load the same file and see if it works. Please post the results as well.
Thanks,
Satish
Rows Loaded
-------------
53626020
(1 row)
Time: First fetch (1 row): 6901596.759 ms. All rows formatted: 6901596.872 ms
Thanks,
Satish
No changes, same errors.
$ cat /tmp/kv.dat
1 one
2 two
3 three
4 four
5 five
vertica=> COPY omniture_new.kv WITH SOURCE PloadDelimitedSource(file='/tmp/kv.dat') DELIMITER E'\t' DIRECT EXCEPTIONS '/tmp/kv.exception' REJECTED DATA '/tmp/kv' ;
ERROR: Error calling prepare() in User Function VPFileSourceFactory at [src/PloadDelimitedFileLib.cpp:282], error code: 0, message: Error in stat() for file [/tmp/kv.dat]
vertica=>
Note: Since this was a tiny file, I removed the chunk_size=16 parameter from this run.
COPY omniture_new.kv WITH SOURCE PloadDelimitedSource(file='/tmp/kv.dat' , nodes='NODENAME') DELIMITER E'\t' DIRECT EXCEPTIONS '/tmp/kv.exception' REJECTED DATA '/tmp/kv' ;
My apologies, I should have said that earlier. The files are on a shared NAS, not directly on the DB node; I simply changed the paths to "/tmp" for this post.
Does that make a difference? Does the file have to be on the DB node? By the way, the mounted NAS is accessible from the DB nodes.
The command I posted above should work; can you try it?
Can you also try this?
COPY omniture_new.kv WITH SOURCE PloadDelimitedSource(file='/tmp/kv.dat' , nodes='CURRENT NODENAME') DELIMITER E'\t' DIRECT EXCEPTIONS '/tmp/kv.exception' REJECTED DATA '/tmp/kv' ;
You were right. The NAS had been unmounted from the Vertica hosts, hence this problem. I tried your command on the local node and it works!
vertica=> COPY omniture_new.kv WITH SOURCE PloadDelimitedSource(file='/home/PORTAL/sramanujam/data/kv.dat' , nodes='v_webmd_node0001') DELIMITER E'\t' DIRECT EXCEPTIONS '/tmp/kv.exception' REJECTED DATA '/tmp/kv.rejects' ;
Rows Loaded
-------------
5
(1 row)
Time: First fetch (1 row): 116.687 ms. All rows formatted: 116.785 ms
vertica=>
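Since the failure came down to the NAS not being mounted on the hosts, a per-node visibility check catches this before running COPY with nodes='ALL NODES'. A sketch (simulated locally with a scratch file; on a real cluster the NAS file already exists and you'd run the same test on every node):

```shell
# Every node must report "ok" for the shared path before a
# multi-node parallel load can read it.
FILE=/tmp/pload_visibility_check.dat
echo "probe" > "$FILE"   # demo only; use the real NAS path

if [ -r "$FILE" ]; then
  echo "ok: $FILE visible on $(hostname)"
else
  echo "missing: $FILE not visible on $(hostname)"
fi
rm -f "$FILE"
```

Running this via ssh or your config-management tool across all nodes takes seconds and would have surfaced the unmounted NAS immediately.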
I have a question for larger loads, though: how does chunk_size help load performance? How do I set it for a large file (say, 10 GB)?
The chunk_size parameter is measured in MB (so chunk_size=32 is 32 MB) and accepts values from 32 MB up to 32 GB. It defines how big a chunk of the main file each thread gets. You don't want to set chunk_size any higher than file_size / thread_count; the number of threads is usually the number of cores.
At the same time, you don't want to set the chunk size much higher than that, or performance suffers. The right value is relative and depends on your system configuration (memory, number of cores, file size, etc.).
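That rule of thumb is easy to put into arithmetic. A small sketch (the function name and the 8-core assumption are mine, not from the thread; the 32 MB floor and file_size / thread_count ceiling follow the guidance above):

```python
def pick_chunk_size_mb(file_size_bytes: int, threads: int) -> int:
    """Suggest a chunk_size (in MB) for PloadDelimitedSource."""
    mb = 1024 * 1024
    per_thread = file_size_bytes // (threads * mb)  # MB each thread could take
    # Clamp to the parameter's accepted range: 32 MB .. 32 GB
    return max(32, min(per_thread, 32 * 1024))

# A 10 GB file on an 8-core node -> 1280 MB per thread
print(pick_chunk_size_mb(10 * 1024**3, 8))
```

So for the 10 GB example, something around chunk_size=1280 would keep all eight threads busy without over-chunking; actual tuning still depends on memory and the rest of the system.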
COPY myschema.table WITH SOURCE PloadDelimitedSource(file='/path_to_data/data.tsv', chunk_size=1024, nodes='ALL NODES') DELIMITER E'\t';
The above command took 5 hours to complete.
When I used chunk_size=100 and nodes='ANY NODES', it took 27 hours to complete.
I am running this from a non-Vertica server; the data file resides on a NAS that is accessible/visible from the DB hosts.
Any pointers/suggestions to speed this up? Kindly advise.