Sample ETL load script - Help
Can someone help me with a python script to load a file into vertica table. Since it is a huge file, I need to load with 4 parallel threads. Please help.
Tagged:
0
Best Answer
-
marcothesane - Select Field - Administrator
ON ANY NODE next to the name of the uncompressed infile leads to each node participating at least once; Collaborative Parsing is the effect that several parsing threads are busy on the same node. EXECUTIONPARALLELISM in the resource pool controls the number of threads; as well as the ApportionedFileMinimumPortionSizeKB parameter.
Check this part of the docu on the topic:
https://www.vertica.com/docs/10.0.x/HTML/Content/Authoring/DataLoad/UsingParallelLoadStreams.htm0
Answers
Why not take advantage of Apportioned Load?
https://www.vertica.com/blog/faster-data-loads-with-apportioned-load-quick-tip/
The "cleanest" way to make apportioned load happen is to have the load file on a directory that is locally mounted under the exact same directory on all existing Vertica nodes.
You transfer the uncompressed flat file to that directory, and it's immediately visible for all Vertica nodes with the same path name.
Uncompressed is necessary as, for apportioned load, each parsing thread of the, say, 8 parsing threads will position at the beginning and end of "their own" 8-th of the file, using fseek(), and then advance byte by byte until they find the next record delimiter, to determine their own portion.
With a compressed file, you can't do that.
I would recommend you to try using an Apportioned Load https://www.vertica.com/blog/faster-data-loads-with-apportioned-load-quick-tip/ , the best possible way to let python script to load. Hope you make any use of this friend
Thanks all for your answers! The main constraint I have is defining the parallel threads..
Suppose,
if I have less than 1 billion records, then I would like to load with 6 parallel threads .
if I have more than 1 billion records, then I would like to load with 8 parallel threads and the condition goes on .
Apologize for late reply! I was travelling.
how does Apportion load defines the number of parallel threads? is it based on the resource pool of the user?
@davidvilla yes number of threads depends on executionparallelism from resource pools . By default it set to auto i.e number of cores.
You can use the below query to check what value has been set to executionparallelism
select pool_name, execution_parallelism from resource_pool_status;
ON ANY NODE next to the name of the uncompressed infile leads to each node participating at least once; Collaborative Parsing is the effect that several parsing threads are busy on the same node. EXECUTIONPARALLELISM in the resource pool controls the number of threads; as well as the ApportionedFileMinimumPortionSizeKB parameter.
Check this part of the docu on the topic:
https://www.vertica.com/docs/10.0.x/HTML/Content/Authoring/DataLoad/UsingParallelLoadStreams.htm