Sample ETL load script - Help

davidvilla · January 2021

Can someone help me with a python script to load a file into vertica table. Since it is a huge file, I need to load with 4 parallel threads. Please help.

marcothesane · February 2021

ON ANY NODE next to the name of the uncompressed infile leads to each node participating at least once; Collaborative Parsing is the effect that several parsing threads are busy on the same node. EXECUTIONPARALLELISM in the resource pool controls the number of threads; as well as the ApportionedFileMinimumPortionSizeKB parameter.
Check this part of the docu on the topic:
https://www.vertica.com/docs/10.0.x/HTML/Content/Authoring/DataLoad/UsingParallelLoadStreams.htm

Jim_Knicely · January 2021

Why not take advantage of Apportioned Load?

https://www.vertica.com/blog/faster-data-loads-with-apportioned-load-quick-tip/

marcothesane · January 2021

The "cleanest" way to make apportioned load happen is to have the load file on a directory that is locally mounted under the exact same directory on all existing Vertica nodes.

You transfer the uncompressed flat file to that directory, and it's immediately visible for all Vertica nodes with the same path name.

Uncompressed is necessary as, for apportioned load, each parsing thread of the, say, 8 parsing threads will position at the beginning and end of "their own" 8-th of the file, using fseek(), and then advance byte by byte until they find the next record delimiter, to determine their own portion.
With a compressed file, you can't do that.

sredsas · January 2021

I would recommend you to try using an Apportioned Load https://www.vertica.com/blog/faster-data-loads-with-apportioned-load-quick-tip/ , the best possible way to let python script to load. Hope you make any use of this friend

davidvilla · January 2021

Thanks all for your answers! The main constraint I have is defining the parallel threads..
Suppose,
if I have less than 1 billion records, then I would like to load with 6 parallel threads .
if I have more than 1 billion records, then I would like to load with 8 parallel threads and the condition goes on .

Apologize for late reply! I was travelling.

davidvilla · February 2021

how does Apportion load defines the number of parallel threads? is it based on the resource pool of the user?

Nimmi_gupta · February 2021

@davidvilla yes number of threads depends on executionparallelism from resource pools . By default it set to auto i.e number of cores.
You can use the below query to check what value has been set to executionparallelism
select pool_name, execution_parallelism from resource_pool_status;

marcothesane · February 2021

ON ANY NODE next to the name of the uncompressed infile leads to each node participating at least once; Collaborative Parsing is the effect that several parsing threads are busy on the same node. EXECUTIONPARALLELISM in the resource pool controls the number of threads; as well as the ApportionedFileMinimumPortionSizeKB parameter.
Check this part of the docu on the topic:
https://www.vertica.com/docs/10.0.x/HTML/Content/Authoring/DataLoad/UsingParallelLoadStreams.htm

Sample ETL load script - Help

Best Answer

Answers

Leave a Comment