Sample ETL load script - Help

Can someone help me with a python script to load a file into vertica table. Since it is a huge file, I need to load with 4 parallel threads. Please help.

Best Answer

Answers

  • Jim_KnicelyJim_Knicely - Select Field - Administrator
  • marcothesanemarcothesane - Select Field - Administrator

    The "cleanest" way to make apportioned load happen is to have the load file on a directory that is locally mounted under the exact same directory on all existing Vertica nodes.

    You transfer the uncompressed flat file to that directory, and it's immediately visible for all Vertica nodes with the same path name.

    Uncompressed is necessary as, for apportioned load, each parsing thread of the, say, 8 parsing threads will position at the beginning and end of "their own" 8-th of the file, using fseek(), and then advance byte by byte until they find the next record delimiter, to determine their own portion.
    With a compressed file, you can't do that.

  • sredsassredsas Vertica Customer

    I would recommend you to try using an Apportioned Load https://www.vertica.com/blog/faster-data-loads-with-apportioned-load-quick-tip/ , the best possible way to let python script to load. Hope you make any use of this friend

  • edited January 2021

    Thanks all for your answers! The main constraint I have is defining the parallel threads..
    Suppose,
    if I have less than 1 billion records, then I would like to load with 6 parallel threads .
    if I have more than 1 billion records, then I would like to load with 8 parallel threads and the condition goes on .

    Apologize for late reply! I was travelling.

  • how does Apportion load defines the number of parallel threads? is it based on the resource pool of the user?

  • Nimmi_guptaNimmi_gupta - Select Field - Employee

    @davidvilla yes number of threads depends on executionparallelism from resource pools . By default it set to auto i.e number of cores.
    You can use the below query to check what value has been set to executionparallelism
    select pool_name, execution_parallelism from resource_pool_status;

  • marcothesanemarcothesane - Select Field - Administrator

    ON ANY NODE next to the name of the uncompressed infile leads to each node participating at least once; Collaborative Parsing is the effect that several parsing threads are busy on the same node. EXECUTIONPARALLELISM in the resource pool controls the number of threads; as well as the ApportionedFileMinimumPortionSizeKB parameter.
    Check this part of the docu on the topic:
    https://www.vertica.com/docs/10.0.x/HTML/Content/Authoring/DataLoad/UsingParallelLoadStreams.htm

Leave a Comment

BoldItalicStrikethroughOrdered listUnordered list
Emoji
Image
Align leftAlign centerAlign rightToggle HTML viewToggle full pageToggle lights
Drop image/file