Apportioned UDL issue

Hey Everyone -
We are trying to move toward using UDLs to handle our data loads. Right now we ingest a lot of "similar-to-CSV" data in various formats. We use a C# process to transform these files in whatever form they come in to a standardized command that we use Vertica to then ingest (including using a nonsense Unicode character as a separator). What we're doing a POC for is to see if we can just have Vertica do this processing work through C++ UDLs.

The problem is we really want to use apportioning for performance reasons, but we are struggling to figure out how in the Source to provide the headers. In our case some of our files will have the same headers but we don't necessarily know which order those headers will come in. So we need to read the file and map the header columns to standardized output columns. There does not seem to be any way to pass down the header information from the File Apportioning Source class to the Parser.

We're using the terminology and such from this github repo: https://github.com/vertica/UDx-Examples/tree/master/Java-and-C++/ApportionLoadFunctions.

We have tried several things; what we thought would work was extending the portions by the header data and then manually copying the headers in to each portion. But that results in us not reading the correct amount of bytes as it appears the piece who does the reading has no idea the reads we're doing aren't actually going back to the file.

Any thoughts or tips?

Best Answer

Answers

  • Bryan_HBryan_H Vertica Employee Administrator

    Can you create a temp table with the fields in file order, the INSERT INTO ... SELECT FROM the temp table to order the fields into the permanent table?

  • bryanwolfbryanwolf Vertica Customer

    We'll look in to that more - our hope was to avoid the temporary tables as part of this because this is technically already loading to a temp table - so we'd have a temp-temp table and that starts to get very confusing for explaining how this all works :)

    Can you create a temp table from the UDL? It seems like running a query inline would be not recommended - so I guess we'd have to have a step before that to read the first line and build the temp table - am I assuming correctly?

  • bryanwolfbryanwolf Vertica Customer

    After thinking about this more, the problem still is there because we are doing column-by-column data transforms in the UDLs, so we have to write code knowing which column is which. In the apportioned case only the first portion actually gets the headers so subsequent portions would have assume, like, Column 4 contains the columns I want.

    To clarify: imagine I have a column called "Description Code" that's a multi-part field that our value-add is splitting and parsing. Sometimes, in the file, you will find that column in ordinal 1 and other times ordinal 6. However, the file has headers so we can easily handle that iff we can pull the header data in to every portion parser.

    It seems making the columns consistent in a temp table would require pre-parsing the data in a non-apportioned way which is the specific thing we're trying to eliminate as it is our bottleneck.

  • bryanwolfbryanwolf Vertica Customer

    I think that building up a metadata file during pre-processing is an acceptable choice and that's I think the route we want to go down.
    Thanks for your suggestion!

Leave a Comment

BoldItalicStrikethroughOrdered listUnordered list
Emoji
Image
Align leftAlign centerAlign rightToggle HTML viewToggle full pageToggle lights
Drop image/file