Apportioned UDL issue

bryanwolf · April 2020

Hey Everyone -
We are trying to move toward using UDLs to handle our data loads. Right now we ingest a lot of "similar-to-CSV" data in various formats. We use a C# process to transform these files in whatever form they come in to a standardized command that we use Vertica to then ingest (including using a nonsense Unicode character as a separator). What we're doing a POC for is to see if we can just have Vertica do this processing work through C++ UDLs.

The problem is we really want to use apportioning for performance reasons, but we are struggling to figure out how in the Source to provide the headers. In our case some of our files will have the same headers but we don't necessarily know which order those headers will come in. So we need to read the file and map the header columns to standardized output columns. There does not seem to be any way to pass down the header information from the File Apportioning Source class to the Parser.

We're using the terminology and such from this github repo: https://github.com/vertica/UDx-Examples/tree/master/Java-and-C++/ApportionLoadFunctions.

We have tried several things; what we thought would work was extending the portions by the header data and then manually copying the headers in to each portion. But that results in us not reading the correct amount of bytes as it appears the piece who does the reading has no idea the reads we're doing aren't actually going back to the file.

Any thoughts or tips?

Bryan_H · April 2020

Would it be possible to pass in a string describing the input structure as an argument to the UDX? This is how I've implemented a JDBCSource, by allowing user to pass connection string and query as arguments, though not sure whether this would parallelize well: https://github.com/bryanherger/vertica-java-udl/blob/master/src/main/java/com/bryanherger/udparser/JDBCLoaderFactory.java

Extending the above, you could create a metadata file describing the input on a shared location visible to all nodes, and pass that as an argument. Then each instance could pick up the metadata and act on columns as directed by the metadata file.

Another option if it's necessary to specify column name might be to transform the input into JSON. This would add the field name to every row, though this would greatly increase the data size and probably slow down disk and I/O considerably if the input data set is very large.

Bryan_H · April 2020

Can you create a temp table with the fields in file order, the INSERT INTO ... SELECT FROM the temp table to order the fields into the permanent table?

bryanwolf · April 2020

We'll look in to that more - our hope was to avoid the temporary tables as part of this because this is technically already loading to a temp table - so we'd have a temp-temp table and that starts to get very confusing for explaining how this all works

Can you create a temp table from the UDL? It seems like running a query inline would be not recommended - so I guess we'd have to have a step before that to read the first line and build the temp table - am I assuming correctly?

bryanwolf · April 2020

After thinking about this more, the problem still is there because we are doing column-by-column data transforms in the UDLs, so we have to write code knowing which column is which. In the apportioned case only the first portion actually gets the headers so subsequent portions would have assume, like, Column 4 contains the columns I want.

To clarify: imagine I have a column called "Description Code" that's a multi-part field that our value-add is splitting and parsing. Sometimes, in the file, you will find that column in ordinal 1 and other times ordinal 6. However, the file has headers so we can easily handle that iff we can pull the header data in to every portion parser.

It seems making the columns consistent in a temp table would require pre-parsing the data in a non-apportioned way which is the specific thing we're trying to eliminate as it is our bottleneck.

bryanwolf · April 2020

I think that building up a metadata file during pre-processing is an acceptable choice and that's I think the route we want to go down.
Thanks for your suggestion!

Apportioned UDL issue

Best Answer

Answers

Leave a Comment