remove pruned columns dynamically using sdk
phil2
✭
Hello!
There is an ODBCLoader example (https://github.com/vertica/ODBC-Loader) that shows how to avoid fetching unused columns from the source system. But you still have to pass them filled with nulls through the Vertica SDK. Turns out it is very time-consuming too. If you query a single column out of external table with 20 columns it can be 10 times slower than fetching a single column out of external table with only one column.
Can you advise me if there is a way to avoid passing nulls for pruned columns out of UDParser?
0
Comments
I saw there is a RLETuple type oid. Is it possible to use it to reduce the size of data stream returning from UDParser? A little help will be appritiated. Or maybe you know someone I can reach who knows the answer.
I'm not sure it's avoidable - the number of columns parsed has to match the table DDL, so if 20 columns are defined, 20 are expected, even if 19 are set to NULL. What appears to be the time-consuming step: the query to the remote database, or processing the data after it's retrieved? This might help us narrow down and optimize.
Also, if possible, open a support case, or file an issue with GitHub.
You could also try dblink, which implements ODBC connectivity as a UDTF and allows you to configure SQL specific to a query: https://github.com/vertica/dblink
Passing data to vertica takes more than a half time. I dont even fill the nulls for unused columns, as they are not being used in a query. So I do exactly the same work whenever there is a one or twenty columns (read one column from the source and fill it in the writer buffer for that column). Moreover, I see that the number of rows available in writer decreases while adding new columns in external table interface. As if there is a limited RAM buffer with rows * columns size. Maybe that is what makes it slow down.
For now I came up with implementing RLE by myself and deduplicating rows instead of removing columns.
I'm looking for a general solution, and dblink with specific query would be too complicated for users.
Another thought is to use arrays. But for an array I cannot get the used columns. It is a complete array or nothing choice.
I also asked a question here https://github.com/vertica/ODBC-Loader/issues/16