Convert comma separated string (long varchar) to a list
aoropeza
Vertica Customer ✭
I have a column of type long varchar
that stores elements separated by the symbol ";".
Each record can have a length greater than 100,000 and more than 3,000 elements separated by ";".
I want to get the following
Original Table
id | items |
---|---|
1 | a_1;a_2; ... ; a_i |
2 | b_1;b_2; ... ; b_j |
3 | c_1;c_2; ... ; c_k |
Expected Table
id | item |
---|---|
1 | a_1 |
1 | a_2 |
1 | a_3 |
... | ... |
1 | a_i |
2 | b_1 |
2 | b_2 |
2 | b_3 |
... | ... |
2 | b_j |
I have tried the following:
-- Using Tokenizer select v_txtindex.StringTokenizerDelim(items, ';') over (partition by id) from foo; -- Using slip_part select split_part(items, ';', row_number() over (partition by id)) from foo; -- Using MapItems and MapDelimitedExtractor select id, mapitems(MapDelimitedExtractor(items using parameters delimiter = ';') over (partition best)) from foo;
But I got the following error:
[42883][3457] [Vertica]VJDBC ERROR: Function v_txtindex.StringTokenizerDelim(long varchar, unknown) does not exist, or permission is denied for v_txtindex.StringTokenizerDelim(long varchar, unknown)
Any suggestion to get the expected table.
Thank you very much in advance...
Tagged:
0
Answers
A database might be a poor solution for this problem. This sounds like something a Linux script could probably solve more efficiently, or at least to place it into a format that's more reasonably approachable.
That said, you might consider putting it into a text index instead. That might be more workable.
If you had a
VARCHAR
column,StringTokenizerDelim
would do the job. You could still use it with an explicit cast toVARCHAR
(and maximum length), but that will truncate your items argument; see example below. That said, this function could be extended to supportLONG VARCHAR
arguments. I'm filing an internal ticket to investigate that.Thank you for your prompt replies.
Regarding using a
text index
I tried the following:I can identify where to tell it to use the ";" character as field delimiter.
Are there any points I am missing?
Hello Ariel,
Thank you very much for your reply.
At the moment I am approaching some cases by casting to
VARCHAR(65000)
as you suggest.Extending the function to accept
LONG VARCHAR
would help a lot.Until StringTokenizerDelim will support long varchar, the following works for me.
Thank you very much for your answers. They were beneficial.
Finally, I did the following to work with
long varchar(102400)
fields.I splitted into three queries, fields less than or equal to 65,000 and the fields greater than 65,000 (I divided the content into left and right parts).
The function v_txtindex.StringTokenizer can handle long varchar, but only works with a space separator.
You can replace the semi-colon with a space, but that also means you can't have space in your tokens: