Convert comma separated string (long varchar) to a list

aoropeza · September 2022

I have a column of type long varchar that stores elements separated by the symbol ";".

Each record can have a length greater than 100,000 and more than 3,000 elements separated by ";".

I want to get the following

Original Table

id	items
1	a_1;a_2; ... ; a_i
2	b_1;b_2; ... ; b_j
3	c_1;c_2; ... ; c_k

Expected Table

id	item
1	a_1
1	a_2
1	a_3
...	...
1	a_i
2	b_1
2	b_2
2	b_3
...	...
2	b_j

I have tried the following:

-- Using Tokenizer
select v_txtindex.StringTokenizerDelim(items, ';') over (partition by id) from foo;

-- Using slip_part
select split_part(items, ';', row_number() over (partition by id)) from foo;

-- Using MapItems and MapDelimitedExtractor
select id, mapitems(MapDelimitedExtractor(items using parameters delimiter = ';') over (partition best)) from foo;

But I got the following error:

[42883][3457] [Vertica]VJDBC ERROR: Function v_txtindex.StringTokenizerDelim(long varchar, unknown) does not exist, or permission is denied for v_txtindex.StringTokenizerDelim(long varchar, unknown)

Any suggestion to get the expected table.

Thank you very much in advance...

Vertica_Curtis · September 2022

A database might be a poor solution for this problem. This sounds like something a Linux script could probably solve more efficiently, or at least to place it into a format that's more reasonably approachable.

That said, you might consider putting it into a text index instead. That might be more workable.

Ariel_Cary · September 2022

If you had a VARCHAR column, StringTokenizerDelim would do the job. You could still use it with an explicit cast to VARCHAR (and maximum length), but that will truncate your items argument; see example below. That said, this function could be extended to support LONG VARCHAR arguments. I'm filing an internal ticket to investigate that.

select * from t;
 id | items 
----+-------
  1 | a;b;c
  2 | d;e;f
(2 rows)

select id, v_txtindex.StringTokenizerDelim(items::varchar(65000), ';') over(partition by id) from t;
 id | words 
----+-------
  1 | a
  1 | b
  1 | c
  2 | d
  2 | e
  2 | f
(6 rows)

aoropeza · September 2022

Thank you for your prompt replies.

Regarding using a text index I tried the following:

-- Create projection
drop projection if exists proj_items;
create projection proj_items as
    select id, items
    from items
    order by id
    segmented by hash(id) all nodes;
select start_refresh();

-- Create Text Index
drop text index if exists txt_index_items;
create text index txt_index_items on items (id, items)
tokenizer v_txtindex.StringTokenizer(long varchar, long varchar)
stemmer none;

-- Select index text
select * from txt_index_items;

-- But I got the following:
| token              |doc_id |
|--------------------|-------|
| a_1;a_2; ... ; a_i |  1    |
| b_1;b_2; ... ; b_j |  2    |
| c_1;c_2; ... ; c_k |  3    |

I can identify where to tell it to use the ";" character as field delimiter.

Are there any points I am missing?

aoropeza · September 2022

Hello Ariel,

Thank you very much for your reply.

At the moment I am approaching some cases by casting to VARCHAR(65000) as you suggest.

Extending the function to accept LONG VARCHAR would help a lot.

mosheg · September 2022

Until StringTokenizerDelim will support long varchar, the following works for me.

drop table tv cascade;
create table tv (id int, f1 long varchar(65000));
\set MYLINE1 `printf '%s;' {1000000..1000005}`
\set MYLINE1 '''':MYLINE1''''
\set MYLINE2 `printf '%s;' {2000099..2000094}`
\set MYLINE2 '''':MYLINE2''''
insert into tv (id, f1) values (1, :MYLINE1);
insert into tv (id, f1) values (2, :MYLINE2);
commit;

SELECT * FROM tv;
   id |                        f1
----+--------------------------------------------------
    2 | 2000099;2000098;2000097;2000096;2000095;2000094;
    1 | 1000000;1000001;1000002;1000003;1000004;1000005;
(2 rows)

-- Example(1):
SELECT id, f1
FROM (SELECT id, split_part(f1::varchar(65000), ';', row_number() over (PARTITION BY id)) f1
FROM tv
CROSS JOIN columns) foo
WHERE f1 <> ''
ORDER BY 1,2;
 id |   f1
----+---------
  1 | 1000000
  1 | 1000001
  1 | 1000002
  1 | 1000003
  1 | 1000004
  1 | 1000005
  2 | 2000094
  2 | 2000095
  2 | 2000096
  2 | 2000097
  2 | 2000098
  2 | 2000099
(12 rows)

-- Example(2):
SELECT id, f2 as f1
FROM (SELECT id, v_txtindex.StringTokenizerDelim(f1::varchar(65000), ';') OVER (PARTITION BY id ORDER BY id) as f2 FROM tv) foo
ORDER BY 1, 2;
 id |   f1
----+---------
  1 | 1000000
  1 | 1000001
  1 | 1000002
  1 | 1000003
  1 | 1000004
  1 | 1000005
  2 | 2000094
  2 | 2000095
  2 | 2000096
  2 | 2000097
  2 | 2000098
  2 | 2000099
(12 rows)

aoropeza · September 2022

Thank you very much for your answers. They were beneficial.

Finally, I did the following to work with long varchar(102400) fields.

I splitted into three queries, fields less than or equal to 65,000 and the fields greater than 65,000 (I divided the content into left and right parts).

with short_items as
         (select id, v_txtindex.StringTokenizerDelim(items::varchar(65000), ';') over (partition by id) items from foo where length(items) <= 65000),
     left_items as
         (select id, v_txtindex.StringTokenizerDelim(
             substr(items, 1, regexp_instr(items, ';', floor(length(items) / 2)::int + 1, 1, 1))::varchar(65000), ';') over (partition by id) items from foo where length(items) > 65000),
     right_items as
         (select id, v_txtindex.StringTokenizerDelim(substr(items, regexp_instr(items, ';', floor(length(items) / 2)::int + 1, 1, 1), length(items)::int)::varchar(65000), ';') over (partition by id) items from foo where length(items) > 65000)
select id, items from short_items
union all
select id, items from left_items
union all
select id, items from right_items;

VValdar · September 2022

The function v_txtindex.StringTokenizer can handle long varchar, but only works with a space separator.
You can replace the semi-colon with a space, but that also means you can't have space in your tokens:

create table tst_token
( id     integer
, items  long varchar
)
order by id
segmented by hash(id) all nodes;

insert into tst_token values (1, 'a 100k length string generated on http://www.unit-conversion.info/texttools/random-string-generator with semicolon');

-- successive semicolons doesn't create empty rows, those are discarded thus the max_items (in my test it generated 1579 rows)
select id, length(items), length(items) - length(replace(items, ';', '')) as max_items
  from tst_token;

id    length  max_items
----  ------  ---------
   1  100000       1609

-- Works fine:
select id
     , v_txtindex.StringTokenizer('', replace(items, ';', ' ')) over(partition by id)
  from tst_token;

We're Moving!

Create My New Community Account Now

Convert comma separated string (long varchar) to a list

Answers

Leave a Comment