Convert comma separated string (long varchar) to a list

aoropezaaoropeza Vertica Customer

I have a column of type long varchar that stores elements separated by the symbol ";".

Each record can have a length greater than 100,000 and more than 3,000 elements separated by ";".

I want to get the following

Original Table

id items
1 a_1;a_2; ... ; a_i
2 b_1;b_2; ... ; b_j
3 c_1;c_2; ... ; c_k

Expected Table

id item
1 a_1
1 a_2
1 a_3
... ...
1 a_i
2 b_1
2 b_2
2 b_3
... ...
2 b_j

I have tried the following:

-- Using Tokenizer
select v_txtindex.StringTokenizerDelim(items, ';') over (partition by id) from foo;

-- Using slip_part
select split_part(items, ';', row_number() over (partition by id)) from foo;

-- Using MapItems and MapDelimitedExtractor
select id, mapitems(MapDelimitedExtractor(items using parameters delimiter = ';') over (partition best)) from foo;

But I got the following error:

[42883][3457] [Vertica]VJDBC ERROR: Function v_txtindex.StringTokenizerDelim(long varchar, unknown) does not exist, or permission is denied for v_txtindex.StringTokenizerDelim(long varchar, unknown)

Any suggestion to get the expected table.

Thank you very much in advance...

Answers

  • A database might be a poor solution for this problem. This sounds like something a Linux script could probably solve more efficiently, or at least to place it into a format that's more reasonably approachable.

    That said, you might consider putting it into a text index instead. That might be more workable.

  • Ariel_CaryAriel_Cary Vertica Employee Employee

    If you had a VARCHAR column, StringTokenizerDelim would do the job. You could still use it with an explicit cast to VARCHAR (and maximum length), but that will truncate your items argument; see example below. That said, this function could be extended to support LONG VARCHAR arguments. I'm filing an internal ticket to investigate that.

    select * from t;
     id | items 
    ----+-------
      1 | a;b;c
      2 | d;e;f
    (2 rows)
    
    select id, v_txtindex.StringTokenizerDelim(items::varchar(65000), ';') over(partition by id) from t;
     id | words 
    ----+-------
      1 | a
      1 | b
      1 | c
      2 | d
      2 | e
      2 | f
    (6 rows)
    
  • aoropezaaoropeza Vertica Customer

    Thank you for your prompt replies.

    Regarding using a text index I tried the following:

    -- Create projection
    drop projection if exists proj_items;
    create projection proj_items as
        select id, items
        from items
        order by id
        segmented by hash(id) all nodes;
    select start_refresh();
    
    -- Create Text Index
    drop text index if exists txt_index_items;
    create text index txt_index_items on items (id, items)
    tokenizer v_txtindex.StringTokenizer(long varchar, long varchar)
    stemmer none;
    
    -- Select index text
    select * from txt_index_items;
    
    -- But I got the following:
    | token              |doc_id |
    |--------------------|-------|
    | a_1;a_2; ... ; a_i |  1    |
    | b_1;b_2; ... ; b_j |  2    |
    | c_1;c_2; ... ; c_k |  3    |
    
    

    I can identify where to tell it to use the ";" character as field delimiter.

    Are there any points I am missing?

  • aoropezaaoropeza Vertica Customer

    Hello Ariel,

    Thank you very much for your reply.

    At the moment I am approaching some cases by casting to VARCHAR(65000) as you suggest.

    Extending the function to accept LONG VARCHAR would help a lot.

  • moshegmosheg Vertica Employee Administrator

    Until StringTokenizerDelim will support long varchar, the following works for me.

    drop table tv cascade;
    create table tv (id int, f1 long varchar(65000));
    \set MYLINE1 `printf '%s;' {1000000..1000005}`
    \set MYLINE1 '''':MYLINE1''''
    \set MYLINE2 `printf '%s;' {2000099..2000094}`
    \set MYLINE2 '''':MYLINE2''''
    insert into tv (id, f1) values (1, :MYLINE1);
    insert into tv (id, f1) values (2, :MYLINE2);
    commit;
    
    SELECT * FROM tv;
       id |                        f1
    ----+--------------------------------------------------
        2 | 2000099;2000098;2000097;2000096;2000095;2000094;
        1 | 1000000;1000001;1000002;1000003;1000004;1000005;
    (2 rows)
    
    -- Example(1):
    SELECT id, f1
    FROM (SELECT id, split_part(f1::varchar(65000), ';', row_number() over (PARTITION BY id)) f1
    FROM tv
    CROSS JOIN columns) foo
    WHERE f1 <> ''
    ORDER BY 1,2;
     id |   f1
    ----+---------
      1 | 1000000
      1 | 1000001
      1 | 1000002
      1 | 1000003
      1 | 1000004
      1 | 1000005
      2 | 2000094
      2 | 2000095
      2 | 2000096
      2 | 2000097
      2 | 2000098
      2 | 2000099
    (12 rows)
    
    -- Example(2):
    SELECT id, f2 as f1
    FROM (SELECT id, v_txtindex.StringTokenizerDelim(f1::varchar(65000), ';') OVER (PARTITION BY id ORDER BY id) as f2 FROM tv) foo
    ORDER BY 1, 2;
     id |   f1
    ----+---------
      1 | 1000000
      1 | 1000001
      1 | 1000002
      1 | 1000003
      1 | 1000004
      1 | 1000005
      2 | 2000094
      2 | 2000095
      2 | 2000096
      2 | 2000097
      2 | 2000098
      2 | 2000099
    (12 rows)
    
    
  • aoropezaaoropeza Vertica Customer

    Thank you very much for your answers. They were beneficial.

    Finally, I did the following to work with long varchar(102400) fields.

    I splitted into three queries, fields less than or equal to 65,000 and the fields greater than 65,000 (I divided the content into left and right parts).

    with short_items as
             (select id, v_txtindex.StringTokenizerDelim(items::varchar(65000), ';') over (partition by id) items from foo where length(items) <= 65000),
         left_items as
             (select id, v_txtindex.StringTokenizerDelim(
                 substr(items, 1, regexp_instr(items, ';', floor(length(items) / 2)::int + 1, 1, 1))::varchar(65000), ';') over (partition by id) items from foo where length(items) > 65000),
         right_items as
             (select id, v_txtindex.StringTokenizerDelim(substr(items, regexp_instr(items, ';', floor(length(items) / 2)::int + 1, 1, 1), length(items)::int)::varchar(65000), ';') over (partition by id) items from foo where length(items) > 65000)
    select id, items from short_items
    union all
    select id, items from left_items
    union all
    select id, items from right_items;
    
  • VValdarVValdar Vertica Employee Employee
    edited September 2022

    The function v_txtindex.StringTokenizer can handle long varchar, but only works with a space separator.
    You can replace the semi-colon with a space, but that also means you can't have space in your tokens:

    create table tst_token
    ( id     integer
    , items  long varchar
    )
    order by id
    segmented by hash(id) all nodes;
    
    insert into tst_token values (1, 'a 100k length string generated on http://www.unit-conversion.info/texttools/random-string-generator with semicolon');
    
    -- successive semicolons doesn't create empty rows, those are discarded thus the max_items (in my test it generated 1579 rows)
    select id, length(items), length(items) - length(replace(items, ';', '')) as max_items
      from tst_token;
    
    id    length  max_items
    ----  ------  ---------
       1  100000       1609
    
    -- Works fine:
    select id
         , v_txtindex.StringTokenizer('', replace(items, ';', ' ')) over(partition by id)
      from tst_token;
    
    

Leave a Comment

BoldItalicStrikethroughOrdered listUnordered list
Emoji
Image
Align leftAlign centerAlign rightToggle HTML viewToggle full pageToggle lights
Drop image/file