Convert comma separated string (long varchar) to a list — Vertica Forum

aoropeza Vertica Customer

I have a column of type long varchar that stores elements separated by the symbol ";".

Each record can be longer than 100,000 characters and contain more than 3,000 elements separated by ";".

I want to get the following:

Original Table

id items
1 a_1;a_2; ... ; a_i
2 b_1;b_2; ... ; b_j
3 c_1;c_2; ... ; c_k

Expected Table

id item
1 a_1
1 a_2
1 a_3
... ...
1 a_i
2 b_1
2 b_2
2 b_3
... ...
2 b_j
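The transformation above is an "explode" of each row's delimited list into one row per element. A minimal Python sketch of the target semantics (the sample data is made up to match the tables above):

```python
rows = [
    (1, "a_1;a_2;a_3"),
    (2, "b_1;b_2"),
]

# Explode each delimited string into one (id, item) row per element,
# mirroring the Original Table -> Expected Table transformation.
exploded = [(rid, item) for rid, items in rows for item in items.split(";")]
```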

I have tried the following:

-- Using Tokenizer
select v_txtindex.StringTokenizerDelim(items, ';') over (partition by id) from foo;

-- Using split_part
select split_part(items, ';', row_number() over (partition by id)) from foo;

-- Using MapItems and MapDelimitedExtractor
select id, mapitems(MapDelimitedExtractor(items using parameters delimiter = ';') over (partition by id)) from foo;

But I got the following error:

[42883][3457] [Vertica]VJDBC ERROR: Function v_txtindex.StringTokenizerDelim(long varchar, unknown) does not exist, or permission is denied for v_txtindex.StringTokenizerDelim(long varchar, unknown)

Any suggestions on how to get the expected table?

Thank you very much in advance...

Answers

  • A database might be a poor fit for this problem. It sounds like something a Linux script could solve more efficiently, or at least use to transform the data into a more approachable format.

    That said, you might consider putting it into a text index instead. That might be more workable.

  • Ariel_Cary Vertica Employee

    If you had a VARCHAR column, StringTokenizerDelim would do the job. You could still use it with an explicit cast to VARCHAR (and maximum length), but that will truncate your items argument; see example below. That said, this function could be extended to support LONG VARCHAR arguments. I'm filing an internal ticket to investigate that.

    select * from t;
     id | items 
    ----+-------
      1 | a;b;c
      2 | d;e;f
    (2 rows)
    
    select id, v_txtindex.StringTokenizerDelim(items::varchar(65000), ';') over(partition by id) from t;
     id | words 
    ----+-------
      1 | a
      1 | b
      1 | c
      2 | d
      2 | e
      2 | f
    (6 rows)
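    The truncation caveat above is easy to miss: the cast silently drops everything past the length cap, so trailing items disappear rather than raising an error. A rough Python sketch (with a tiny made-up cap standing in for 65,000):

```python
items = "alpha;beta;gamma"
max_len = 10  # stand-in for the VARCHAR(65000) cap

truncated = items[:max_len]    # everything past the cap is silently dropped
tokens = truncated.split(";")  # "gamma" never appears in the output
```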
    
  • aoropeza Vertica Customer

    Thank you for your prompt replies.

    Regarding using a text index I tried the following:

    -- Create projection
    drop projection if exists proj_items;
    create projection proj_items as
        select id, items
        from items
        order by id
        segmented by hash(id) all nodes;
    select start_refresh();
    
    -- Create Text Index
    drop text index if exists txt_index_items;
    create text index txt_index_items on items (id, items)
    tokenizer v_txtindex.StringTokenizer(long varchar, long varchar)
    stemmer none;
    
    -- Select index text
    select * from txt_index_items;
    
    -- But I got the following:
    | token              |doc_id |
    |--------------------|-------|
    | a_1;a_2; ... ; a_i |  1    |
    | b_1;b_2; ... ; b_j |  2    |
    | c_1;c_2; ... ; c_k |  3    |
    
    

    I can't identify where to tell it to use the ";" character as the field delimiter.

    Are there any points I am missing?

  • aoropeza Vertica Customer

    Hello Ariel,

    Thank you very much for your reply.

    At the moment I am approaching some cases by casting to VARCHAR(65000) as you suggest.

    Extending the function to accept LONG VARCHAR would help a lot.

  • mosheg Vertica Employee Administrator

    Until StringTokenizerDelim supports long varchar, the following works for me.

    drop table tv cascade;
    create table tv (id int, f1 long varchar(65000));
    \set MYLINE1 `printf '%s;' {1000000..1000005}`
    \set MYLINE1 '''':MYLINE1''''
    \set MYLINE2 `printf '%s;' {2000099..2000094}`
    \set MYLINE2 '''':MYLINE2''''
    insert into tv (id, f1) values (1, :MYLINE1);
    insert into tv (id, f1) values (2, :MYLINE2);
    commit;
    
    SELECT * FROM tv;
     id |                        f1
    ----+--------------------------------------------------
        2 | 2000099;2000098;2000097;2000096;2000095;2000094;
        1 | 1000000;1000001;1000002;1000003;1000004;1000005;
    (2 rows)
    
    -- Example(1):
    SELECT id, f1
    FROM (SELECT id, split_part(f1::varchar(65000), ';', row_number() over (PARTITION BY id)) f1
    FROM tv
    CROSS JOIN columns) foo
    WHERE f1 <> ''
    ORDER BY 1,2;
     id |   f1
    ----+---------
      1 | 1000000
      1 | 1000001
      1 | 1000002
      1 | 1000003
      1 | 1000004
      1 | 1000005
      2 | 2000094
      2 | 2000095
      2 | 2000096
      2 | 2000097
      2 | 2000098
      2 | 2000099
    (12 rows)
    
    -- Example(2):
    SELECT id, f2 as f1
    FROM (SELECT id, v_txtindex.StringTokenizerDelim(f1::varchar(65000), ';') OVER (PARTITION BY id ORDER BY id) as f2 FROM tv) foo
    ORDER BY 1, 2;
     id |   f1
    ----+---------
      1 | 1000000
      1 | 1000001
      1 | 1000002
      1 | 1000003
      1 | 1000004
      1 | 1000005
      2 | 2000094
      2 | 2000095
      2 | 2000096
      2 | 2000097
      2 | 2000098
      2 | 2000099
    (12 rows)
    
    
  • aoropeza Vertica Customer

    Thank you very much for your answers. They were very helpful.

    Finally, I did the following to work with long varchar(102400) fields.

    I split the work into three queries: one for fields of length at most 65,000, and two for fields longer than 65,000 (whose content I divided into left and right halves).

    with short_items as
             (select id, v_txtindex.StringTokenizerDelim(items::varchar(65000), ';') over (partition by id) items from foo where length(items) <= 65000),
         left_items as
             (select id, v_txtindex.StringTokenizerDelim(
                 substr(items, 1, regexp_instr(items, ';', floor(length(items) / 2)::int + 1, 1, 1))::varchar(65000), ';') over (partition by id) items from foo where length(items) > 65000),
         right_items as
             (select id, v_txtindex.StringTokenizerDelim(substr(items, regexp_instr(items, ';', floor(length(items) / 2)::int + 1, 1, 1), length(items)::int)::varchar(65000), ';') over (partition by id) items from foo where length(items) > 65000)
    select id, items from short_items
    union all
    select id, items from left_items
    union all
    select id, items from right_items;
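    The key to the left/right split is cutting at a delimiter boundary: the regexp_instr call locates the first ";" at or after the midpoint, so neither half cuts a token in two. A rough Python sketch of that splitting logic (the helper name is made up for illustration):

```python
def split_at_delimiter(s, delim=";"):
    # Find the first delimiter at or after the midpoint and cut just past
    # it, so the left half ends with a complete token (plus its trailing
    # delimiter) and the right half starts with a complete token.
    cut = s.index(delim, len(s) // 2) + 1
    return s[:cut], s[cut:]

left, right = split_at_delimiter("a;b;c;d;e;f;")
```

    Tokenizing the two halves separately then yields the same items as tokenizing the whole string.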
    
  • VValdar Vertica Employee
    edited September 2022

    The function v_txtindex.StringTokenizer can handle long varchar, but it only works with a space separator.
    You can replace the semicolons with spaces, but that also means you can't have spaces in your tokens:

    create table tst_token
    ( id     integer
    , items  long varchar
    )
    order by id
    segmented by hash(id) all nodes;
    
    insert into tst_token values (1, 'a 100k length string generated on http://www.unit-conversion.info/texttools/random-string-generator with semicolon');
    
    -- Successive semicolons don't create empty rows; they are discarded, so max_items is an upper bound (in my test it generated 1579 rows)
    select id, length(items), length(items) - length(replace(items, ';', '')) as max_items
      from tst_token;
    
    id    length  max_items
    ----  ------  ---------
       1  100000       1609
    
    -- Works fine:
    select id
         , v_txtindex.StringTokenizer('', replace(items, ';', ' ')) over(partition by id)
      from tst_token;
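    The caveat about spaces in tokens is worth seeing concretely: replacing ";" with " " before whitespace tokenization splits multi-word tokens apart. A quick Python sketch with made-up multi-word data:

```python
items = "New York;Los Angeles;Boston"

# Replacing ';' with ' ' and then tokenizing on whitespace (as the
# StringTokenizer workaround does) splits multi-word tokens apart:
# five tokens come back instead of the expected three.
tokens = items.replace(";", " ").split()
```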
    
    
