Vertica - Split a column based on length

ic10134ic10134
edited April 2020 in General Discussion

All,
We are trying to split a column based on size. For example:

ID|text
1|abcdefg
2|hijk
3|lm

We need it be split at text size: 3 along with a seq_no

ID|new_text|seq_no
1|abc|1
1|def|2
1|g|3
2|hij|1
2|k|2
3|lm|1

Is there anyway this can be achieved this. I have seen posts that do based on a delimiter but not size. Please assist if anyone is aware of a neat trick.

Tagged:

Best Answers

  • marcothesanemarcothesane - Select Field - Administrator
    Answer ✓

    I created a table with 50,000 rows and tried mine . I created a temp table to fill, so that the fetch loop did not penalise the performance, and took 79 milliseconds to build it ...

    SELECT * FROM test limit 10;
    -- out  id |  text   
    -- out ----+---------
    -- out   1 | 1VIAgFg
    -- out   2 | IfPLHbT
    -- out   3 | EWOmXAx
    -- out   4 | zl8paoh
    -- out   5 | 9EpQ9Kx
    -- out   6 | ZpagcCh
    -- out   7 | 6xoVoit
    -- out   8 | mCniu1U
    -- out   9 | euieQZa
    -- out  10 | pOkarqe
    
    SELECT COUNT(*) FROM test;
    -- out  COUNT 
    -- out -------
    -- out  50000
    
    CREATE LOCAL TEMPORARY TABLE foo
    ON COMMIT PRESERVE ROWS AS
    WITH
    i(i) AS (
              SELECT 1
    UNION ALL SELECT 2
    UNION ALL SELECT 3
    )
    SELECT 
      id
    , SUBSTR(text,(i-1)*3+1,3)::VARCHAR(3) AS new_text
    , i AS seq_no
    FROM test CROSS JOIN i
    WHERE SUBSTR(text,(i-1)*3+1,3) <>''
    ORDER BY id,i
    ;
    -- out CREATE TABLE
    -- out Time: First fetch (0 rows): 79.198 ms. All rows formatted: 79.221 ms
    
    SELECT COUNT(*) FROM foo;
    -- out  COUNT  
    -- out --------
    -- out  150000
    
    SELECT * FROM foo limit 10;
    -- out  id | new_text | seq_no 
    -- out ----+----------+--------
    -- out   1 | 1VI      |      1
    -- out   1 | AgF      |      2
    -- out   1 | g        |      3
    -- out   2 | IfP      |      1
    -- out   2 | LHb      |      2
    -- out   2 | T        |      3
    -- out   3 | EWO      |      1
    -- out   3 | mXA      |      2
    -- out   3 | x        |      3
    -- out   4 | zl8      |      1
    

    What did you do differently?

Answers

  • SruthiASruthiA Vertica Employee Administrator
    edited April 2020

    I think you can use SUBSTR function to achieve the functionality you are looking for. Please find the sample example below.

    dbadmin=> select * from split_test;
    ID | text
    ----+-----------
    1 | abc
    3 | abcefg
    2 | xvfabcefg
    (3 rows)

    dbadmin=> select substr(text, 0, ID+1) from split_test;

    substr

    a
    abc
    xv
    (3 rows)

    dbadmin=> select id, text, substr(text, 0, ID+1) from split_test;
    id | text | substr
    ----+-----------+--------
    1 | abc | a
    2 | xvfabcefg | xv
    3 | abcefg | abc
    (3 rows)

    dbadmin=>

  • Hi Sruthi, we are not splitting at 3 bytes for each row. Also seq number is not added.

  • SruthiASruthiA Vertica Employee Administrator

    I am splitting dynamically based on the ID column, considering it to be the size for splitting. So, do you want to split of length 3 irrespective of the length of the string?

  • Yes Sruthi. We want to split at length 3 irrespective of the length of the string and store sequence number for each record

  • SruthiASruthiA Vertica Employee Administrator

    In such a scenario, I think you can write UDX to achieve this functionality. Please find the link below which contains examples of Vertica UDx

    https://github.com/vertica/UDx-Examples

  • marcothesanemarcothesane - Select Field - Administrator

    Oh - and if it's a lot more than 3 integers, go:

    WITH
    i(i) AS (
      SELECT MICROSECOND(tm)
      FROM (
                  SELECT TIMESTAMPADD(MICROSECOND,   1,'2000-01-01'::TIMESTAMP)
        UNION ALL SELECT TIMESTAMPADD(MICROSECOND,1000,'2000-01-01'::TIMESTAMP)
      ) l(ts)
      TIMESERIES tm AS '1 MICROSECOND' OVER(ORDER BY ts)
    )
    SELECT * FROM i;
    
  • Thanks Jim_Knicely & marcothesane. Do you forsee any performance issues using cross join for a table with about ~100 million rows.

  • marcothesanemarcothesane - Select Field - Administrator

    Hi @ic10134
    A CROSS JOIN is always problematic, sure - but it's still the best means, in my experience, to expand an existing set of rows like in the way you need it.
    If the series of integers goes above the millions of rows - you might want to be careful - but I'm not all too worried in the case of a dozen or so rows ....
    Happy playing ...

  • Thanks, but it's seems to be taking for ever even for 10k records. Is there any other way this can be achieved without CROSS JOIN COLUMNS

  • Jim_KnicelyJim_Knicely - Select Field - Administrator
    edited April 2020

    This simple test works pretty fast..

    dbadmin=> SELECT COUNT(*) FROM test;
     COUNT
    -------
     10000
    (1 row)
    
    Time: First fetch (1 row): 29.254 ms. All rows formatted: 29.296 ms
    
    dbadmin=> SELECT * FROM test WHERE ID <= 10 ORDER BY ID; -- Sample Data
     ID |     text
    ----+---------------
      1 | abcdefg
      1 | eqwugci
      2 | hijk
      2 | xqjs
      3 | lm
      3 | md
      4 | qdavwybpcxse
      5 | gapftieum
      6 | onchqrvjwcle
      7 | wsmmiaixexj
      8 | fjeyqaqqhynf
      9 | bvryaocvd
     10 | aspmwvsloyrpd
    (13 rows)
    
    Time: First fetch (13 rows): 28.383 ms. All rows formatted: 28.479 ms
    
    dbadmin=> SELECT ID, REPLACE(LISTAGG(text), ',', '') new_text, part+1 seq_no FROM (SELECT ID, TRUNC(rn / 3.01)::INT part, SUBSTR(text, rn, 1) text FROM (SELECT ID, text, row_number() OVER (PARTITION BY ID) rn FROM test CROSS JOIN COLUMNS) foo WHERE rn <= LENGTH(text) ORDER BY 1, 2) foo WHERE ID <= 10 GROUP BY ID, part ORDER BY 1;
     ID | new_text | seq_no
    ----+----------+--------
      1 | abc      |      1
      1 | def      |      2
      1 | g        |      3
      2 | hji      |      1
      2 | k        |      2
      3 | ml       |      1
      4 | qda      |      1
      4 | vwy      |      2
      4 | bpc      |      3
      4 | xse      |      4
      5 | gap      |      1
      5 | itf      |      2
      5 | eum      |      3
      6 | onc      |      1
      6 | hqr      |      2
      6 | vjw      |      3
      6 | cle      |      4
      7 | wsm      |      1
      7 | mia      |      2
      7 | ixe      |      3
      7 | xj       |      4
      8 | fje      |      1
      8 | yqa      |      2
      8 | qqh      |      3
      8 | ynf      |      4
      9 | bvr      |      1
      9 | oay      |      2
      9 | cvd      |      3
     10 | asp      |      1
     10 | mwv      |      2
     10 | slo      |      3
     10 | yrp      |      4
     10 | d        |      5
    (33 rows)
    
    Time: First fetch (33 rows): 80.394 ms. All rows formatted: 80.589 ms
    
    dbadmin=> \o /dev/null
    
    dbadmin=> SELECT ID, REPLACE(LISTAGG(text), ',', '') new_text, part+1 seq_no FROM (SELECT ID, TRUNC(rn / 3.01)::INT part, SUBSTR(text, rn, 1) text FROM (SELECT ID, text, row_number() OVER (PARTITION BY ID) rn FROM test CROSS JOIN COLUMNS) foo WHERE rn <= LENGTH(text) ORDER BY 1, 2) foo GROUP BY ID, part ORDER BY 1; -- Process 10K
    Time: First fetch (1000 rows): 1620.042 ms. All rows formatted: 1678.680 ms
    
    dbadmin=> \o
    
  • Right I understood that. But running for 10k-50k rows dataset is taking for ever

  • Jim_KnicelyJim_Knicely - Select Field - Administrator
    edited April 2020

    Takes about 7 seconds to process 50K on my tiny cluster:

    dbadmin=> SELECT COUNT(*) FROM test2;
     COUNT
    -------
     50000
    (1 row)
    
    dbadmin=> SELECT * FROM test2 WHERE ID <= 10 ORDER BY 1;
     ID |     text
    ----+---------------
      1 | abcdefg
      2 | hijk
      3 | lm
      4 | igbwilo
      5 | xjnlkwmvxvwb
      6 | xedayfipg
      7 | cbuyyxmjm
      8 | abrgjjtx
      9 | parhtbwgtpeqa
     10 | icfgmub
    (10 rows)
    
    dbadmin=> \timing
    Timing is on.
    
    dbadmin=> \o /dev/null
    
    dbadmin=> SELECT ID, REPLACE(LISTAGG(text), ',', '') new_text, part+1 seq_no FROM (SELECT ID, TRUNC(rn / 3.01)::INT part, SUBSTR(text, rn, 1) text FROM (SELECT ID, text, row_number() OVER (PARTITION BY ID) rn FROM test CROSS JOIN COLUMNS) foo WHERE rn <= LENGTH(text) ORDER BY 1, 2) foo GROUP BY ID, part ORDER BY 1; -- Process 50K
    Time: First fetch (1000 rows): 6694.882 ms. All rows formatted: 6952.983 ms
    
    dbadmin=> \o
    
  • Thanks marcothesane. This is really neat! Will there be any issue if the text has space or comma in the content

  • marcothesanemarcothesane - Select Field - Administrator

    Well - the space / comma will end up in the middle of a split string ... just try it ...

Leave a Comment

BoldItalicStrikethroughOrdered listUnordered list
Emoji
Image
Align leftAlign centerAlign rightToggle HTML viewToggle full pageToggle lights
Drop image/file