Vertica - Split a column based on length

ic10134 · April 2020

All,
We are trying to split a column based on size. For example:

ID|text
1|abcdefg
2|hijk
3|lm

We need it be split at text size: 3 along with a seq_no

ID|new_text|seq_no
1|abc|1
1|def|2
1|g|3
2|hij|1
2|k|2
3|lm|1

Is there anyway this can be achieved this. I have seen posts that do based on a delimiter but not size. Please assist if anyone is aware of a neat trick.

Jim_Knicely · April 2020

Here's one crazy way

dbadmin=> SELECT * FROM test ORDER BY 1;
 ID |  text
----+---------
  1 | abcdefg
  2 | hijk
  3 | lm
(3 rows)

dbadmin=> SELECT ID, REPLACE(LISTAGG(text), ',', '') new_text, part+1 seq_no FROM (SELECT ID, TRUNC(rn / 3.01)::INT part, SUBSTR(text, rn, 1) text FROM (SELECT ID, text, row_number() OVER (PARTITION BY ID) rn FROM test CROSS JOIN COLUMNS) foo WHERE rn <= LENGTH(text) ORDER BY 1, 2) foo GROUP BY ID, part ORDER BY 1;
 ID | new_text | seq_no
----+----------+--------
  1 | abc      |      1
  1 | def      |      2
  1 | g        |      3
  2 | hij      |      1
  2 | k        |      2
  3 | lm       |      1
(6 rows)

marcothesane · April 2020

Just cross join with a series of integers. Like so:

WITH
input(ID,text) AS (
          SELECT 1,'abcdefg'
UNION ALL SELECT 2,'hijk'
UNION ALL SELECT 3,'lm'
)
,
i(i) AS (
          SELECT 1
UNION ALL SELECT 2
UNION ALL SELECT 3
)
SELECT 
  id
, SUBSTR(text,(i-1)*3+1,3) AS new_text
, i AS seq_no
FROM input CROSS JOIN i
WHERE SUBSTR(text,(i-1)*3+1,3) <>''
ORDER BY id,i
;
-- out  id | new_text | seq_no 
-- out ----+----------+--------
-- out   1 | abc      |      1
-- out   1 | def      |      2
-- out   1 | g        |      3
-- out   2 | hij      |      1
-- out   2 | k        |      2
-- out   3 | lm       |      1
-- out (6 rows)
-- out 
-- out Time: First fetch (6 rows): 11.334 ms. All rows formatted: 11.372 ms

marcothesane · April 2020

I created a table with 50,000 rows and tried mine . I created a temp table to fill, so that the fetch loop did not penalise the performance, and took 79 milliseconds to build it ...

SELECT * FROM test limit 10;
-- out  id |  text   
-- out ----+---------
-- out   1 | 1VIAgFg
-- out   2 | IfPLHbT
-- out   3 | EWOmXAx
-- out   4 | zl8paoh
-- out   5 | 9EpQ9Kx
-- out   6 | ZpagcCh
-- out   7 | 6xoVoit
-- out   8 | mCniu1U
-- out   9 | euieQZa
-- out  10 | pOkarqe

SELECT COUNT(*) FROM test;
-- out  COUNT 
-- out -------
-- out  50000

CREATE LOCAL TEMPORARY TABLE foo
ON COMMIT PRESERVE ROWS AS
WITH
i(i) AS (
          SELECT 1
UNION ALL SELECT 2
UNION ALL SELECT 3
)
SELECT 
  id
, SUBSTR(text,(i-1)*3+1,3)::VARCHAR(3) AS new_text
, i AS seq_no
FROM test CROSS JOIN i
WHERE SUBSTR(text,(i-1)*3+1,3) <>''
ORDER BY id,i
;
-- out CREATE TABLE
-- out Time: First fetch (0 rows): 79.198 ms. All rows formatted: 79.221 ms

SELECT COUNT(*) FROM foo;
-- out  COUNT  
-- out --------
-- out  150000

SELECT * FROM foo limit 10;
-- out  id | new_text | seq_no 
-- out ----+----------+--------
-- out   1 | 1VI      |      1
-- out   1 | AgF      |      2
-- out   1 | g        |      3
-- out   2 | IfP      |      1
-- out   2 | LHb      |      2
-- out   2 | T        |      3
-- out   3 | EWO      |      1
-- out   3 | mXA      |      2
-- out   3 | x        |      3
-- out   4 | zl8      |      1

What did you do differently?

SruthiA · April 2020

I think you can use SUBSTR function to achieve the functionality you are looking for. Please find the sample example below.

dbadmin=> select * from split_test;
ID | text
----+-----------
1 | abc
3 | abcefg
2 | xvfabcefg
(3 rows)

dbadmin=> select substr(text, 0, ID+1) from split_test;

substr

a
abc
xv
(3 rows)

dbadmin=>

ic10134 · April 2020

Hi Sruthi, we are not splitting at 3 bytes for each row. Also seq number is not added.

SruthiA · April 2020

I am splitting dynamically based on the ID column, considering it to be the size for splitting. So, do you want to split of length 3 irrespective of the length of the string?

ic10134 · April 2020

Yes Sruthi. We want to split at length 3 irrespective of the length of the string and store sequence number for each record

SruthiA · April 2020

In such a scenario, I think you can write UDX to achieve this functionality. Please find the link below which contains examples of Vertica UDx

https://github.com/vertica/UDx-Examples

marcothesane · April 2020

Oh - and if it's a lot more than 3 integers, go:

WITH
i(i) AS (
  SELECT MICROSECOND(tm)
  FROM (
              SELECT TIMESTAMPADD(MICROSECOND,   1,'2000-01-01'::TIMESTAMP)
    UNION ALL SELECT TIMESTAMPADD(MICROSECOND,1000,'2000-01-01'::TIMESTAMP)
  ) l(ts)
  TIMESERIES tm AS '1 MICROSECOND' OVER(ORDER BY ts)
)
SELECT * FROM i;

ic10134 · April 2020

Thanks Jim_Knicely & marcothesane. Do you forsee any performance issues using cross join for a table with about ~100 million rows.

marcothesane · April 2020

Hi @ic10134
A CROSS JOIN is always problematic, sure - but it's still the best means, in my experience, to expand an existing set of rows like in the way you need it.
If the series of integers goes above the millions of rows - you might want to be careful - but I'm not all too worried in the case of a dozen or so rows ....
Happy playing ...

ic10134 · April 2020

Thanks, but it's seems to be taking for ever even for 10k records. Is there any other way this can be achieved without CROSS JOIN COLUMNS

Jim_Knicely · April 2020

This simple test works pretty fast..

dbadmin=> SELECT COUNT(*) FROM test;
 COUNT
-------
 10000
(1 row)

Time: First fetch (1 row): 29.254 ms. All rows formatted: 29.296 ms

dbadmin=> SELECT * FROM test WHERE ID <= 10 ORDER BY ID; -- Sample Data
 ID |     text
----+---------------
  1 | abcdefg
  1 | eqwugci
  2 | hijk
  2 | xqjs
  3 | lm
  3 | md
  4 | qdavwybpcxse
  5 | gapftieum
  6 | onchqrvjwcle
  7 | wsmmiaixexj
  8 | fjeyqaqqhynf
  9 | bvryaocvd
 10 | aspmwvsloyrpd
(13 rows)

Time: First fetch (13 rows): 28.383 ms. All rows formatted: 28.479 ms

dbadmin=> SELECT ID, REPLACE(LISTAGG(text), ',', '') new_text, part+1 seq_no FROM (SELECT ID, TRUNC(rn / 3.01)::INT part, SUBSTR(text, rn, 1) text FROM (SELECT ID, text, row_number() OVER (PARTITION BY ID) rn FROM test CROSS JOIN COLUMNS) foo WHERE rn <= LENGTH(text) ORDER BY 1, 2) foo WHERE ID <= 10 GROUP BY ID, part ORDER BY 1;
 ID | new_text | seq_no
----+----------+--------
  1 | abc      |      1
  1 | def      |      2
  1 | g        |      3
  2 | hji      |      1
  2 | k        |      2
  3 | ml       |      1
  4 | qda      |      1
  4 | vwy      |      2
  4 | bpc      |      3
  4 | xse      |      4
  5 | gap      |      1
  5 | itf      |      2
  5 | eum      |      3
  6 | onc      |      1
  6 | hqr      |      2
  6 | vjw      |      3
  6 | cle      |      4
  7 | wsm      |      1
  7 | mia      |      2
  7 | ixe      |      3
  7 | xj       |      4
  8 | fje      |      1
  8 | yqa      |      2
  8 | qqh      |      3
  8 | ynf      |      4
  9 | bvr      |      1
  9 | oay      |      2
  9 | cvd      |      3
 10 | asp      |      1
 10 | mwv      |      2
 10 | slo      |      3
 10 | yrp      |      4
 10 | d        |      5
(33 rows)

Time: First fetch (33 rows): 80.394 ms. All rows formatted: 80.589 ms

dbadmin=> \o /dev/null

dbadmin=> SELECT ID, REPLACE(LISTAGG(text), ',', '') new_text, part+1 seq_no FROM (SELECT ID, TRUNC(rn / 3.01)::INT part, SUBSTR(text, rn, 1) text FROM (SELECT ID, text, row_number() OVER (PARTITION BY ID) rn FROM test CROSS JOIN COLUMNS) foo WHERE rn <= LENGTH(text) ORDER BY 1, 2) foo GROUP BY ID, part ORDER BY 1; -- Process 10K
Time: First fetch (1000 rows): 1620.042 ms. All rows formatted: 1678.680 ms

dbadmin=> \o

ic10134 · April 2020

Right I understood that. But running for 10k-50k rows dataset is taking for ever

Jim_Knicely · April 2020

Takes about 7 seconds to process 50K on my tiny cluster:

dbadmin=> SELECT COUNT(*) FROM test2;
 COUNT
-------
 50000
(1 row)

dbadmin=> SELECT * FROM test2 WHERE ID <= 10 ORDER BY 1;
 ID |     text
----+---------------
  1 | abcdefg
  2 | hijk
  3 | lm
  4 | igbwilo
  5 | xjnlkwmvxvwb
  6 | xedayfipg
  7 | cbuyyxmjm
  8 | abrgjjtx
  9 | parhtbwgtpeqa
 10 | icfgmub
(10 rows)

dbadmin=> \timing
Timing is on.

dbadmin=> \o /dev/null

dbadmin=> SELECT ID, REPLACE(LISTAGG(text), ',', '') new_text, part+1 seq_no FROM (SELECT ID, TRUNC(rn / 3.01)::INT part, SUBSTR(text, rn, 1) text FROM (SELECT ID, text, row_number() OVER (PARTITION BY ID) rn FROM test CROSS JOIN COLUMNS) foo WHERE rn <= LENGTH(text) ORDER BY 1, 2) foo GROUP BY ID, part ORDER BY 1; -- Process 50K
Time: First fetch (1000 rows): 6694.882 ms. All rows formatted: 6952.983 ms

dbadmin=> \o

ic10134 · April 2020

Thanks marcothesane. This is really neat! Will there be any issue if the text has space or comma in the content

marcothesane · April 2020

Well - the space / comma will end up in the middle of a split string ... just try it ...

Vertica - Split a column based on length

Best Answers

Answers

substr

Leave a Comment