Vertica - Split a column based on length
ic10134
✭
All,
We are trying to split a column based on size. For example:
ID|text
1|abcdefg
2|hijk
3|lm
We need it be split at text size: 3 along with a seq_no
ID|new_text|seq_no
1|abc|1
1|def|2
1|g|3
2|hij|1
2|k|2
3|lm|1
Is there anyway this can be achieved this. I have seen posts that do based on a delimiter but not size. Please assist if anyone is aware of a neat trick.
0
Best Answers
-
Jim_Knicely - Select Field - Administrator
Here's one crazy way
dbadmin=> SELECT * FROM test ORDER BY 1; ID | text ----+--------- 1 | abcdefg 2 | hijk 3 | lm (3 rows) dbadmin=> SELECT ID, REPLACE(LISTAGG(text), ',', '') new_text, part+1 seq_no FROM (SELECT ID, TRUNC(rn / 3.01)::INT part, SUBSTR(text, rn, 1) text FROM (SELECT ID, text, row_number() OVER (PARTITION BY ID) rn FROM test CROSS JOIN COLUMNS) foo WHERE rn <= LENGTH(text) ORDER BY 1, 2) foo GROUP BY ID, part ORDER BY 1; ID | new_text | seq_no ----+----------+-------- 1 | abc | 1 1 | def | 2 1 | g | 3 2 | hij | 1 2 | k | 2 3 | lm | 1 (6 rows)
5 -
marcothesane - Select Field - Administrator
Just cross join with a series of integers. Like so:
WITH input(ID,text) AS ( SELECT 1,'abcdefg' UNION ALL SELECT 2,'hijk' UNION ALL SELECT 3,'lm' ) , i(i) AS ( SELECT 1 UNION ALL SELECT 2 UNION ALL SELECT 3 ) SELECT id , SUBSTR(text,(i-1)*3+1,3) AS new_text , i AS seq_no FROM input CROSS JOIN i WHERE SUBSTR(text,(i-1)*3+1,3) <>'' ORDER BY id,i ; -- out id | new_text | seq_no -- out ----+----------+-------- -- out 1 | abc | 1 -- out 1 | def | 2 -- out 1 | g | 3 -- out 2 | hij | 1 -- out 2 | k | 2 -- out 3 | lm | 1 -- out (6 rows) -- out -- out Time: First fetch (6 rows): 11.334 ms. All rows formatted: 11.372 ms
5 -
marcothesane - Select Field - Administrator
I created a table with 50,000 rows and tried mine . I created a temp table to fill, so that the fetch loop did not penalise the performance, and took 79 milliseconds to build it ...
SELECT * FROM test limit 10; -- out id | text -- out ----+--------- -- out 1 | 1VIAgFg -- out 2 | IfPLHbT -- out 3 | EWOmXAx -- out 4 | zl8paoh -- out 5 | 9EpQ9Kx -- out 6 | ZpagcCh -- out 7 | 6xoVoit -- out 8 | mCniu1U -- out 9 | euieQZa -- out 10 | pOkarqe SELECT COUNT(*) FROM test; -- out COUNT -- out ------- -- out 50000 CREATE LOCAL TEMPORARY TABLE foo ON COMMIT PRESERVE ROWS AS WITH i(i) AS ( SELECT 1 UNION ALL SELECT 2 UNION ALL SELECT 3 ) SELECT id , SUBSTR(text,(i-1)*3+1,3)::VARCHAR(3) AS new_text , i AS seq_no FROM test CROSS JOIN i WHERE SUBSTR(text,(i-1)*3+1,3) <>'' ORDER BY id,i ; -- out CREATE TABLE -- out Time: First fetch (0 rows): 79.198 ms. All rows formatted: 79.221 ms SELECT COUNT(*) FROM foo; -- out COUNT -- out -------- -- out 150000 SELECT * FROM foo limit 10; -- out id | new_text | seq_no -- out ----+----------+-------- -- out 1 | 1VI | 1 -- out 1 | AgF | 2 -- out 1 | g | 3 -- out 2 | IfP | 1 -- out 2 | LHb | 2 -- out 2 | T | 3 -- out 3 | EWO | 1 -- out 3 | mXA | 2 -- out 3 | x | 3 -- out 4 | zl8 | 1
What did you do differently?
1
Answers
I think you can use SUBSTR function to achieve the functionality you are looking for. Please find the sample example below.
dbadmin=> select * from split_test;
ID | text
----+-----------
1 | abc
3 | abcefg
2 | xvfabcefg
(3 rows)
dbadmin=> select substr(text, 0, ID+1) from split_test;
substr
a
abc
xv
(3 rows)
dbadmin=> select id, text, substr(text, 0, ID+1) from split_test;
id | text | substr
----+-----------+--------
1 | abc | a
2 | xvfabcefg | xv
3 | abcefg | abc
(3 rows)
dbadmin=>
Hi Sruthi, we are not splitting at 3 bytes for each row. Also seq number is not added.
I am splitting dynamically based on the ID column, considering it to be the size for splitting. So, do you want to split of length 3 irrespective of the length of the string?
Yes Sruthi. We want to split at length 3 irrespective of the length of the string and store sequence number for each record
In such a scenario, I think you can write UDX to achieve this functionality. Please find the link below which contains examples of Vertica UDx
https://github.com/vertica/UDx-Examples
Oh - and if it's a lot more than 3 integers, go:
Thanks Jim_Knicely & marcothesane. Do you forsee any performance issues using cross join for a table with about ~100 million rows.
Hi @ic10134
A CROSS JOIN is always problematic, sure - but it's still the best means, in my experience, to expand an existing set of rows like in the way you need it.
If the series of integers goes above the millions of rows - you might want to be careful - but I'm not all too worried in the case of a dozen or so rows ....
Happy playing ...
Thanks, but it's seems to be taking for ever even for 10k records. Is there any other way this can be achieved without CROSS JOIN COLUMNS
This simple test works pretty fast..
Right I understood that. But running for 10k-50k rows dataset is taking for ever
Takes about 7 seconds to process 50K on my tiny cluster:
Thanks marcothesane. This is really neat! Will there be any issue if the text has space or comma in the content
Well - the space / comma will end up in the middle of a split string ... just try it ...