How to Split column into multiple rows

Siddarth · June 2014

I want to split data in a column into multiple rows. How can do that in Vertica?

SERIALNO | TIME_1 | PATTERN
-----------+---------------------+---------------------------------------------------------------------------------------------------------------
194011502 | 2014-03-17 15:09:08 | 15,1392573670,0;7963,1392573670,0;7964,1392573670,0;17448,1392573670,0;23048,1392573670,0;32648,1392573670,0;

SERIALNO | TIME_1 | PATTERN
-----------------+---------------------------+----------------------
194011502 | 2014-03-17 15:09:08 | 15,1392573670,0
194011502 | 2014-03-17 15:09:08 | 7963,1392573670
194011502 | 2014-03-17 15:09:08 | 7964,1392573670,0
194011502 | 2014-03-17 15:09:08 | 17448,1392573670,0
194011502 | 2014-03-17 15:09:08 | 23048,1392573670,0
194011502 | 2014-03-17 15:09:08 | 32648,1392573670,0

Navin_C · June 2014

Hello Siddarth,

There is a way , we can do this in Vertica.

Consider this example.

Creating a table as mentioned above.


create table navin.test_column_split(
serial_number int,
time_1 timestamp,
pattern varchar(200)
)

Inserting records into this table

nnani=>  select * from navin.test_column_split; serial_number |           time_1           |                                                   pattern
---------------+----------------------------+--------------------------------------------------------------------------------------------------------------
        102123 | 2014-06-27 08:29:11.077139 | 15,1392573670,0;7963,1392573670,0;7964,1392573670,0;17448,1392573670,0;23048,1392573670,0;32648,1392573670,0
        102124 | 2014-06-27 09:03:57.350652 | 15,139257,0;7963,139257,0;7964,1392570,0;17448,1370,0;23048,1393670,0;32648,1373670,0;21365,1245879,0
        102125 | 2014-06-27 09:31:13.343068 | 15,1357,0;7963,1357,0;7964,12570,0;17448,1370,0;23048,13970,0;32648,13770,0
(3 rows)
nnani=>

Then we need to know what is the maximum number of semicolons in the records.

nnani=>   select max(regexp_count(pattern,';')) from navin.test_column_split; max
-----
   6
(1 row)
nnani=>

Here we found the maximum semicolons that a table has in single records is 6.
Remember this number as this is going to be used ahead.

Now we write our main SQL.

SELECT serial_number, time_1,split_part(pattern,';',n.digit+1) AS sep_pattern FROM navin.test_column_split 
  JOIN (SELECT 0 digit UNION ALL SELECT 1 UNION ALL SELECT 2 UNION ALL SELECT 3 UNION ALL SELECT 4 UNION ALL SELECT 5 UNION ALL SELECT 6) n
  ON LENGTH(REPLACE(pattern, ';' , '')) <= LENGTH(pattern)-n.digit
 serial_number |           time_1           |    sep_pattern---------------+----------------------------+--------------------
        102123 | 2014-06-27 08:29:11.077139 | 15,1392573670,0
        102123 | 2014-06-27 08:29:11.077139 | 7963,1392573670,0
        102123 | 2014-06-27 08:29:11.077139 | 7964,1392573670,0
        102123 | 2014-06-27 08:29:11.077139 | 17448,1392573670,0
        102123 | 2014-06-27 08:29:11.077139 | 23048,1392573670,0
        102123 | 2014-06-27 08:29:11.077139 | 32648,1392573670,0
        102124 | 2014-06-27 09:03:57.350652 | 15,139257,0
        102124 | 2014-06-27 09:03:57.350652 | 7963,139257,0
        102124 | 2014-06-27 09:03:57.350652 | 7964,1392570,0
        102124 | 2014-06-27 09:03:57.350652 | 17448,1370,0
        102124 | 2014-06-27 09:03:57.350652 | 23048,1393670,0
        102124 | 2014-06-27 09:03:57.350652 | 32648,1373670,0
        102124 | 2014-06-27 09:03:57.350652 | 21365,1245879,0
        102125 | 2014-06-27 09:31:13.343068 | 15,1357,0
        102125 | 2014-06-27 09:31:13.343068 | 7963,1357,0
        102125 | 2014-06-27 09:31:13.343068 | 7964,12570,0
        102125 | 2014-06-27 09:31:13.343068 | 17448,1370,0
        102125 | 2014-06-27 09:31:13.343068 | 23048,13970,0
        102125 | 2014-06-27 09:31:13.343068 | 32648,13770,0
(19 rows)

Note - This SELECT....UNION CLAUSE uses the number of maximum semicolons we found before this step, accordingly you need to increment the UNION clause starting from index 0.

This is how we can get a column split into number of rows according to delimiter.

Hope this helps.

Sajan_Sangraula · June 2014

Great solution from Navin. A similar approach, but something that you can do in a single pass below. You should use the maximum number of patterns that can exist in a single input row in the inline view and also do the same number of split parts, the null condition check in the where clause will make sure the right number of rows are returned. This query works even when the input table has variable number of patterns in different rows.

select a.serialno,a.time_1,decode(b.idx,1,p1,2,p2,3,p3,4,p4,5,p5,6,p6,7,p7) as ptn from (select
serialno,
time_1,
split_part(pattern,';',1) as p1,
split_part(pattern,';',2) as p2,
split_part(pattern,';',3) as p3,
split_part(pattern,';',4) as p4,
split_part(pattern,';',5) as p5,
split_part(pattern,';',6) as p6,
split_part(pattern,';',7) as p7
from pattern_input) a
cross join (select 1 as idx union all select 2 union all select 3 union all select 4 union all select 5 union all select 6 union all select 7) b where decode(b.idx,1,p1,2,p2,3,p3,4,p4,5,p5,6,p6,7,p7) is not null;

Siddarth · June 2014

Thanks both for the solution

I have used a tokenizer UDF for implementing this.

dbadmin=> SELECT * FROM data_stg;

SERIALNO | TIME_1 | PATTERN

-----------+---------------------+---------------------------------------------------------------------------------------------------------------

194011502 | 2014-03-17 15:09:08 | 15,1392573670,0;7963,1392573670,0;7964,1392573670,0;17448,1392573670,0;23048,1392573670,0;32648,1392573670,0;

194395874 | 2014-04-17 15:09:08 | 15,1392573670,0;7963,1392573670,0;7964,1392573670,0;17448,1392573670,0;23048,1392573670,0;32648,1392573670,0;

194395876 | 2014-02-17 15:09:08 | 15,1392573670,0;7963,1392573670,0;7964,1392573670,0;17448,1392573670,0;23048,1392573670,0;32648,1392573670,0;

(4 rows)

dbadmin=> SELECT SERIALNO,TIME_1,tokenize(replace(PATTERN,';',' ')) OVER (partition by SERIALNO,TIME_1) as PATTERN FROM data_stg;

SERIALNO | TIME_1 | PATTERN

-----------+---------------------+--------------------

194011502 | 2014-03-17 15:09:08 | 15,1392573670,0

194011502 | 2014-03-17 15:09:08 | 7963,1392573670,0

194011502 | 2014-03-17 15:09:08 | 7964,1392573670,0

194011502 | 2014-03-17 15:09:08 | 17448,1392573670,0

194011502 | 2014-03-17 15:09:08 | 23048,1392573670,0

194011502 | 2014-03-17 15:09:08 | 32648,1392573670,0

194011502 | 2014-03-17 15:09:08 | 15,1392573670,0

194011502 | 2014-03-17 15:09:08 | 7963,1392573670,0

194011502 | 2014-03-17 15:09:08 | 7964,1392573670,0

194011502 | 2014-03-17 15:09:08 | 17448,1392573670,0

194011502 | 2014-03-17 15:09:08 | 23048,1392573670,0

194011502 | 2014-03-17 15:09:08 | 32648,1392573670,0

194395874 | 2014-04-17 15:09:08 | 15,1392573670,0

194395874 | 2014-04-17 15:09:08 | 7963,1392573670,0

194395874 | 2014-04-17 15:09:08 | 7964,1392573670,0

194395874 | 2014-04-17 15:09:08 | 17448,1392573670,0

194395874 | 2014-04-17 15:09:08 | 23048,1392573670,0

194395874 | 2014-04-17 15:09:08 | 32648,1392573670,0

194395876 | 2014-02-17 15:09:08 | 15,1392573670,0

194395876 | 2014-02-17 15:09:08 | 7963,1392573670,0

194395876 | 2014-02-17 15:09:08 | 7964,1392573670,0

194395876 | 2014-02-17 15:09:08 | 17448,1392573670,0

194395876 | 2014-02-17 15:09:08 | 23048,1392573670,0

194395876 | 2014-02-17 15:09:08 | 32648,1392573670,0

(24 rows)

Adrian_Oprea_1 · June 2014

Why do i get the felling you already had this solution before you asked the question ?

What is great about this is that we can see Vertica can take care of the business in many way !

Great solution !

Siddarth · June 2014

Hi Adrian, solution was already there but I did not wanted to use UDF for implementing it. We are getting few columns of our cdr data in this format and I was looking for some sql based generic solution.

We're Moving!

Create My New Community Account Now

How to Split column into multiple rows

Comments

Leave a Comment