Split a string into parts

[Deleted User] · February 2018

This blog post was authored by Jim Knicely.

One of my favorite functions in Vertica is named SPLIT_PART. It splits up a string into parts by a given delimiter.

Example:

dbadmin=> SELECT split_part(my_text, ',', 1) the_first_part,
dbadmin->        split_part(my_text, ',', 2) the_second_part,
dbadmin->        split_part(my_text, ',', 3) the_third_part,
dbadmin->        split_part(my_text, ',', 4) the_fourth_part
dbadmin->   FROM (SELECT 'ONE,TWO,THREE,FOUR' my_text) foo;
the_first_part | the_second_part | the_third_part | the_fourth_part
----------------+-----------------+----------------+-----------------
ONE            | TWO             | THREE          | FOUR
(1 row)

But what if I don’t know how many parts there are in my text? As the following example shows, I can use the ROW NUMBER analytic function to dynamically split my text up!

dbadmin=> SELECT * FROM split_this_up ORDER BY 1;
id |               some_text
----+----------------------------------------
  1 | Please split this sentence up by word
  2 | And do the same with this sentence too
(2 rows)

dbadmin=> SELECT id, word
dbadmin->   FROM (SELECT id, split_part(some_text, ' ', row_number() over (PARTITION BY id)) word
dbadmin(>            FROM split_this_up
dbadmin(>            CROSS JOIN columns) foo
dbadmin->  WHERE word <> '';
id |   word
----+----------
  1 | Please
  1 | split
  1 | this
  1 | sentence
  1 | up
  1 | by
  1 | word
  2 | And
  2 | do
  2 | the
  2 | same
  2 | with
  2 | this
  2 | sentence
  2 | too
(15 rows)

Pieter_Sheth-Vo · November 2020

This is neat. What is the table columns in the phrase CROSS JOIN columns and how does that work?

EDIT Is this just using the system table v_catalog as an arbitrary table that is likely to have more rows than any string to be split has delimiters? For the user dbadmin that is likely to be true, but does that also result in millions of blank rows in the to be filtered out by WHERE word <> ''?

This might equivalently have been joined by something else, such as...
CROSS JOIN ( SELECT 1 UNION ALL SELECT 2 UNION ALL SELECT 3 UNION ALL SELECT 4 UNION ALL SELECT 5 UNION ALL SELECT 6 UNION ALL SELECT 7 UNION ALL SELECT 8 UNION ALL SELECT 9 UNION ALL SELECT 10 ) bar

marcothesane · November 2020

I tend to agree - ROW_NUMBER() on the v_catalog.columns table , when, in one of the databases I work with, I end up with over 300 million total columns in the database, is a bit like chopping off a tree to get a toothpick ...

There are actually several ways to do the same thing:

SPLIT_PART() with a CROSS JOIN of a series of consecutive INTEGERs;
replacing the spaces with a comma, then surrounding some_text with square brackets, do a String_to_Array() on that one, and applying an EXPLODE() on the resulting array
The StringTokenizerDelim() function from the Text-Index library .

The first is the slowest, the last is the fastest, in a 300,000 row table with 100 tokens each .
Here's a script for you to try, based on Jim's example.
I do like the UNION ALL SELECT of consecutive integers, up to 10, I'd say.
If there are more, I "abuse" a side effect of the TIMESERIES clause to get a series of consecutive integers. Note the two CTE-s in the first of the 3 queries ...

CREATE LOCAL TEMPORARY TABLE
indata(id,some_text) 
ON COMMIT PRESERVE ROWS AS (
          SELECT 1,'Please split this sentence up by word'
UNION ALL SELECT 2,'And do the same with this sentence too'
)
;

SELECT 
  id
, EXPLODE(
    STRING_TO_ARRAY(
      '['||REPLACE(some_text,' ',',')||']',','
    )::ARRAY[VARCHAR]
  ) OVER(PARTITION BY id) 
  AS (seq,word)
FROM indata
;

WITH
-- limits ...
l(l) AS (
          SELECT TIMESTAMPADD(us,  1  , DATE '2000-01-01')
UNION ALL SELECT TIMESTAMPADD(us, 10  , DATE '2000-01-01')
)
,
-- series of ints as side effect of TIMESERIES ...
i(i) AS (
  SELECT
    MICROSECOND(ts)
  FROM l
  TIMESERIES ts AS '1 us' OVER(ORDER BY l)
)
SELECT
  id
, i AS seq
, SPLIT_PART(some_text ,' ',i) AS nuc
FROM indata CROSS JOIN i
WHERE SPLIT_PART(some_text,' ',i) <>''
;

WITH
exploded AS (
  SELECT 
    id
  , v_txtindex.StringTokenizerDelim(some_text,' ') OVER w
  FROM indata
  WINDOW w AS (PARTITION BY id)
)
SELECT
  id
, ROW_NUMBER() OVER w AS seq
, words AS word
FROM exploded
WINDOW w AS (PARTITION BY id)
;

We're Moving!

Create My New Community Account Now

Split a string into parts

Comments