Options

Split a string into parts

This blog post was authored by Jim Knicely.

One of my favorite functions in Vertica is named SPLIT_PART. It splits up a string into parts by a given delimiter.

Example:

dbadmin=> SELECT split_part(my_text, ',', 1) the_first_part,
dbadmin->        split_part(my_text, ',', 2) the_second_part,
dbadmin->        split_part(my_text, ',', 3) the_third_part,
dbadmin->        split_part(my_text, ',', 4) the_fourth_part
dbadmin->   FROM (SELECT 'ONE,TWO,THREE,FOUR' my_text) foo;
the_first_part | the_second_part | the_third_part | the_fourth_part
----------------+-----------------+----------------+-----------------
ONE            | TWO             | THREE          | FOUR
(1 row)

But what if I don’t know how many parts there are in my text? As the following example shows, I can use the ROW NUMBER analytic function to dynamically split my text up!

dbadmin=> SELECT * FROM split_this_up ORDER BY 1;
id |               some_text
----+----------------------------------------
  1 | Please split this sentence up by word
  2 | And do the same with this sentence too
(2 rows)

dbadmin=> SELECT id, word
dbadmin->   FROM (SELECT id, split_part(some_text, ' ', row_number() over (PARTITION BY id)) word
dbadmin(>            FROM split_this_up
dbadmin(>            CROSS JOIN columns) foo
dbadmin->  WHERE word <> '';
id |   word
----+----------
  1 | Please
  1 | split
  1 | this
  1 | sentence
  1 | up
  1 | by
  1 | word
  2 | And
  2 | do
  2 | the
  2 | same
  2 | with
  2 | this
  2 | sentence
  2 | too
(15 rows)

Comments

  • Options
    edited November 2020

    This is neat. What is the table columns in the phrase CROSS JOIN columns and how does that work?

    EDIT Is this just using the system table v_catalog as an arbitrary table that is likely to have more rows than any string to be split has delimiters? For the user dbadmin that is likely to be true, but does that also result in millions of blank rows in the to be filtered out by WHERE word <> ''?

    This might equivalently have been joined by something else, such as...
    CROSS JOIN ( SELECT 1 UNION ALL SELECT 2 UNION ALL SELECT 3 UNION ALL SELECT 4 UNION ALL SELECT 5 UNION ALL SELECT 6 UNION ALL SELECT 7 UNION ALL SELECT 8 UNION ALL SELECT 9 UNION ALL SELECT 10 ) bar

  • Options
    marcothesanemarcothesane - Select Field - Administrator

    I tend to agree - ROW_NUMBER() on the v_catalog.columns table , when, in one of the databases I work with, I end up with over 300 million total columns in the database, is a bit like chopping off a tree to get a toothpick ...

    There are actually several ways to do the same thing:

    • SPLIT_PART() with a CROSS JOIN of a series of consecutive INTEGERs;
    • replacing the spaces with a comma, then surrounding some_text with square brackets, do a String_to_Array() on that one, and applying an EXPLODE() on the resulting array
    • The StringTokenizerDelim() function from the Text-Index library .

    The first is the slowest, the last is the fastest, in a 300,000 row table with 100 tokens each .
    Here's a script for you to try, based on Jim's example.
    I do like the UNION ALL SELECT of consecutive integers, up to 10, I'd say.
    If there are more, I "abuse" a side effect of the TIMESERIES clause to get a series of consecutive integers. Note the two CTE-s in the first of the 3 queries ...

    CREATE LOCAL TEMPORARY TABLE
    indata(id,some_text) 
    ON COMMIT PRESERVE ROWS AS (
              SELECT 1,'Please split this sentence up by word'
    UNION ALL SELECT 2,'And do the same with this sentence too'
    )
    ;
    
    SELECT 
      id
    , EXPLODE(
        STRING_TO_ARRAY(
          '['||REPLACE(some_text,' ',',')||']',','
        )::ARRAY[VARCHAR]
      ) OVER(PARTITION BY id) 
      AS (seq,word)
    FROM indata
    ;
    
    WITH
    -- limits ...
    l(l) AS (
              SELECT TIMESTAMPADD(us,  1  , DATE '2000-01-01')
    UNION ALL SELECT TIMESTAMPADD(us, 10  , DATE '2000-01-01')
    )
    ,
    -- series of ints as side effect of TIMESERIES ...
    i(i) AS (
      SELECT
        MICROSECOND(ts)
      FROM l
      TIMESERIES ts AS '1 us' OVER(ORDER BY l)
    )
    SELECT
      id
    , i AS seq
    , SPLIT_PART(some_text ,' ',i) AS nuc
    FROM indata CROSS JOIN i
    WHERE SPLIT_PART(some_text,' ',i) <>''
    ;
    
    WITH
    exploded AS (
      SELECT 
        id
      , v_txtindex.StringTokenizerDelim(some_text,' ') OVER w
      FROM indata
      WINDOW w AS (PARTITION BY id)
    )
    SELECT
      id
    , ROW_NUMBER() OVER w AS seq
    , words AS word
    FROM exploded
    WINDOW w AS (PARTITION BY id)
    ;
    
Sign In or Register to comment.