Split a string into parts
![[Deleted User]](https://us.v-cdn.net/6029397/uploads/defaultavatar/nD0LWW9MQTB29.jpg)
This blog post was authored by Jim Knicely.
One of my favorite functions in Vertica is named SPLIT_PART. It splits up a string into parts by a given delimiter.
Example:
dbadmin=> SELECT split_part(my_text, ',', 1) the_first_part, dbadmin-> split_part(my_text, ',', 2) the_second_part, dbadmin-> split_part(my_text, ',', 3) the_third_part, dbadmin-> split_part(my_text, ',', 4) the_fourth_part dbadmin-> FROM (SELECT 'ONE,TWO,THREE,FOUR' my_text) foo; the_first_part | the_second_part | the_third_part | the_fourth_part ----------------+-----------------+----------------+----------------- ONE | TWO | THREE | FOUR (1 row)
But what if I don’t know how many parts there are in my text? As the following example shows, I can use the ROW NUMBER analytic function to dynamically split my text up!
dbadmin=> SELECT * FROM split_this_up ORDER BY 1; id | some_text ----+---------------------------------------- 1 | Please split this sentence up by word 2 | And do the same with this sentence too (2 rows) dbadmin=> SELECT id, word dbadmin-> FROM (SELECT id, split_part(some_text, ' ', row_number() over (PARTITION BY id)) word dbadmin(> FROM split_this_up dbadmin(> CROSS JOIN columns) foo dbadmin-> WHERE word <> ''; id | word ----+---------- 1 | Please 1 | split 1 | this 1 | sentence 1 | up 1 | by 1 | word 2 | And 2 | do 2 | the 2 | same 2 | with 2 | this 2 | sentence 2 | too (15 rows)
1
Comments
This is neat. What is the table
columns
in the phraseCROSS JOIN columns
and how does that work?EDIT Is this just using the system table
v_catalog
as an arbitrary table that is likely to have more rows than any string to be split has delimiters? For the userdbadmin
that is likely to be true, but does that also result in millions of blank rows in the to be filtered out byWHERE word <> ''
?This might equivalently have been joined by something else, such as...
CROSS JOIN ( SELECT 1 UNION ALL SELECT 2 UNION ALL SELECT 3 UNION ALL SELECT 4 UNION ALL SELECT 5 UNION ALL SELECT 6 UNION ALL SELECT 7 UNION ALL SELECT 8 UNION ALL SELECT 9 UNION ALL SELECT 10 ) bar
I tend to agree -
ROW_NUMBER()
on the v_catalog.columns table , when, in one of the databases I work with, I end up with over 300 million total columns in the database, is a bit like chopping off a tree to get a toothpick ...There are actually several ways to do the same thing:
SPLIT_PART()
with aCROSS JOIN
of a series of consecutive INTEGERs;some_text
with square brackets, do aString_to_Array()
on that one, and applying anEXPLODE()
on the resulting arrayStringTokenizerDelim()
function from the Text-Index library .The first is the slowest, the last is the fastest, in a 300,000 row table with 100 tokens each .
Here's a script for you to try, based on Jim's example.
I do like the UNION ALL SELECT of consecutive integers, up to 10, I'd say.
If there are more, I "abuse" a side effect of the TIMESERIES clause to get a series of consecutive integers. Note the two CTE-s in the first of the 3 queries ...