Convert html encoded value to character in long varchar
We have a table with approx 50MM rows where ~7.5MM contain a long varchar column with multiple occurrences of embedded HTML encoded values (e.g.Mr. John O'Connor and his wife Jane O'Connor).
We are trying figure out the right combination of functions to somehow replace these with the actual character so that it becomes "Mr. John O'Connor and his wife Jane O'Connor".
We can see that Vertica SQL's CHR() function will take the hex value and return the character, but is there some way we can UPDATE the column value in place such that we can automate replacing anything STARTING WITH &#x and up to the ending ; and replace it with the CHR(x27) value.
From a regex perspective, I believe the search for &#x(.*); may provide group 1 as the value to put into CHR function, but I need to replace all occurrences of these types of values.
Has anyone figured this out?
Appreciate any suggestions.
Answers
I would write a Python UDSF for this. You could then use Python functions like unescape to transform the strings such as your example:
import html print(html.unescape('Mr. John O'Connor and his wife Jane O'Connor')) Mr. John O'Connor and his wife Jane O'ConnorPlease see the documentation for a complete example: https://docs.vertica.com/24.4.x/en/extending/developing-udxs/scalar-functions-udsfs/python-example-currency-convert/
In addition consider the following example:
CREATE TABLE sample_table ( id INT, text_column VARCHAR(255), record_date DATE NOT NULL ) PARTITION BY record_date GROUP BY CALENDAR_HIERARCHY_DAY(record_date,2,2); -- Create staging_sample_table with the same structure as sample_table CREATE TABLE staging_sample_table ( id INT, text_column VARCHAR(255), record_date DATE NOT NULL ) PARTITION BY record_date GROUP BY CALENDAR_HIERARCHY_DAY(record_date,2,2); -- Insert sample data COPY sample_table FROM STDIN DELIMITER '|' ABORT ON ERROR; 1|Mr. John O'Connor and his wife Jane O'Connor|2023-10-01 2|Hello 'World'!|2023-11-05 3|Please update me next time.'today'!|2023-12-02 \. SELECT COPY_PARTITIONS_TO_TABLE('sample_table', '2023-10-01', '2023-12-01', 'staging_sample_table'); update staging_sample_table set text_column = REGEXP_REPLACE(text_column, '&#x([0-9A-Fa-f]+);', CHR(HEX_TO_INTEGER(REGEXP_SUBSTR(text_column, '(?<=&#x)[0-9A-Fa-f]+(?=;)')))); SELECT SWAP_PARTITIONS_BETWEEN_TABLES('staging_sample_table', '2023-10-01', '2023-12-01', 'sample_table'); SELECT * FROM sample_table ORDER BY id; -- SELECT * FROM sample_table ORDER BY id; -- id | text_column | record_date -- ----+-----------------------------------------------+------------- -- 1 | Mr. John O'Connor and his wife Jane O'Connor | 2023-10-01 -- 2 | Hello 'World'! | 2023-11-05 -- 3 | Please update me next time.'today'! | 2023-12-02 -- (3 rows)