Convert html encoded value to character in long varchar
We have a table with approx 50MM rows where ~7.5MM contain a long varchar column with multiple occurrences of embedded HTML encoded values (e.g.Mr. John O'Connor and his wife Jane O'Connor
).
We are trying figure out the right combination of functions to somehow replace these with the actual character so that it becomes "Mr. John O'Connor and his wife Jane O'Connor".
We can see that Vertica SQL's CHR() function will take the hex value and return the character, but is there some way we can UPDATE the column value in place such that we can automate replacing anything STARTING WITH &#x
and up to the ending ;
and replace it with the CHR(x27) value.
From a regex perspective, I believe the search for &#x(.*);
may provide group 1 as the value to put into CHR function, but I need to replace all occurrences of these types of values.
Has anyone figured this out?
Appreciate any suggestions.
Answers
I would write a Python UDSF for this. You could then use Python functions like unescape to transform the strings such as your example:
Please see the documentation for a complete example: https://docs.vertica.com/24.4.x/en/extending/developing-udxs/scalar-functions-udsfs/python-example-currency-convert/
In addition consider the following example: