Convert html encoded value to character in long varchar

jordzillajordzilla Vertica Customer
edited November 1 in General Discussion

We have a table with approx 50MM rows where ~7.5MM contain a long varchar column with multiple occurrences of embedded HTML encoded values (e.g.Mr. John O'Connor and his wife Jane O'Connor).
We are trying figure out the right combination of functions to somehow replace these with the actual character so that it becomes "Mr. John O'Connor and his wife Jane O'Connor".

We can see that Vertica SQL's CHR() function will take the hex value and return the character, but is there some way we can UPDATE the column value in place such that we can automate replacing anything STARTING WITH &#x and up to the ending ; and replace it with the CHR(x27) value.

From a regex perspective, I believe the search for &#x(.*); may provide group 1 as the value to put into CHR function, but I need to replace all occurrences of these types of values.

Has anyone figured this out?

Appreciate any suggestions.

Answers

  • Bryan_HBryan_H Vertica Employee Administrator

    I would write a Python UDSF for this. You could then use Python functions like unescape to transform the strings such as your example:

    import html
    print(html.unescape('Mr. John O'Connor and his wife Jane O'Connor'))
    Mr. John O'Connor and his wife Jane O'Connor
    

    Please see the documentation for a complete example: https://docs.vertica.com/24.4.x/en/extending/developing-udxs/scalar-functions-udsfs/python-example-currency-convert/

  • moshegmosheg Vertica Employee Administrator

    In addition consider the following example:

    CREATE TABLE sample_table (
        id INT,
        text_column VARCHAR(255),
        record_date DATE NOT NULL
    ) PARTITION BY record_date
    GROUP BY CALENDAR_HIERARCHY_DAY(record_date,2,2);
    
    -- Create staging_sample_table with the same structure as sample_table
    CREATE TABLE staging_sample_table (
        id INT,
        text_column VARCHAR(255),
        record_date DATE NOT NULL
    ) PARTITION BY record_date
    GROUP BY CALENDAR_HIERARCHY_DAY(record_date,2,2);
    
    -- Insert sample data
    COPY sample_table FROM STDIN DELIMITER '|' ABORT ON ERROR;
    1|Mr. John O'Connor and his wife Jane O'Connor|2023-10-01
    2|Hello 'World'!|2023-11-05
    3|Please update me next time.'today'!|2023-12-02
    \.
    
    SELECT COPY_PARTITIONS_TO_TABLE('sample_table', '2023-10-01', '2023-12-01', 'staging_sample_table');
    
    update staging_sample_table
    set text_column = REGEXP_REPLACE(text_column, '&#x([0-9A-Fa-f]+);',
                       CHR(HEX_TO_INTEGER(REGEXP_SUBSTR(text_column, '(?<=&#x)[0-9A-Fa-f]+(?=;)'))));
    
    SELECT SWAP_PARTITIONS_BETWEEN_TABLES('staging_sample_table', '2023-10-01', '2023-12-01', 'sample_table');
    SELECT * FROM sample_table ORDER BY id;
    -- SELECT * FROM sample_table ORDER BY id;
    --  id |                  text_column                  | record_date
    -- ----+-----------------------------------------------+-------------
    --   1 | Mr. John O'Connor and his wife Jane O'Connor  | 2023-10-01
    --   2 | Hello 'World'!                                | 2023-11-05
    --   3 | Please update me next time.&#x27;today&#x27;! | 2023-12-02
    -- (3 rows)
    
This discussion has been closed.