Remove non UTF-8 characters from varchar column

Does anyone know how to remove non UTF-8 characters from string? Could not find any relevant info on web/forum.

I need to remove (replace) question mark in a diamond. Found out that ascii 56480 corresponds to E'/xa0' value and was able to remove it; but would prefer to have one code to remove all instances of these non printable characters.
Any ideas?

Answers

  • Jim_KnicelyJim_Knicely Administrator
    edited December 2019

    Hi,
    You should remove non UTF-8 characters using an ETL tool or process. But, if the data is already in Vertica, you can use a regular expression to remove all non UTF-8 characters...

    Example:

    dbadmin=> \d nonutf_test;
                                           List of Fields by Tables
     Schema |    Table    | Column |     Type     | Size | Default | Not Null | Primary Key | Foreign Key
    --------+-------------+--------+--------------+------+---------+----------+-------------+-------------
     public | nonutf_test | c      | varchar(100) |  100 |         | f        | f           |
    (1 row)
    
    dbadmin=> INSERT INTO nonutf_test SELECT E'\xa0' || 'Test!' || E'\xa0';
     OUTPUT
    --------
          1
    (1 row)
    
    dbadmin=> SELECT c, REGEXP_REPLACE(c, '[^\t\r\n\x20-\x7E]+', '', 1, 0, 'b') FROM nonutf_test ;
        c    | REGEXP_REPLACE
    ---------+----------------
     ▒Test!▒ | Test!
    (1 row)
    
    dbadmin=> SELECT * FROM nonutf_test;
        c
    ---------
     ▒Test!▒
    (1 row)
    
    dbadmin=> UPDATE /*+ DIRECT */ nonutf_test SET c = REGEXP_REPLACE(c, '[^\t\r\n\x20-\x7E]+', '', 1, 0, 'b');
     OUTPUT
    --------
          1
    (1 row)
    
    dbadmin=> SELECT * FROM nonutf_test;
       c
    -------
     Test!
    (1 row)
    

Leave a Comment

BoldItalicStrikethroughOrdered listUnordered list
Emoji
Image
Align leftAlign centerAlign rightToggle HTML viewToggle full pageToggle lights
Drop image/file