Remove non UTF-8 characters from varchar column
Does anyone know how to remove non UTF-8 characters from string? Could not find any relevant info on web/forum.
I need to remove (replace) question mark in a diamond. Found out that ascii 56480 corresponds to E'/xa0' value and was able to remove it; but would prefer to have one code to remove all instances of these non printable characters.
Any ideas?
0
Answers
Hi,
You should remove non UTF-8 characters using an ETL tool or process. But, if the data is already in Vertica, you can use a regular expression to remove all non UTF-8 characters...
Example:
dbadmin=> \d nonutf_test; List of Fields by Tables Schema | Table | Column | Type | Size | Default | Not Null | Primary Key | Foreign Key --------+-------------+--------+--------------+------+---------+----------+-------------+------------- public | nonutf_test | c | varchar(100) | 100 | | f | f | (1 row) dbadmin=> INSERT INTO nonutf_test SELECT E'\xa0' || 'Test!' || E'\xa0'; OUTPUT -------- 1 (1 row) dbadmin=> SELECT c, REGEXP_REPLACE(c, '[^\t\r\n\x20-\x7E]+', '', 1, 0, 'b') FROM nonutf_test ; c | REGEXP_REPLACE ---------+---------------- ▒Test!▒ | Test! (1 row) dbadmin=> SELECT * FROM nonutf_test; c --------- ▒Test!▒ (1 row) dbadmin=> UPDATE /*+ DIRECT */ nonutf_test SET c = REGEXP_REPLACE(c, '[^\t\r\n\x20-\x7E]+', '', 1, 0, 'b'); OUTPUT -------- 1 (1 row) dbadmin=> SELECT * FROM nonutf_test; c ------- Test! (1 row)