Multibyte UTF-8 character length counted per byte
I am setting up an ETL process to transfer data from an Oracle database into Vertica. The source data contains some invalid UTF-8 interspersed throughout it, sometimes affecting only part of a single field.
In one case I have a field that is varchar2 in the source database, varchar in Vertica, and 30 octets long in both. The source data in this example has 15 invalid UTF-8 characters and one valid one, 16 octets in total. All of the invalid ones are converted to the UTF-8 replacement character in Java (\uFFFD, http://www.fileformat.info/info/unicode/char/0fffd/index.htm), which is 3 bytes long in UTF-8, so the resulting string is 16 characters but takes 46 bytes.
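To illustrate the arithmetic, here is a quick sketch (the "A" just stands in for whatever the one valid character actually is):

    import java.nio.charset.StandardCharsets;

    public class ReplacementCharLength {
        public static void main(String[] args) {
            // 15 replacement characters plus one valid ASCII character, mirroring the example above
            String value = "\uFFFD".repeat(15) + "A";
            System.out.println(value.length());                                // 16 characters
            System.out.println(value.getBytes(StandardCharsets.UTF_8).length); // 46 bytes: 15 * 3 + 1
        }
    }

So the string fits the 30-character limit but not the 30-octet one.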
Seeing which parts of the data got messed up, via that character, is useful, but a single-byte replacement would be better. I could force the data to fit by truncating and so on, but that would be my last choice.
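Something like the following CharsetDecoder setup is roughly the direction I was considering (a sketch only, not what my ETL code currently does): swap the default 3-byte \uFFFD replacement for a single-byte character such as "?".

    import java.nio.ByteBuffer;
    import java.nio.charset.CharacterCodingException;
    import java.nio.charset.CharsetDecoder;
    import java.nio.charset.CodingErrorAction;
    import java.nio.charset.StandardCharsets;

    public class Utf8Sanitizer {
        // Decode raw bytes as UTF-8, replacing invalid sequences with a one-byte "?"
        public static String decodeWithSingleByteReplacement(byte[] raw) throws CharacterCodingException {
            CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPLACE)
                    .onUnmappableCharacter(CodingErrorAction.REPLACE)
                    .replaceWith("?"); // single byte instead of the default \uFFFD
            return decoder.decode(ByteBuffer.wrap(raw)).toString();
        }
    }

That loses the visual distinction of \uFFFD, though, so I am not sure it is the best trade-off.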
Any tips?