Multibyte UTF-8 character length counted per byte

I am setting up an ETL process to transfer data from an Oracle database into Vertica. The source data has invalid UTF-8 sequences scattered through it, sometimes affecting only part of a single field.

In one case I have a field that is varchar2 in the source database and varchar in Vertica, 30 octets long in both. The source data in this example contains 15 invalid UTF-8 characters and one valid one, 16 octets in total. All of the invalid ones are converted to the UTF-8 replacement character in Java (\uFFFD - http://www.fileformat.info/info/unicode/char/0fffd/index.htm), which is 3 bytes long. The resulting string is 16 characters but takes 46 bytes (15 x 3 + 1), which no longer fits in the 30-octet column.
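To make the arithmetic concrete, here is a minimal sketch that reproduces the size blow-up. The input is an assumption: 15 bytes of 0xFF (a byte that is never valid in UTF-8) followed by one ASCII letter stand in for the real field contents, which are not shown in the post. The class and method names are made up for illustration.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class ReplacementWidth {
    // Decode raw bytes as UTF-8. The String constructor silently
    // substitutes U+FFFD for every malformed sequence it encounters.
    static String decode(byte[] raw) {
        return new String(raw, StandardCharsets.UTF_8);
    }

    // Number of bytes the string occupies once re-encoded as UTF-8.
    static int utf8Length(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    // Hypothetical stand-in for the bad field: 15 invalid bytes + 1 valid one.
    static byte[] sample() {
        byte[] raw = new byte[16];
        Arrays.fill(raw, 0, 15, (byte) 0xFF);
        raw[15] = (byte) 'A';
        return raw;
    }

    public static void main(String[] args) {
        String decoded = decode(sample());
        // prints "16 chars, 46 bytes"
        System.out.println(decoded.length() + " chars, " + utf8Length(decoded) + " bytes");
    }
}
```

With this input each invalid byte becomes one U+FFFD, so 16 input octets turn into 46 output octets. Other kinds of malformed input may be grouped differently by the decoder, so the exact ratio depends on the actual bytes.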

Seeing which parts of the data were affected by that character is useful, but a single-byte replacement would be better. I could mangle the data to fit (truncating, etc.), but that would be my last choice.
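One possible direction, sketched under the assumption that the ETL code can get at the raw bytes before they are decoded (the post does not say whether it can): configure a `CharsetDecoder` to substitute a one-byte character instead of U+FFFD. The choice of `?` as the marker and the class and method names are illustrative only.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class SingleByteReplacement {
    // Decode UTF-8, replacing each malformed sequence with "?" (1 byte
    // in UTF-8) rather than the default U+FFFD (3 bytes in UTF-8).
    static String decodeWithAsciiReplacement(byte[] raw) {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPLACE)
                .onUnmappableCharacter(CodingErrorAction.REPLACE)
                .replaceWith("?");
        try {
            return decoder.decode(ByteBuffer.wrap(raw)).toString();
        } catch (CharacterCodingException e) {
            // Not expected with REPLACE configured; rethrow unchecked.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        byte[] raw = {(byte) 0xFF, (byte) 0xFF, (byte) 'A'};
        String decoded = decodeWithAsciiReplacement(raw);
        System.out.println(decoded + " -> "
                + decoded.getBytes(StandardCharsets.UTF_8).length + " bytes");
    }
}
```

The trade-off is that `?` cannot be told apart from a genuine question mark in the data, so a less common single-byte character may be a better marker if one is guaranteed not to occur in valid values.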

Any tips?
