Multibyte UTF8 character length counted per byte — Vertica Forum

I am setting up an ETL process to transfer data from an Oracle database into Vertica. The source data has some bad UTF-8 data interspersed throughout it - sometimes only part of a single field.

In one case I have a field that is VARCHAR2 in the source database and VARCHAR in Vertica, 30 octets long in both. The source data in this example contains 15 invalid UTF-8 characters and one valid one, 16 octets in total. All of the invalid ones are converted in Java to the Unicode replacement character (\uFFFD - http://www.fileformat.info/info/unicode/char/0fffd/index.htm), which is 3 bytes long in UTF-8. The resultant string is 16 characters but takes 46 bytes (15 × 3 + 1), which no longer fits in the 30-octet column.
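To make the arithmetic concrete, here is a minimal Java sketch of the situation described. The class and method names are hypothetical, and the 15 invalid octets are assumed to be lone continuation bytes (0x80); the real source data may be malformed in other ways.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class ReplacementLength {
    // Assumed sample data: 15 lone continuation bytes (each invalid on its own)
    // followed by one valid ASCII character, 16 octets in total.
    static byte[] sampleBytes() {
        byte[] raw = new byte[16];
        Arrays.fill(raw, 0, 15, (byte) 0x80);
        raw[15] = (byte) 'A';
        return raw;
    }

    // The String constructor replaces malformed input with U+FFFD.
    static String decodeLenient(byte[] raw) {
        return new String(raw, StandardCharsets.UTF_8);
    }

    // Number of octets the string occupies once re-encoded as UTF-8.
    static int utf8Length(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        String s = decodeLenient(sampleBytes());
        // 15 replacement characters at 3 bytes each, plus 1 byte for 'A'
        System.out.println(s.length() + " characters, " + utf8Length(s) + " bytes");
    }
}
```

With this sample input the string has 16 characters and re-encodes to 46 bytes, matching the numbers above.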

Being able to see which parts of the data were damaged, thanks to that character, is a good thing, but a single-byte replacement would be better. I could mangle the data to fit (truncating, etc.), but that would be my last choice.
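One possible way to get a single-byte marker, sketched here under the assumption that the decoding step happens in Java code I control, is to decode with an explicit `CharsetDecoder` and swap its replacement string for an ASCII character. The class name, method name, and the choice of `?` are illustrative only, not something the ETL tool is known to use.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

class SingleByteReplacement {
    // Decode UTF-8, substituting "?" (1 byte in UTF-8) for malformed input
    // instead of the default U+FFFD (3 bytes in UTF-8).
    static String decodeWithAsciiReplacement(byte[] raw) {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPLACE)
                .onUnmappableCharacter(CodingErrorAction.REPLACE)
                .replaceWith("?");
        try {
            return decoder.decode(ByteBuffer.wrap(raw)).toString();
        } catch (CharacterCodingException e) {
            // Not expected with REPLACE actions; surfaced as unchecked for simplicity.
            throw new IllegalStateException(e);
        }
    }
}
```

If every invalid octet is replaced individually, the decoded string can never need more bytes than the source did, so it still fits the 30-octet column. The trade-off is that a substituted `?` cannot be told apart from a genuine question mark in the data.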

Any tips?
