RJDBC UTF-8 encoding

DerekW · August 2018

Hi all,

I'm not sure whether this is specific to Vertica, but I'm out of options, so let's see.

I'm creating a Shiny app in R that needs to read data from Vertica and load data back into Vertica. When I started on this app, I was using RODBC (which I normally use when connecting from R). The app works fine locally, but after deploying it to Shiny Server (which runs on Debian) it crashes with a segfault. I found this article, which unfortunately did not give me an answer.

After some testing and reading, I finally switched to RJDBC, which works fine with Shiny Server. The only problem I have now is the encoding of the data. Some of the data that I read from Vertica contains emojis. Somehow these get scrambled with RJDBC, which causes my app to crash. I suspect this is related to the fact that the JDBC driver converts everything from UTF-8 to UTF-16 (see also documentation here).
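To make the suspected UTF-8 to UTF-16 issue concrete, here is a minimal Python sketch of how a single emoji is represented in each encoding. It is only an illustration of the encodings themselves and does not touch Vertica or the JDBC driver; the helper name `utf16_code_units` and the sample emoji are made up for this example.

```python
# One emoji outside the Basic Multilingual Plane (U+1F600, grinning face).
emoji = "\U0001F600"

# In UTF-8 it is a single code point stored as four bytes.
utf8_bytes = emoji.encode("utf-8")


def utf16_code_units(text):
    """Return the 16-bit code units of text when encoded as UTF-16 (hypothetical helper)."""
    data = text.encode("utf-16-be")  # big-endian, no byte order mark
    return [int.from_bytes(data[i:i + 2], "big") for i in range(0, len(data), 2)]


# In UTF-16 the same code point needs two code units: a surrogate pair.
units = utf16_code_units(emoji)

print(len(utf8_bytes))
print([hex(u) for u in units])
```

If each half of such a pair is handed back as a separate character instead of being recombined, the text looks scrambled, which would match what I see with the emojis.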

Ideally I would use RODBC (I tested locally, and all emojis look fine with this method), but due to the segfault issue I cannot use it.

Any pointers on how to solve this are appreciated, because this problem is currently blocking any further progress on my app.

Kind Regards,
Derek

DerekW · August 2018

To determine whether the encoding issue is a JDBC driver problem or an R problem, I executed the same query in Python using the jaydebeapi package, which also uses the JDBC driver. Here I get the exact same problem:
UnicodeEncodeError: 'utf-8' codec can't encode characters in position 767-768: surrogates not allowed.
This suggests that the conversion goes wrong somewhere in the JDBC driver.

I also tried the native client (vertica_python) which gives correct results.
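For anyone hitting the same error on the Python side, below is a small sketch of a possible workaround. It assumes the strings coming back through the driver contain the two UTF-16 surrogate halves as separate characters, which is what the two reported positions (767-768) hint at. The function name `repair_surrogates` and the sample string are hypothetical; the code uses only Python's built-in `surrogatepass` error handler and does not fix the driver itself.

```python
def repair_surrogates(text):
    """Recombine UTF-16 surrogate halves into proper code points.

    Encoding with 'surrogatepass' writes each surrogate out as a raw
    16-bit code unit; decoding as UTF-16 then pairs them up again.
    """
    return text.encode("utf-16", "surrogatepass").decode("utf-16")


# Hypothetical value as it might come back: an emoji split into two halves.
broken = "ok \ud83d\ude00"

# Reproduce the error from the post.
try:
    broken.encode("utf-8")
    raised = False
except UnicodeEncodeError:
    raised = True

fixed = repair_surrogates(broken)

print(raised)
print(fixed.encode("utf-8"))
```

A string with a lone, unpaired surrogate would still fail in the decode step, so this only helps when both halves are present. Whether an equivalent repair is possible from R with RJDBC is not something this sketch shows.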
