Error when attempting to insert unicode into Vertica via python ODBC
I get an error when attempting to insert Unicode into Vertica via Python's ODBC driver.

Software info:
Python 2.6.4
pyodbc 3.0.6
unixODBC 2.2.14-12
vertica-odbc 6.0.0-0
Red Hat Enterprise Linux 6.4

In Vertica I did:

create table unitest (a varchar(100));

Then in Python I did:

In [1]: import pyodbc
In [2]: con = pyodbc.connect('DSN=prod_vertica', ansi=False)
In [3]: cursor = con.cursor()
In [4]: d1 = u'Apples'
In [5]: d2 = 'Pears'
In [6]: cursor.execute("insert into unitest(a) values(?)", [d1])  # Fails on Unicode
---------------------------------------------------------------------------
DataError                                 Traceback (most recent call last)
----> 1 cursor.execute("insert into unitest(a) values(?)", [d1])
DataError: ('22001', "[22001] [Vertica][ODBC] (10170) String data right truncation on data from data source: String data is too big for the driver's data buffer. (10170) (SQLPutData)")
In [7]: cursor.execute("insert into unitest(a) values(?)", [d2])  # Succeeds
Out[7]:
In [8]: cursor.execute("insert into unitest(a) values(?)", [str(d1)])  # Succeeds once the data is cast to an ANSI string
Out[8]:
In [9]: cursor.commit()
Comments
One common source of this error is with CHAR and VARCHAR (CHARACTER VARYING)
columns. In Vertica, character-type columns are sized to a number of
bytes. Depending on the characters used, the storage required per
character varies from one to four bytes. For most common character
sets, two bytes per character is sufficient. All character data is
encoded using UTF-8 format for Unicode characters.
Thus, to accept arbitrary character data in Vertica, you should size the column to allow two or more bytes per character you expect to store.
Vertica limits the storage cost of the wider field by stripping off
padding characters, and by encoding and compressing data on disk.
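To see why the byte sizing matters, here is a small sketch (plain Python, no driver involved) showing how many UTF-8 bytes different characters need; a varchar(100) holds 100 bytes, not 100 characters:

```python
# -*- coding: utf-8 -*-
# Sketch: UTF-8 needs 1-4 bytes per character, so a varchar(100)
# column may hold far fewer than 100 non-ASCII characters.
for ch in [u"a", u"\u00e9", u"\u94c1", u"\U0001f600"]:  # ASCII, Latin, CJK, emoji
    print(repr(ch) + " -> " + str(len(ch.encode("utf-8"))) + " bytes")
```
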
Read this thread for more info: http://vertica-forums.com/viewtopic.php?f=35&t=1629&p=5366
PS
It doesn't matter whether the bug is in Vertica or in pyodbc - the combination just doesn't work well.
I don't understand why Vertica claims that pyodbc is supported.
And I would prefer the wording "pyodbc is not supported" rather than "poor support" (because Vertica is not responsible for pyodbc, but it is responsible for its own documentation).
The summary of that link is something like "Vertica's INTERVAL data type doesn't work via PyODBC."
This is correct: Supporting a product or tool, as you know, means enabling it to do what it is supposed/documented to do. PyODBC allows users to access the following SQL data types:
http://code.google.com/p/pyodbc/wiki/DataTypes
INTERVAL is not on that list. Therefore, unsurprisingly (and as we're well aware here), if you ask PyODBC for an interval, regardless of what database you're talking to, you won't get one.
Is this a feature that PyODBC ought to consider adding? In my opinion, yes; I think it would be very useful and would love to have it. But right now they don't intend current releases to provide that functionality.
"Support" for any third-party tool means "we guarantee that it will do with Vertica what it claims to be able to do generally." It does not mean that we'll make tools somehow do more than they're documented/intended to be able to do. If what a tool claims to be able to do is not what you want, then yes, it would be appropriate to use a different tool. (Or to push on the tool in question to add the core functionality that you want; at which point, in our case, if the tool is supported, we'll make sure that the functionality works with Vertica.)
Could you run the following command in Python?:
import sys; print "%x" % sys.maxunicode
Also, could you confirm the version of pyodbc that you are running?:
import pyodbc; print pyodbc.version
This sounds a lot like an old bug in pyodbc that was theoretically resolved in version 3.0.2. Specifically, in Python versions where "sys.maxunicode" is greater than 0xffff, Python stores Unicode strings internally in UTF-32 format. unixODBC typically expects wide strings to be in UTF-16 format; and pyodbc versions prior to the above did not convert when necessary.
If this is your issue, the fix is to either update your pyodbc version or use a Python that's compiled to use "UCS2" as its internal Unicode encoding.
All that said, you do say that you're using 3.0.6, which we support and which should contain this fix. That's why I'd like to double-check that Python's importing the version that you expect that it's importing, before proceeding.
If you're still seeing this issue in pyodbc 3.0.6, could you provide some more detail about your environment? Are you using a custom build of unixODBC? (Or is it from RHEL 6.4's yum repository?) I don't see the issue on our end with those package versions.
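If it helps, the two checks above can be combined into one diagnostic snippet (a sketch; it only inspects the interpreter, not the driver):

```python
# Sketch: report the interpreter's internal Unicode width. A "wide"
# (UCS4/UTF-32) build is the case that triggered the pre-3.0.2 pyodbc bug.
import sys

if sys.maxunicode > 0xffff:
    print("wide build (UCS4): strings stored as UTF-32 internally")
else:
    print("narrow build (UCS2): strings stored as UTF-16 internally")
```
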
Thanks,
Adam
As a workaround, you can encode the string to UTF-8 yourself before passing it to the driver:
>>> myStr = u'abc'.encode('UTF-8')
I don't claim that it's a clean solution; it shouldn't be needed; but it should work.
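If you need this in more than one place, the encode-before-execute workaround can be wrapped in a small helper (hypothetical name; works on both Python 2 and 3):

```python
# Hypothetical helper: encode any unicode parameters to UTF-8 byte
# strings before passing them to cursor.execute(); leave other values alone.
def utf8_params(params):
    return [p.encode("utf-8") if isinstance(p, type(u"")) else p
            for p in params]

# usage (sketch):
#   cursor.execute("insert into unitest(a) values(?)", utf8_params([d1]))
```
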
>>> import sys; print "%x" % sys.maxunicode
10ffff
>>> import pyodbc; print pyodbc.version
3.0.6
>>> u'abc'.encode('UTF-8')
'abc'
>>> con = pyodbc.connect('DSN=prod_vertica', ansi=False)
>>> cursor = con.cursor()
>>> d1 = u'Apples'
>>> cursor.execute("insert into unitest(a) values(?)", [d1.encode('UTF-8')])
<pyodbc.Cursor object at 0x7ffff6a693f0>
>>> con.commit()
However that places a bit of a burden on the other side of things:
>>> d2 = u'test_chinese/铁丝.txt'
>>> d2.encode('UTF-8')
'test_chinese/\xe9\x93\x81\xe4\xb8\x9d.txt'
>>> cursor.execute("insert into unitest(a) values(?)", [d2.encode('UTF-8')])
<pyodbc.Cursor object at 0x7ffff6a693f0>
>>> con.commit()
>>> cursor.execute("SELECT * FROM unitest WHERE a like 'test_%'")
<pyodbc.Cursor object at 0x7ffff6a693f0>
>>> cursor.fetchall()
[('test_chinese/\xe9\x93\x81\xe4\xb8\x9d.txt', )]
So each value coming back then has to be decoded by hand: val.decode('UTF-8')
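The per-value decoding can likewise be wrapped up (hypothetical helper, a sketch only):

```python
# Hypothetical helper: decode UTF-8 byte strings in a fetched row back
# into unicode text, leaving non-string values untouched.
def decode_row(row, encoding="utf-8"):
    return tuple(v.decode(encoding) if isinstance(v, bytes) else v
                 for v in row)

# usage (sketch):
#   rows = [decode_row(r) for r in cursor.fetchall()]
```
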
One more question about the environment, though -- where did you get your pyodbc module? I don't recall 3.0.6 being stock on RHEL 6, though it might be available in 6.4?
This could be caused by taking a version of the module that was originally compiled against a UTF-16 Python build and using it on a UTF-32 Python build. (If that's the case, you'll likely hit other bugs too, sooner or later -- Python doesn't do that sort of binary compatibility.)
If you built the module from source on this system, you should in theory be good to go.
I asked around about our builds, and we are running a Python built with the non-default internal Unicode representation via the "--enable-unicode=ucs4" configure flag. Reading up a little on UCS2 and UCS4, it appears that UCS4 allows us to display characters outside of the BMP.
As for pyodbc I downloaded 3.0.6 myself and did an install against our own Python build.
So where do we go from here? Do I need to recompile unixODBC to support UCS4? Or is this a pyodbc issue, or an issue with Vertica's ODBC driver?
On a side note, I originally posted my issue here, but our customer rep asked me to file a ticket with support@vertica.com, so this is also being tracked by Case 00022806.
>> Do I need to recompile unixODBC to support ucs4
No
>> is this a pyodbc or an issue with Vertica's ODBC driver?
You can check it with isql. If it works with isql, then it's a pyodbc issue; if it doesn't, it's a Vertica driver issue.
PS
I think something is wrong with your environment, because it works for me.
What is the output with ANSI = TRUE, and with ANSI = FALSE?
>>> import sys;print(sys.maxunicode<66000 and'UCS2'or'UCS4')
UCS4
>>> sys.maxunicode
1114111
Regarding Chinese characters, the u prefix determines whether the literal is a unicode object or a plain byte string:
>>> chinese = '汉语'
>>> print((chinese,))
('\xe6\xb1\x89\xe8\xaf\xad',)
>>> isinstance(chinese, unicode)
False
>>> chinese = u'汉语'
>>> print((chinese,))
(u'\u6c49\u8bed',)
>>> isinstance(chinese, unicode)
True
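For completeness, the two forms are two views of the same text and round-trip through UTF-8 (a quick sketch):

```python
# The unicode object and the byte string round-trip through UTF-8:
# encoding one yields the other, and decoding goes back.
text = u"\u6c49\u8bed"            # the unicode object
data = text.encode("utf-8")       # the UTF-8 byte string
assert data.decode("utf-8") == text
print(repr(data))
```
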
Have you managed to find a solution to this problem?
If so, could you please post it?
Thanks,