Error when attempting to insert unicode into Vertica via python ODBC
I get an error when attempting to insert Unicode into Vertica via Python's ODBC driver.

Software info:
Python 2.6.4
pyodbc 3.0.6
unixODBC 2.2.14-12
vertica-odbc 6.0.0-0
Red Hat Enterprise Linux 6.4

In Vertica I did:

create table unitest (a varchar(100));

Then in Python I did:

In [1]: import pyodbc
In [2]: con = pyodbc.connect('DSN=prod_vertica', ansi=False)
In [3]: cursor = con.cursor()
In [4]: d1 = u'Apples'
In [5]: d2 = 'Pears'
In [6]: cursor.execute("insert into unitest(a) values(?)", [d1])  # Fails on Unicode
---------------------------------------------------------------------------
DataError                                 Traceback (most recent call last)
----> 1 cursor.execute("insert into unitest(a) values(?)", [d1])
DataError: ('22001', "[22001] [Vertica][ODBC] (10170) String data right truncation on data from data source: String data is too big for the driver's data buffer. (10170) (SQLPutData)")
In [7]: cursor.execute("insert into unitest(a) values(?)", [d2])  # Succeeds
Out[7]:
In [8]: cursor.execute("insert into unitest(a) values(?)", [str(d1)])  # Succeeds once the data is cast to an ANSI string
Out[8]:
In [9]: cursor.commit()
Comments
One common source of this error is with CHAR and VARCHAR (CHARACTER VARYING)
columns. In Vertica, character-type columns are sized to a number of
bytes. Depending on the characters used, the storage required per
character varies from one to four bytes. For most common character
sets, two bytes per character is sufficient. All character data is
encoded using UTF-8 format for Unicode characters.
Thus, to accept arbitrary character data in Vertica, you should size the column to allow two or more bytes per character you expect to store.
Vertica limits the storage cost of the wider field by stripping off
padding characters, and by encoding and compressing data on disk.
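To see why the byte sizing matters, here is a small sketch (plain Python, no driver involved) showing how many UTF-8 bytes different characters need; a varchar(100) holds 100 bytes, not 100 characters:

```python
# -*- coding: utf-8 -*-
# Sketch: UTF-8 needs 1-4 bytes per character, so a varchar(100)
# column may hold far fewer than 100 non-ASCII characters.
for ch in [u"a", u"\u00e9", u"\u94c1", u"\U0001f600"]:  # ASCII, Latin, CJK, emoji
    print(repr(ch) + " -> " + str(len(ch.encode("utf-8"))) + " bytes")
```
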
Read this thread for more info: http://vertica-forums.com/viewtopic.php?f=35&t=1629&p=5366
PS
It doesn't matter whether the bug is in Vertica or in pyodbc - the combination just doesn't work well.
I don't understand why Vertica claims that pyodbc is supported.
And I would prefer the wording "pyodbc is not supported" rather than "poor support" (because Vertica is not responsible for pyodbc, but it is responsible for its own documentation).
The summary of that link is something like "Vertica's INTERVAL data type doesn't work via PyODBC."
This is correct: Supporting a product or tool, as you know, means enabling it to do what it is supposed/documented to do. PyODBC allows users to access the following SQL data types:
http://code.google.com/p/pyodbc/wiki/DataTypes
INTERVAL is not on that list. Therefore, unsurprisingly (and as we're well aware here), if you ask PyODBC for an interval, regardless of what database you're talking to, you won't get one.
Is this a feature that PyODBC ought to consider adding? In my opinion, yes; I think it would be very useful and would love to have it. But right now they don't intend current releases to provide that functionality.
"Support" for any third-party tool means "we guarantee that it will do with Vertica what it claims to be able to do generally." It does not mean that we'll make tools somehow do more than they're documented/intended to be able to do. If what a tool claims to be able to do is not what you want, then yes, it would be appropriate to use a different tool. (Or to push on the tool in question to add the core functionality that you want; at which point, in our case, if the tool is supported, we'll make sure that the functionality works with Vertica.)
Could you run the following command in Python?:
import sys; print "%x" % sys.maxunicode
Also, could you confirm the version of pyodbc that you are running?:
import pyodbc; print pyodbc.version
This sounds a lot like an old bug in pyodbc that was theoretically resolved in version 3.0.2. Specifically, in Python versions where "sys.maxunicode" is greater than 0xffff, Python stores Unicode strings internally in UTF-32 format. unixODBC typically expects wide strings to be in UTF-16 format; and pyodbc versions prior to the above did not convert when necessary.
If this is your issue, the fix is to either update your pyodbc version or use a Python that's compiled to use "UCS2" as its internal Unicode encoding.
All that said, you do say that you're using 3.0.6, which we support and which should contain this fix. That's why I'd like to double-check that Python's importing the version that you expect that it's importing, before proceeding.
If you're still seeing this issue in pyodbc 3.0.6, could you provide some more detail about your environment? Are you using a custom build of unixODBC? (Or is it from RHEL 6.4's yum repository?) I don't see the issue on our end with those package versions.
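If it helps, the two checks above can be combined into one diagnostic snippet (a sketch; it only inspects the interpreter, not the driver):

```python
# Sketch: report the interpreter's internal Unicode width. A "wide"
# (UCS4/UTF-32) build is the case that triggered the pre-3.0.2 pyodbc bug.
import sys

if sys.maxunicode > 0xffff:
    print("wide build (UCS4): strings stored as UTF-32 internally")
else:
    print("narrow build (UCS2): strings stored as UTF-16 internally")
```
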
Thanks,
Adam
As a workaround, you can encode the string to UTF-8 yourself before passing it to the driver:
>>> myStr = u'abc'.encode('UTF-8')
I don't claim that it's a clean solution; it shouldn't be needed; but it should work.
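If you need this in more than one place, the encode-before-execute workaround can be wrapped in a small helper (hypothetical name; works on both Python 2 and 3):

```python
# Hypothetical helper: encode any unicode parameters to UTF-8 byte
# strings before passing them to cursor.execute(); leave other values alone.
def utf8_params(params):
    return [p.encode("utf-8") if isinstance(p, type(u"")) else p
            for p in params]

# usage (sketch):
#   cursor.execute("insert into unitest(a) values(?)", utf8_params([d1]))
```
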
>>> import sys; print "%x" % sys.maxunicode
10ffff
>>> import pyodbc; print pyodbc.version
3.0.6
>>> u'abc'.encode('UTF-8')
'abc'
>>> con = pyodbc.connect('DSN=prod_vertica', ansi=False)
>>> cursor = con.cursor()
>>> d1 = u'Apples'
>>> cursor.execute("insert into unitest(a) values(?)", [d1.encode('UTF-8')])
<pyodbc.Cursor object at 0x7ffff6a693f0>
>>> con.commit()
However that places a bit of a burden on the other side of things:
>>> d2 = u'test_chinese/铁丝.txt'
>>> d2.encode('UTF-8')
'test_chinese/\xe9\x93\x81\xe4\xb8\x9d.txt'
>>> cursor.execute("insert into unitest(a) values(?)", [d2.encode('UTF-8')])
<pyodbc.Cursor object at 0x7ffff6a693f0>
>>> con.commit()
>>> cursor.execute("SELECT * FROM unitest WHERE a like 'test_%'")
<pyodbc.Cursor object at 0x7ffff6a693f0>
>>> cursor.fetchall()
[('test_chinese/\xe9\x93\x81\xe4\xb8\x9d.txt', )]
So each value coming back then has to be decoded by hand: val.decode('UTF-8')
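The per-value decoding can likewise be wrapped up (hypothetical helper, a sketch only):

```python
# Hypothetical helper: decode UTF-8 byte strings in a fetched row back
# into unicode text, leaving non-string values untouched.
def decode_row(row, encoding="utf-8"):
    return tuple(v.decode(encoding) if isinstance(v, bytes) else v
                 for v in row)

# usage (sketch):
#   rows = [decode_row(r) for r in cursor.fetchall()]
```
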
One more question about the environment, though -- where did you get your pyodbc module? I don't recall 3.0.6 being stock on RHEL 6, though it might be available in 6.4?
This could be caused by taking a version of the module that was originally compiled against a UTF-16 Python build and using it on a UTF-32 Python build. (If that's the case, you'll likely hit other bugs too, sooner or later -- Python doesn't do that sort of binary compatibility.)
If you built the module from source on this system, you should in theory be good to go.
I asked around about our builds, and we are running a Python built with the non-default internal Unicode representation via the "--enable-unicode=ucs4" configure flag. Reading up a little on UCS2 and UCS4, it appears that UCS4 allows us to display characters outside of the BMP.
As for pyodbc I downloaded 3.0.6 myself and did an install against our own Python build.
So where do we go from here? Do I need to recompile unixODBC to support UCS4? Or is this a pyodbc issue, or an issue with Vertica's ODBC driver?
On a side note, I originally posted my issue here, but our customer rep asked me to file a ticket with support@vertica.com, so this is also being tracked by Case 00022806.
>> Do I need to recompile unixODBC to support ucs4
No
>> is this a pyodbc or an issue with Vertica's ODBC driver?
You can check it with isql. If it works with isql, then it's a pyodbc issue; if it doesn't, it's a Vertica driver issue.
PS
I think something is wrong with your environment, because it works for me.
What is the output with ANSI = TRUE, and with ANSI = FALSE?
>>> import sys;print(sys.maxunicode<66000 and'UCS2'or'UCS4')
UCS4
>>> sys.maxunicode
1114111
Regarding Chinese characters, the u prefix determines whether the literal is a unicode object or a plain byte string:
>>> chinese = '汉语'
>>> print((chinese,))
('\xe6\xb1\x89\xe8\xaf\xad',)
>>> isinstance(chinese, unicode)
False
>>> chinese = u'汉语'
>>> print((chinese,))
(u'\u6c49\u8bed',)
>>> isinstance(chinese, unicode)
True
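For completeness, the two forms are two views of the same text and round-trip through UTF-8 (a quick sketch):

```python
# The unicode object and the byte string round-trip through UTF-8:
# encoding one yields the other, and decoding goes back.
text = u"\u6c49\u8bed"            # the unicode object
data = text.encode("utf-8")       # the UTF-8 byte string
assert data.decode("utf-8") == text
print(repr(data))
```
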
Have you managed to find a solution to this problem?
If so, could you please post it?
Thanks,