Error when attempting to insert unicode into Vertica via python ODBC

I get an error when attempting to insert Unicode into Vertica via Python's ODBC driver.

Software info:

  • Python 2.6.4
  • Pyodbc 3.0.6
  • unixODBC 2.2.14-12
  • vertica-odbc 6.0.0-0
  • RedHat Enterprise Linux 6.4

In Vertica I did:

    create table unitest (a varchar(100));

Then in Python I did:

    In [1]: import pyodbc
    In [2]: con = pyodbc.connect('DSN=prod_vertica', ansi=False)
    In [3]: cursor = con.cursor()
    In [4]: d1 = u'Apples'
    In [5]: d2 = 'Pears'
    In [6]: cursor.execute("insert into unitest(a) values(?)", [d1])  # Fails on Unicode
    ---------------------------------------------------------------------------
    DataError                                 Traceback (most recent call last)
    in ()
    ----> 1 cursor.execute("insert into unitest(a) values(?)", [d1])

    DataError: ('22001', "[22001] [Vertica][ODBC] (10170) String data right truncation on data from data source: String data is too big for the driver's data buffer. (10170) (SQLPutData)")

    In [7]: cursor.execute("insert into unitest(a) values(?)", [d2])  # Succeeds
    Out[7]:
    In [8]: cursor.execute("insert into unitest(a) values(?)", [str(d1)])  # Succeeds once data is cast into an ANSI string
    Out[8]:
    In [9]: cursor.commit()

Comments

  • This problem can happen when a target field in Vertica is too narrow to store the data provided by the application.

    One common source of this error is with CHAR and VARCHAR (CHARACTER VARYING)
    columns.  In Vertica, character-type columns are sized to a number of
    bytes.  Depending on the characters used, the storage required per
    character varies from one to four bytes.  For most common character
    sets, two bytes per character is sufficient.  All character data is
    encoded using UTF-8 format for Unicode characters.

    Thus, to accept arbitrary character data in Vertica, you should size the
    column to allow two or more bytes per character you expect to store.

    Vertica limits the storage cost of the wider column by stripping off
    padding characters and by encoding and compressing the data on disk.
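The byte-versus-character distinction above is easy to check from Python: what a Vertica VARCHAR(n) counts is the UTF-8 encoded byte length, not the character count. A quick illustration (the helper name is mine, not from the thread):

```python
# -*- coding: utf-8 -*-

def utf8_bytes(s):
    """Bytes a string occupies once UTF-8 encoded (what VARCHAR(n) counts)."""
    return len(s.encode('utf-8'))

print(utf8_bytes(u'Apples'))        # ASCII: 1 byte per character -> 6 bytes
print(utf8_bytes(u'\u6c49\u8bed'))  # CJK: 3 bytes per character  -> 6 bytes
```

So a two-character Chinese string needs the same column width as a six-character ASCII string, which is why sizing by character count alone can surprise you.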
  • In my example above, the column is defined as varchar(100) and the Unicode string I'm attempting to store in it is u'Apples'.
  • pyodbc is not tested by Vertica: it doesn't support all data types, and it reports the wrong SQL types.
    Read this thread for more info: http://vertica-forums.com/viewtopic.php?f=35&t=1629&p=5366

    PS
    It doesn't matter whether the bug is in Vertica or in pyodbc; either way, it just doesn't work well.
    I don't understand why Vertica claims that pyodbc is supported.

    And I prefer the wording "pyodbc is not supported" to "poor support" (because Vertica is not responsible for pyodbc, but it is responsible for its own documentation).

  • Hi Daniel, for posting cross-links, we'd like to move towards something more like the StackOverflow model:  You're welcome to post links, but please summarize the content at the link when you do so that people know what they'd be getting by clicking on the link.

    The summary of that link is something like "Vertica's INTERVAL data type doesn't work via PyODBC."

    This is correct:  Supporting a product or tool, as you know, means enabling it to do what it is supposed/documented to do.  PyODBC allows users to access the following SQL data types:

    http://code.google.com/p/pyodbc/wiki/DataTypes

    INTERVAL is not on that list.  Therefore, unsurprisingly (and as we're well aware here), if you ask PyODBC for an interval, regardless of what database you're talking to, you won't get one.

    Is this a feature that PyODBC ought to consider adding?  In my opinion, yes; I think it would be very useful and would love to have it.  But right now they don't intend current releases to provide that functionality.

    "Support" for any third-party tool means "we guarantee that it will do with Vertica what it claims to be able to do generally."  It does not mean that we'll make tools somehow do more than they're documented/intended to be able to do.  If what a tool claims to be able to do is not what you want, then yes, it would be appropriate to use a different tool.  (Or to push on the tool in question to add the core functionality that you want; at which point, in our case, if the tool is supported, we'll make sure that the functionality works with Vertica.)
  • Hi David,

    Could you run the following command in Python?:

    import sys; print "%x" % sys.maxunicode

    Also, could you confirm the version of pyodbc that you are running?:

    import pyodbc; print pyodbc.version

    This sounds a lot like an old bug in pyodbc that was theoretically resolved in version 3.0.2.  Specifically, in Python versions where "sys.maxunicode" is greater than 0xffff, Python stores Unicode strings internally in UTF-32 format.  unixODBC typically expects wide strings to be in UTF-16 format; and pyodbc versions prior to the above did not convert when necessary.

    If this is your issue, the fix is to either update your pyodbc version or use a Python that's compiled to use "UCS2" as its internal Unicode encoding.

    All that said, you do say that you're using 3.0.6, which we support and which should contain this fix.  That's why I'd like to double-check that Python's importing the version that you expect that it's importing, before proceeding.

    If you're still seeing this issue in pyodbc 3.0.6, could you provide some more detail about your environment?  Are you using a custom build of unixODBC?  (Or is it from RHEL 6.4's yum repository?)  I don't see the issue on our end with those package versions.
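The two checks above can be combined into one small diagnostic (a sketch; the helper name is mine). On a UCS-4/UTF-32 Python build, sys.maxunicode is 0x10ffff; on a narrow UCS-2/UTF-16 build it is 0xffff:

```python
import sys

def unicode_build(maxunicode):
    """Classify a Python build by its internal Unicode representation."""
    # Wide (UCS-4 / UTF-32) builds report maxunicode = 0x10ffff;
    # narrow (UCS-2 / UTF-16) builds report 0xffff.
    return 'UCS4' if maxunicode > 0xffff else 'UCS2'

print("%x -> %s" % (sys.maxunicode, unicode_build(sys.maxunicode)))
```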

    Thanks,
    Adam

  • Also, it is likely that you can work around the issue by explicitly encoding your string as UTF-8:

    >>> myStr = u'abc'.encode('UTF-8')

    I don't claim that it's a clean solution (it shouldn't be needed), but it should work.
  • Interesting
    >>> import sys; print "%x" % sys.maxunicode
    10ffff
    >>> import pyodbc; print pyodbc.version
    3.0.6
    >>> u'abc'.encode('UTF-8')
    'abc'
    >>> con = pyodbc.connect('DSN=prod_vertica', ansi=False)
    >>> cursor = con.cursor()
    >>> d1 = u'Apples'
    >>> cursor.execute("insert into unitest(a) values(?)", [d1.encode('UTF-8')])
    <pyodbc.Cursor object at 0x7ffff6a693f0>
    >>> con.commit()

    However that places a bit of a burden on the other side of things:
    >>> d2 = u'test_chinese/铁丝.txt'
    >>> d2.encode('UTF-8')
    'test_chinese/\xe9\x93\x81\xe4\xb8\x9d.txt'
    >>> cursor.execute("insert into unitest(a) values(?)", [d2.encode('UTF-8')])
    <pyodbc.Cursor object at 0x7ffff6a693f0>
    >>> con.commit()
    >>> cursor.execute("SELECT * FROM unitest WHERE a like 'test_%'")
    <pyodbc.Cursor object at 0x7ffff6a693f0>
    >>> cursor.fetchall()
    [('test_chinese/\xe9\x93\x81\xe4\xb8\x9d.txt', )]

  • On the other side, you can decode the output with:

    val.decode('UTF-8')

    One more question about the environment, though -- where did you get your pyodbc module?  I don't recall 3.0.6 being stock on RHEL 6, though it might be available in 6.4?

    This could be caused by taking a version of the module that was originally compiled against a UTF-16 Python build and using it on a UTF-32 Python build.  (If that's the case, you'll likely hit other bugs too, sooner or later -- Python doesn't do that sort of binary compatibility.)

    If you built the module from source on this system, you should in theory be good to go.
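Putting the two halves of the workaround together, encoding unicode parameters before binding and decoding byte strings after fetching, might look like this (a sketch; the helper names are mine, written for Python 2's str/unicode split but also runnable on Python 3):

```python
def encode_param(value):
    """Encode text parameters to UTF-8 bytes before binding; pass others through."""
    try:
        text_type = unicode   # Python 2
    except NameError:
        text_type = str       # Python 3
    if isinstance(value, text_type):
        return value.encode('utf-8')
    return value

def decode_field(value):
    """Decode UTF-8 byte strings returned by fetchall(); pass others through."""
    if isinstance(value, bytes):
        return value.decode('utf-8')
    return value
```

With these, the insert from the session above becomes cursor.execute(sql, [encode_param(d2)]), and each fetched field goes through decode_field before use, so the rest of the application keeps working with unicode objects.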
  • OK, I think we are narrowing in on the issue here.  At my work we perform our own builds of the Python interpreter so as not to be tied to the version shipped by our distro vendor.

    I asked around about our builds, and we are running one built with the non-default internal Unicode representation, via the "--enable-unicode=ucs4" configure flag. Reading up a little on UCS-2 and UCS-4, it appears that UCS-4 lets us represent characters outside of the BMP.

    As for pyodbc I downloaded 3.0.6 myself and did an install against our own Python build.

    So where do we go from here? Do I need to recompile unixODBC to support UCS-4? Or is this a pyodbc issue, or an issue with Vertica's ODBC driver?

    On a side note, I originally posted my issue here, but our customer rep asked me to file a ticket with support@vertica.com, so this is also being tracked by Case 00022806.
  • Hi!


    >> Do I need to recompile unixODBC to support ucs4
    No

    >> is this a pyodbc or an issue with Vertica's ODBC driver?
    You can check it with isql. If it works with isql, then it's a pyodbc issue; if not, it's a Vertica driver issue.

    PS
    I think something is wrong with your environment, because it works for me.

    What is the output of:
    $ python -c "import sys;print(sys.maxunicode<66000 and'UCS2'or'UCS4')"
    ANSI = TRUE
    >>> import pyodbc
    >>> vertica = pyodbc.connect(dsn='Vertica', ansi=True)
    >>> cursor = vertica.execute('insert into unitest values (?, ?)', 'ascii', u'unicode')
    >>> cursor.commit()
    >>> for record in vertica.execute('select * from unitest'):
    ...     print record
    ...
    (u'ascii', u'unicode')
    >>> vertica.close()

    ANSI = FALSE
    >>> vertica = pyodbc.connect(dsn='Vertica', ansi=False)
    >>> cursor = vertica.execute('insert into unitest values (?, ?)', 'ascii', u'unicode')
    >>> cursor.commit()
    >>> for record in vertica.execute('select * from unitest'):
    ...     print record
    ...
    (u'ascii', u'unicode')
    (u'ascii', u'unicode')
  • About the Chinese text: just don't use the 'u' prefix there; a u'' literal is a unicode object, not a UTF-8 byte string.
    >>> vertica = pyodbc.connect(dsn='Vertica', ansi=False)
    >>> chinese = '汉语'
    >>> cursor = vertica.execute('insert into unitest(col1) values (?)', chinese)
    >>> cursor.commit()
    >>> for record in vertica.execute('select * from unitest'):
    ...     print record
    ...
    (u'ascii', u'unicode')
    (u'ascii', u'unicode')
    (u'\u6c49\u8bed', None)
    >>> print u'\u6c49\u8bed'
    汉语
    >>> for record in vertica.execute('select * from unitest'):
    ...     for field in record:
    ...         print field
    ...
    ascii
    unicode
    ascii
    unicode
    汉语
    None

  • My output is:
    >>> import sys;print(sys.maxunicode<66000 and'UCS2'or'UCS4')
    UCS4
    >>> sys.maxunicode
    1114111


    Regarding Chinese characters, the u prefix determines the type of the string object:
    >>> chinese = '汉语'
    >>> print((chinese,))
    ('\xe6\xb1\x89\xe8\xaf\xad',)
    >>> isinstance(chinese, unicode)
    False
    >>> chinese = u'汉语'
    >>> print((chinese,))
    (u'\u6c49\u8bed',)
    >>> isinstance(chinese, unicode)
    True

  • Hi,

    Have you managed to find a solution to this problem?
    If so, could you please post it?

    Thanks,
