Vertica Collation Problem

nuhmehmetnuhmehmet Community Edition User

Hi everyone,

We have a problem with data integration. We are trying to put data to Vertica from MS SQL by using SAP DataServices, but we are getting an error(I shared in below) and whenever we try to change data types to varchar(255), we can handle that error.

We tried to change Vertica's collation but it didn't work. Before that we were seeing special characters(ş, ç eg.) as special icons like '?' or 'triangle' in SAP DataServices preview, we changed ODBC connection settings to handle that problem and it worked but still, we are getting an error if don't change default varchar values to 255.

Is there any known solution for this error?

Error
Possible causes: (1) Error in the SQL
syntax; (2) Database connection is broken; (3) Database related errors such as transaction log is full, etc.; (4) The user
defined in the datastore has insufficient privileges to execute the SQL. If the error is for preload or postload operation, or
if it is for regular load operation and load triggers are defined, check the SQL. Otherwise, for (3) and (4), contact your
local DBA

Thank you.

Best Answer

  • marcothesanemarcothesane - Select Field - Administrator
    Answer ✓

    The error message is not very informative. I would also blame SAP Data Service for not retrieving the error message from the ODBC/JDBC driver.

    What is the codepage in MS SQL? UTF-8 -> you should not have to change anything - just make sure that there is no conversion between MS SQL (and, if the data flows through it, SAP DataServices) and Vertica takes place. This is usually done by configuring the ODBC/JDBC data source. Same byte length and character length, in that case, should work.

    If you have single-byte strings in MS SQL (for example, ISO 8859-1/Western European), then double the length of the character fields in Vertica as a first step: VARCHAR(32) should become VARCHAR(64). Then, transfer the data, they should go in without truncations or data errors. Even if the Euro sign '€' is 3 bytes in UTF-8 and one byte in ISO8859-1, Western European alphabets never use double the number of bytes than characters.

    Once loaded, run a SELECT MAX(OCTET_LENGTH(affected_column)) FROM affected_table; on each affected column, and resize the VARCHAR length accordingly

Answers

  • Jim_KnicelyJim_Knicely - Select Field - Administrator

    Vertica database servers expect to receive all data in UTF-8, and Vertica outputs all data in UTF-8.

    The maximum length parameter for VARCHAR and CHAR data type refers to the number of octets (bytes) that can be stored in that field, not the number of characters. When using multi-byte UTF-8 characters, make sure to size fields to accommodate from 1 to 4 bytes per character, depending on the data.

    See: https://www.vertica.com/docs/11.0.x/HTML/Content/Authoring/AdministratorsGuide/ConfiguringTheDB/ImplementLocalesForInternationalDataSets.htm

  • nuhmehmetnuhmehmet Community Edition User

    Hi,

    Thanks for your comments, I tried to change the code page and server code page to UTF-8 in data store settings for source and target in SAP DataServices but it didn't work.

    I also tried to double the varchar lengths of fields and it worked for default code page settings. Then, I ran the select statements that @marcothesane shared for varchar columns. I realized few things, some of the columns are returning higher lengths then their default varchar lengths. For example I have a column named "LOCAL_ADDRESS_001" and this column has varchar 35 value by default, when I ran the query it returns 38, I guess that's the problem but how can I describe that kind of situation?

    Thanks.

  • marcothesanemarcothesane - Select Field - Administrator

    Take a French first name as an example.

    'Jérôme' is 6 characters long.

    CHARACTER_LENGTH('Jérôme') will accordingly return 6.

    OCTET_LENGTH('Jérôme'), in Vertica, however, will return 8, as 'é' and 'ô' take two bytes in UTF-8 - while they take 1 byte in ISO 8859-1/Western European encoding.

    VARCHAR(35) in SQL Server will mean character length. VARCHAR(35) in Vertica always means length in bytes.

  • nuhmehmetnuhmehmet Community Edition User

    Hello,

    Thank you so much @marcothesane, I'm really appreciated.

    Thanks.

  • nuhmehmetnuhmehmet Community Edition User

    Hi again,

    I want to ask one more thing if it's possible :smile:

    How can I solve this problem? Do I need to change code page in source system or can I handle this problem in SAP DataServices by changing some settings(except changing varchar value manually)?

    Thank you.

  • marcothesanemarcothesane - Select Field - Administrator

    How would you regard your problem as solved?

    1. If it's your table in Vertica, same structure as in MS SQL, filled with the same data, do as I said: double the lengths of all CHARs and VARCHARs, move the data, then profile for the maximum string length you found with the query I provided, and re-size the strings.
    2. If it's an automation of the whole operation, then run a query against the MS SQL catalog - the system tables sys.tables and sys.columns, while multiplying columns.max_length by 2 for all string types, and add the formatting to make a CREATE TABLE out of it - and run that CREATE TABLE against Vertica. Then, get the data across, and profile for maximum lengths.

    There is an automatic data type profiler in GitHub: https://github.com/marco-the-sane/sqlprofile .

    You can't change the encoding from single-byte to UTF-8 in MS SQL without rebuilding all tables in that database.

    Or is it something else you're after?

  • nuhmehmetnuhmehmet Community Edition User

    Hi,

    I was thinking about second option that you shared, I think second option is the most optimum choice for us.

    Thank you.

Leave a Comment

BoldItalicStrikethroughOrdered listUnordered list
Emoji
Image
Align leftAlign centerAlign rightToggle HTML viewToggle full pageToggle lights
Drop image/file