Data issue regarding Length and Copying the same data

Madhu_Sudhan_Re · February 2015

Hi
I have created the following table and got the following errors

Create table tab1(col1 varchar(30)) ;

And tried inserting the following data
insert into tab1(col1) values('¿Cómo se llama usted?123456789');

It showed the following error
ERROR 4800: String of 33 octets is too long for type Varchar(30)

When tried calculating the length of the string using the
select length('¿Cómo se llama usted?123456789'); the value it returned is 30

Note: There are special characters in my string that i have entered and they are valid for UTF8 encoding

Can someone help

knicely87 · February 2015

The number of octets in your string is 32:

dbadmin=> select octet_length('¿Cómo se llama usted?123456789');
octet_length
--------------
           32
(1 row)

dbadmin=> create table tab1(col1 varchar(32)) ;
CREATE TABLE

dbadmin=> insert into tab1(col1) values('¿Cómo se llama usted?123456789');
OUTPUT
--------
      1
(1 row)

dbadmin=> select * from tab1;
              col1
--------------------------------
¿Cómo se llama usted?123456789
(1 row)

sudhir_1 · February 2015

Hi,

if you know there are special characters in the string, then shouldn't you double or quadruple the length? so i would go for 60 or 90 or 120 - there is really no space wasted since its varchar any way,

"Tip: When specifying the maximum column width in a CREATE TABLE statement, use the width in bytes (octets) for any of the string types. Each UTF-8 character may require four bytes, but European languages generally require a little over one byte per character, while Oriental language generally require a little under three bytes per character"

thanks
Sudhir

Marco_Gessner · February 2015

Hi All -

Around 90% of all letters in typical European languages fit into the ASCII code page, hence are single byte. The inverted question mark/ exclamation mark as it is used in Spanish takes two bytes. Any accented Latin character (éêè) takes two bytes. Same applies to characters with diaeresis (ö) or cedilla (ç). The Euro sign(€), in UTF-8, takes three bytes.

Multiplying by four is a waste of resources - because, when materialising a VARCHAR(n), I would expect Vertica to pre-allocate the memory for the full amount of n, which is a huge waste of memory. The same applies to ETL tools trying to insert into Vertica or BI or other front end tools reading from Vertica: They will always pre-allocate the maximum possible length of a string; thus using (and taking away from others) very close to four times as much memory as actually needed.

There is no simple answer to this question. My approach - when I have to load a data file with single-byte text fields (like Western Latin in Windows, for example), is to multiply each string length by 3 for a staging table. Then to load into the staging table; then to measure MAX(OCTET_LENGTH<column>) for each string column. Finally, I would add about 5 to 10% of the length to the string fields (but only those that contain non-ASCII characters), to cater for future, longer strings.

With the results obtained, I would create the actual table, and fill it with an INSERT /*+direct */ ... SELECT ... from the staging table.

Comments welcome ...

marco the sane

We're Moving!

Create My New Community Account Now

Data issue regarding Length and Copying the same data

Comments

Leave a Comment