Efficiently Filtering "Test" Keywords and Handling Hebrew Text in Vertica

I'm working with a Vertica database and sometimes users insert records that contain "test" or "testing" keywords in both English and Hebrew.
1. How can I efficiently filter out these test records in future queries?
2. Why VARCHAR field length behaves differently with Hebrew text?
Can you provide guidance on using REGEXP to exclude test-related entries?
How to check the content of the varchar field value?

Best Answer

  • moshegmosheg Vertica Employee Administrator
    Answer ✓

    To efficiently avoid records with the words "test" or "testing" in English or Hebrew in future queries, you can use Vertica's REGEXP_NOT_LIKE function.
    This function allows you to filter out rows that match a specific pattern, which in this case would be the words "test", "testing",
    or their Hebrew equivalent "בדיקה".

    Regarding the VARCHAR field length for Hebrew text, it's important to note that Hebrew characters are typically encoded using UTF-8, which uses multiple bytes per character.
    While VARCHAR in Vertica is defined in terms of bytes, not characters.
    It is advised to enlarge the field varchar size when dealing with multi-byte character sets like Hebrew.

    Here's an example demonstrating how to filter out test records and showing the behavior of VARCHAR with Hebrew text:

    cat regexp_example.sql
    -- Create a table for this demonstration
    DROP TABLE IF EXISTS temp_input CASCADE;
    CREATE TABLE temp_input
     ( id INT,
       string VARCHAR(100),
       valid BOOLEAN default REGEXP_NOT_LIKE(string, '(?i)test|בדיקה'))
    ORDER BY valid;
    
    -- Insert sample data into the temporary table
    INSERT INTO temp_input (id, string) VALUES
    (1, 'Vertica RTM Version   '),
    (2, 'This is a test case   '),
    (3, 'Testing in progress   '),
    (4, 'Vertica MC version    '),
    (5, 'TEST environment setup'),
    (6, 'This goes to the ROS  '),
    (7, 'בדיקה של המערכת       '),
    (8, 'Data analysis         '),
    (9, 'This string with Ctl M' || CHR(13) ) ;
    
    commit;
    
    SELECT * FROM temp_input order by id;
    --  id |         string          | valid
    -- ----+-------------------------+-------
    --   1 | Vertica RTM Version     | t
    --   2 | This is a test case     | f
    --   3 | Testing in progress     | f
    --   4 | Vertica MC version      | t
    --   5 | TEST environment setup  | f
    --   6 | This goes to the ROS    | t
    --   7 | בדיקה של המערכת          | f
    --   8 | Data analysis           | t
    --   9 | This string with Ctl M  | t
    -- (9 rows)
    
    
    -- Select only valid rows
    SELECT * FROM temp_input WHERE valid order by id;
    --  id |         string          | valid
    -- ----+-------------------------+-------
    --   1 | Vertica RTM Version     | t
    --   4 | Vertica MC version      | t
    --   6 | This goes to the ROS    | t
    --   8 | Data analysis           | t
    --   9 | This string with Ctl M  | t
    -- (5 rows)
    
    
    -- Analyzes various characteristics of the strings in the temp_input table
    SELECT id,
        LENGTH(string) AS vertica_length,
        CHAR_LENGTH(string) AS char_count,
        REGEXP_COUNT(string, '\s') || ' x 1 ' AS space_ch_count,
        REGEXP_COUNT(string, '[\x{05D0}-\x{05EA}]') || ' x 2 ' AS hebrew_ch_count,
        REGEXP_COUNT(string, '[a-zA-Z]') || ' x 1 ' AS english_ch_count,
        OCTET_LENGTH(string) AS byte_count,
        CASE
               WHEN REGEXP_LIKE(string, '[\x{05D0}-\x{05EA}]') THEN 'With Hebrew Letters'
               ELSE 'No Hebrew Letters'
        END AS hebrew_check
    FROM temp_input
    ORDER BY id;
    --  id | vertica_length | char_count | space_ch_count | hebrew_ch_count | english_ch_count | byte_count |    hebrew_check
    -- ----+----------------+------------+----------------+-----------------+------------------+------------+---------------------
    --   1 |             22 |         22 |  5 x 1         |  0 x 2          | 17 x 1           |         22 | No Hebrew Letters
    --   2 |             22 |         22 |  7 x 1         |  0 x 2          | 15 x 1           |         22 | No Hebrew Letters
    --   3 |             22 |         22 |  5 x 1         |  0 x 2          | 17 x 1           |         22 | No Hebrew Letters
    --   4 |             22 |         22 |  6 x 1         |  0 x 2          | 16 x 1           |         22 | No Hebrew Letters
    --   5 |             22 |         22 |  2 x 1         |  0 x 2          | 20 x 1           |         22 | No Hebrew Letters
    --   6 |             22 |         22 |  6 x 1         |  0 x 2          | 16 x 1           |         22 | No Hebrew Letters
    --   7 |             22 |         22 |  9 x 1         | 13 x 2          |  0 x 1           |         35 | With Hebrew Letters
    --   8 |             22 |         22 | 10 x 1         |  0 x 2          | 12 x 1           |         22 | No Hebrew Letters
    --   9 |             23 |         23 |  5 x 1         |  0 x 2          | 18 x 1           |         23 | No Hebrew Letters
    -- (9 rows)
    
    
    -- the od -c option displays the output in character format, showing also non-printable characters
    \! vsql -Xqtc "SELECT string FROM temp_input WHERE id=9;" | od -c
    --
    -- 0000000       T   h   i   s       s   t   r   i   n   g       w   i   t
    -- 0000020   h       C   t   l       M  \r  \n  \n
    -- 0000032
    
This discussion has been closed.