Efficiently Filtering "Test" Keywords and Handling Hebrew Text in Vertica

Hd17 · December 2024

I'm working with a Vertica database and sometimes users insert records that contain "test" or "testing" keywords in both English and Hebrew.
1. How can I efficiently filter out these test records in future queries?
2. Why VARCHAR field length behaves differently with Hebrew text?
Can you provide guidance on using REGEXP to exclude test-related entries?
How to check the content of the varchar field value?

mosheg · December 2024

To efficiently avoid records with the words "test" or "testing" in English or Hebrew in future queries, you can use Vertica's REGEXP_NOT_LIKE function.
This function allows you to filter out rows that match a specific pattern, which in this case would be the words "test", "testing",
or their Hebrew equivalent "בדיקה".

Regarding the VARCHAR field length for Hebrew text, it's important to note that Hebrew characters are typically encoded using UTF-8, which uses multiple bytes per character.
While VARCHAR in Vertica is defined in terms of bytes, not characters.
It is advised to enlarge the field varchar size when dealing with multi-byte character sets like Hebrew.

Here's an example demonstrating how to filter out test records and showing the behavior of VARCHAR with Hebrew text:

cat regexp_example.sql
-- Create a table for this demonstration
DROP TABLE IF EXISTS temp_input CASCADE;
CREATE TABLE temp_input
 ( id INT,
   string VARCHAR(100),
   valid BOOLEAN default REGEXP_NOT_LIKE(string, '(?i)test|בדיקה'))
ORDER BY valid;

-- Insert sample data into the temporary table
INSERT INTO temp_input (id, string) VALUES
(1, 'Vertica RTM Version   '),
(2, 'This is a test case   '),
(3, 'Testing in progress   '),
(4, 'Vertica MC version    '),
(5, 'TEST environment setup'),
(6, 'This goes to the ROS  '),
(7, 'בדיקה של המערכת       '),
(8, 'Data analysis         '),
(9, 'This string with Ctl M' || CHR(13) ) ;

commit;

SELECT * FROM temp_input order by id;
--  id |         string          | valid
-- ----+-------------------------+-------
--   1 | Vertica RTM Version     | t
--   2 | This is a test case     | f
--   3 | Testing in progress     | f
--   4 | Vertica MC version      | t
--   5 | TEST environment setup  | f
--   6 | This goes to the ROS    | t
--   7 | בדיקה של המערכת          | f
--   8 | Data analysis           | t
--   9 | This string with Ctl M  | t
-- (9 rows)


-- Select only valid rows
SELECT * FROM temp_input WHERE valid order by id;
--  id |         string          | valid
-- ----+-------------------------+-------
--   1 | Vertica RTM Version     | t
--   4 | Vertica MC version      | t
--   6 | This goes to the ROS    | t
--   8 | Data analysis           | t
--   9 | This string with Ctl M  | t
-- (5 rows)


-- Analyzes various characteristics of the strings in the temp_input table
SELECT id,
    LENGTH(string) AS vertica_length,
    CHAR_LENGTH(string) AS char_count,
    REGEXP_COUNT(string, '\s') || ' x 1 ' AS space_ch_count,
    REGEXP_COUNT(string, '[\x{05D0}-\x{05EA}]') || ' x 2 ' AS hebrew_ch_count,
    REGEXP_COUNT(string, '[a-zA-Z]') || ' x 1 ' AS english_ch_count,
    OCTET_LENGTH(string) AS byte_count,
    CASE
           WHEN REGEXP_LIKE(string, '[\x{05D0}-\x{05EA}]') THEN 'With Hebrew Letters'
           ELSE 'No Hebrew Letters'
    END AS hebrew_check
FROM temp_input
ORDER BY id;
--  id | vertica_length | char_count | space_ch_count | hebrew_ch_count | english_ch_count | byte_count |    hebrew_check
-- ----+----------------+------------+----------------+-----------------+------------------+------------+---------------------
--   1 |             22 |         22 |  5 x 1         |  0 x 2          | 17 x 1           |         22 | No Hebrew Letters
--   2 |             22 |         22 |  7 x 1         |  0 x 2          | 15 x 1           |         22 | No Hebrew Letters
--   3 |             22 |         22 |  5 x 1         |  0 x 2          | 17 x 1           |         22 | No Hebrew Letters
--   4 |             22 |         22 |  6 x 1         |  0 x 2          | 16 x 1           |         22 | No Hebrew Letters
--   5 |             22 |         22 |  2 x 1         |  0 x 2          | 20 x 1           |         22 | No Hebrew Letters
--   6 |             22 |         22 |  6 x 1         |  0 x 2          | 16 x 1           |         22 | No Hebrew Letters
--   7 |             22 |         22 |  9 x 1         | 13 x 2          |  0 x 1           |         35 | With Hebrew Letters
--   8 |             22 |         22 | 10 x 1         |  0 x 2          | 12 x 1           |         22 | No Hebrew Letters
--   9 |             23 |         23 |  5 x 1         |  0 x 2          | 18 x 1           |         23 | No Hebrew Letters
-- (9 rows)


-- the od -c option displays the output in character format, showing also non-printable characters
\! vsql -Xqtc "SELECT string FROM temp_input WHERE id=9;" | od -c
--
-- 0000000       T   h   i   s       s   t   r   i   n   g       w   i   t
-- 0000020   h       C   t   l       M  \r  \n  \n
-- 0000032

We're Moving!

Create My New Community Account Now

Efficiently Filtering "Test" Keywords and Handling Hebrew Text in Vertica

Best Answer