Find abbreviated words / Non - English letters in text

edited July 11 in General Discussion

Hi,
Text1: 'Vertica RTM version'
Text2: 'Vertica rtm version'
My requirement is to find Text1 cases from a column data. i.e. Text which contains abbreviated data.
Kindly assist.

Tagged:

Best Answers

  • SruthiASruthiA Employee
    Accepted Answer

    @shashank_p Please find it below

    dbadmin=> select * from test_h1;
    i | string
    ---+----------------------------------
    1 | ÖV is public transport in German
    2 | ÖÖÖÖ!!!
    2 | *!@sam
    2 | 123fg
    2 | Public Örchid
    (5 rows)

    dbadmin=>

    dbadmin=> select * from test_h1 WHERE regexp_ilike(string , '[^\x00-\x7F]+');
    i | string
    ---+----------------------------------
    1 | ÖV is public transport in German
    2 | ÖÖÖÖ!!!
    2 | Public Örchid
    (3 rows)

Answers

  • moshegmosheg Employee

    If performance satisfies try the following example.
    If not try to create a text index described here:
    https://www.vertica.com/docs/10.0.x/HTML/Content/Authoring/AdministratorsGuide/Tables/TextSearch/TextSearchConceptual.htm
    There is also an option to use a free text search with Vertica/Lucene++ UDx Integration package by Maurizio Felici.

    CREATE TABLE public.t1 (row_id int, f1 varchar(50));
    INSERT INTO t1 VALUES(1,'Vertica RTM version');
    INSERT INTO t1 VALUES(2,'Vertica rtm version');
    INSERT INTO t1 VALUES(3,'Vertica somthing else');
    COMMIT;
    SELECT * from t1 WHERE f1 ilike '%rtm%';
     row_id |         f1
    --------+---------------------
          1 | Vertica RTM version
          2 | Vertica rtm version
    (2 rows)
    
    SELECT * from t1 WHERE f1 like '%rtm%';
     row_id |         f1
    --------+---------------------
          2 | Vertica rtm version
    (1 row)
    
  • abbreviation is not static, it is dynamic for each row.

  • moshegmosheg Employee

    To better understand your need, kindly share few rows as sample data and a query example.

  • marcothesanemarcothesane Employee

    Do you want to find:

    • find substrings that consist of , say, 2 to 5 capital alphabetic letters? and include accents and such, like in 'Ä' 'Ñ' ?
    • find these substrings, and then find out if they are really existing abbreviations, by looking up a dictionary ?

    For the first , try:

    WITH
    input(string) AS (
              SELECT 'Vertica RTM Version'
    UNION ALL SELECT 'Vertica rtm version'
    UNION ALL SELECT 'Vertica MC version'
    UNION ALL SELECT 'Vertica mc version'
    UNION ALL SELECT 'This goes to the ROS'
    UNION ALL SELECT 'This goes to the ros'
    UNION ALL SELECT 'The wos is deprecated'
    UNION ALL SELECT 'The WOS is deprecated'
    )
    SELECT
      *
    FROM input
    WHERE REGEXP_LIKE(string,'[A-Z]{2,4}') 
    ;
    

    which would return:

            string         
    -----------------------
     Vertica RTM Version
     Vertica MC version
     This goes to the ROS
     The WOS is deprecated
    
    
  • marcothesanemarcothesane Employee

    To include strange letters: This one also gets the last row:

    WITH
    input(string) AS (
              SELECT 'Vertica RTM Version'
    UNION ALL SELECT 'Vertica rtm version'
    UNION ALL SELECT 'Vertica MC version'
    UNION ALL SELECT 'Vertica mc version'
    UNION ALL SELECT 'This goes to the ROS'
    UNION ALL SELECT 'This goes to the ros'
    UNION ALL SELECT 'The wos is deprecated'
    UNION ALL SELECT 'The WOS is deprecated'
    UNION ALL SELECT 'ÖV is public transport in German'
    )
    SELECT
      *
    FROM input
    WHERE REGEXP_LIKE(string,'[[:upper:]]{2,4}')        
    ;
    
  • The abbreviations are dynamic it means it can be 2 letters / 3 letters etc. all characters are in English language.
    Giving you some data for better understanding. Suppose column contains 5 records.

    This is TEXT message
    HYPERLOOP is a concept
    stock prices soar today
    Laptop is great device
    Capture these WORDS

    Only record 1 , 2 and 5 should be picked. Let me know if you need more information.

  • moshegmosheg Employee

    Why record 1 , 2 and 5 should be picked?
    Does those records include a match string to one of the strings in a dynamic list?

  • edited July 5

    @marcothesane You got it and this is what I needed!
    @mosheg I hope you got my question, there is no separate dynamic list, but there are dynamic abbreviations in the records.
    Just a follow up question since it is regarding same concept. How to find records that contain Non - English characters.
    For eg.
    Record 1 - Hi, How are you doing!
    Record 2 - [email protected]#***
    Record 3 - ÖV is public transport in German
    I need condition where it should pick only Record 2. Again I tried this with multiple data, the problem is that I'm not getting right condition which can ignore all symbols accepted in English.

  • marcothesanemarcothesane Employee

    They contain several consecutive uppercase alpha characters between word boundaries, @mosheg .
    Now, @shashank_p , if FAROUTINTHEUNCHARTEDBACKWATERSOFTHEUNFASHIONABLEENDOFTHEWESTERNSPIRALARMOFTHEGALAXY (and even longer than that) is also a valid pattern you're looking for, I'll need to modify my WHERE condition from above, to : a word boundary, two or more uppercase alpha characters, a word boundary. And I use the posix character class identifer [:upper:] as above, to also accommodate "dirty" Latin letters:

    WITH
    input(string) AS (
              SELECT 'This is TEXT message'
    UNION ALL SELECT 'HYPERLOOP is a concept'
    UNION ALL SELECT 'stock prices soar today'
    UNION ALL SELECT 'Laptop is great device'
    UNION ALL SELECT 'Capture these WORDS'
    UNION ALL SELECT 'FAROUTINTHEUNCHARTEDBACKWATERSOFTHEUNFASHIONABLEENDOFTHEWESTERNSPIRALARMOFTHEGALAXY'
    )
    SELECT
      *
    FROM input
    WHERE REGEXP_LIKE(string,'\b[[:upper:]]{2,}\b')
    ;
    

    returning:

                                           string                                        
    -------------------------------------------------------------------------------------
     This is TEXT message
     HYPERLOOP is a concept
     Capture these WORDS
     FAROUTINTHEUNCHARTEDBACKWATERSOFTHEUNFASHIONABLEENDOFTHEWESTERNSPIRALARMOFTHEGALAXY
    
  • marcothesanemarcothesane Employee

    To find your welcome string, it's the below.
    You might want to study the Internet for "Unix Regular Expressions in Perl", and especially the Posix Character class names, like here:
    https://www.gnu.org/software/grep/manual/html_node/Character-Classes-and-Bracket-Expressions.html
    That said: [:punct:] is a Posix character class for punctuation - which seems to work if you put it into double left and right square brackets:

    WITH
    input(id,string) AS (
              SELECT 1,'Hi, How are you doing!'
    UNION ALL SELECT 2,'[email protected]#***'
    UNION ALL SELECT 3,'ÖV is public transport in German'
    UNION ALL SELECT 4,'HYPERLOOP is a concept'
    UNION ALL SELECT 5,'stock prices soar today'
    UNION ALL SELECT 6,'Laptop is great device'
    UNION ALL SELECT 7,'Capture these WORDS'              
    )
    SELECT
      *
    FROM input
    WHERE REGEXP_LIKE(string,'[[:punct:]]{2,}')
    ;
    

    Which returns:

    id |    string    
    ---+--------------
     2 | [email protected]#***
    
    
  • There is small mistake in question, I need condition to pick record -3 i.e. ÖV is public transport in German
    Record 3 is having non-English characters. Condition should accept all English characters and symbols but it should pick foreign language records.

  • @marcothesane Thanks for the link, I will refer the link as well.

  • marcothesanemarcothesane Employee

    They contain several consecutive uppercase alpha characters between word boundaries, @mosheg .
    Now, @shashank_p , if FAROUTINTHEUNCHARTEDBACKWATERSOFTHEUNFASHIONABLEENDOFTHEWESTERNSPIRALARMOFTHEGALAXY (and even longer than that) is also a valid pattern you're looking for, I'll need to modify my WHERE condition from above, to : a word boundary, two or more uppercase alpha characters, a word boundary. And I use the posix character class identifer [:upper:] as above, to also accommodate "dirty" Latin letters:

    WITH
    input(string) AS (
              SELECT 'This is TEXT message'
    UNION ALL SELECT 'HYPERLOOP is a concept'
    UNION ALL SELECT 'stock prices soar today'
    UNION ALL SELECT 'Laptop is great device'
    UNION ALL SELECT 'Capture these WORDS'
    UNION ALL SELECT 'FAROUTINTHEUNCHARTEDBACKWATERSOFTHEUNFASHIONABLEENDOFTHEWESTERNSPIRALARMOFTHEGALAXY'
    )
    SELECT
      *
    FROM input
    WHERE REGEXP_LIKE(string,'\b[[:upper:]]{2,}\b')
    ;
    

    returning:

                                           string                                        
    -------------------------------------------------------------------------------------
     This is TEXT message
     HYPERLOOP is a concept
     Capture these WORDS
     FAROUTINTHEUNCHARTEDBACKWATERSOFTHEUNFASHIONABLEENDOFTHEWESTERNSPIRALARMOFTHEGALAXY
    
  • @marcothesane your posts are repeating, can you check once.

  • marcothesanemarcothesane Employee

    Every time I get into the Vertica Forum, my previous post is displayed as pending and not posted yet. I helped myself by just hitting the only button available - post.
    This time, I just delete the whole entry, and start anew.

    I went through the possibilities. I notice that, as Vertica's strings are UTF-8, anything [:alpha:] or [:alnum:] or even [:print:] includes non-ASCII single-byte characters. So I forced it, to say:

    Look for anything that contains at least one character that is either punctuation - [:punct:] or has an ASCII value between 128 and 255. In octal notation in a regular expression, you can express the latter as: [\200-\377] (where 200 octal is 128 and 377 octal is 255).

    So, I use: WHERE REGEXP_LIKE(string,'[[:punct:]\200-\377]') and seem to get what I want. The furst [ starts the character class I'm working with; [:punct:] means punctuation, and \200-\377 means anything between ASCII 128 and ASCII 255. The final ] ends the character class I'm searching for. I search for a string containing just one of those in the character class, otherwise I would add + to say "one or more", or {2,} to say "at least two", or {2,5} to say "between two and five".

    WITH
    input(id,string) AS (
              SELECT 1,'Hi, How are you doing!'
    UNION ALL SELECT 2,'[email protected]#***'
    UNION ALL SELECT 3,'ÖV is public transport in German'
    UNION ALL SELECT 4,'HYPERLOOP is a concept'
    UNION ALL SELECT 5,'stock prices soar today'
    UNION ALL SELECT 6,'Laptop is great device'
    UNION ALL SELECT 7,'Capture these WORDS'
    )
    SELECT
      *
    FROM input
    WHERE REGEXP_LIKE(string,'[[:punct:]\200-\377]')
    ;                                                     
    

    This returns, as we had hoped:

     id |              string              
    ----+----------------------------------
      1 | Hi, How are you doing!
      2 | [email protected]#***
      3 | ÖV is public transport in German
    

    That's because the ! "bang" or exclamation mark, and the comma, are also "punctuation".
    If you want to exclude a specific set of punctuation characters, you will have to put the collection between brackets, for example:

    [@#*\200-\377], to say, "at-sigh, or hash-sign or asterisk or anything between ASCII 128 and ASCII 255" .

  • Try logout and login, it should resolve.
    I'm clear with the punctuation part but as I mentioned above, I need records containing non-English language characters. Let me show you with some examples.

    id | string

    1 | ÖV is public transport in German
    2 | ÖÖÖÖ!!!
    3 | *!@sam
    4 | welcome
    5 | 123fg
    6 | Public Örchid

    Record - 1, 2 , 6 should only be the output. Again I need generic formula to identify non-English character data. My data contains some characters which are not English, I need to pull those records and make necessary corrections.
    How I did is, I found ASCII value for that non-English character and used LIKE condition, but draw back of this method is you need to know all characters which are non-English and then get ASCII value for those. Instead, is there any other way to identify those characters?

  • @marcothesane Did you get solution for above requirement?
    Vertica forum team, someone kindly suggest.

  • @SruthiA Thanks!

  • You were looking for an answer yesterday, @SruthiA

    Did you see my post above, from July 5th, beginning with

    "Every time I get into the Vertica Forum, ...."

    ?

    Does that one answer your question?
    Cheers -

Leave a Comment

BoldItalicStrikethroughOrdered listUnordered list
Emoji
Image
Align leftAlign centerAlign rightToggle HTML viewToggle full pageToggle lights
Drop image/file

Can't find what you're looking for? Search the Vertica Documentation, Knowledge Base, or Blog for more information.