Find abbreviated words / Non - English letters in text

shashank_p · July 2020

Hi,
Text1: 'Vertica RTM version'
Text2: 'Vertica rtm version'
My requirement is to find Text1 cases from a column data. i.e. Text which contains abbreviated data.
Kindly assist.

marcothesane · July 2020

They contain several consecutive uppercase alpha characters between word boundaries, @mosheg .
Now, @shashank_p , if FAROUTINTHEUNCHARTEDBACKWATERSOFTHEUNFASHIONABLEENDOFTHEWESTERNSPIRALARMOFTHEGALAXY (and even longer than that) is also a valid pattern you're looking for, I'll need to modify my WHERE condition from above, to : a word boundary, two or more uppercase alpha characters, a word boundary. And I use the posix character class identifer [:upper:] as above, to also accommodate "dirty" Latin letters:

WITH
input(string) AS (
          SELECT 'This is TEXT message'
UNION ALL SELECT 'HYPERLOOP is a concept'
UNION ALL SELECT 'stock prices soar today'
UNION ALL SELECT 'Laptop is great device'
UNION ALL SELECT 'Capture these WORDS'
UNION ALL SELECT 'FAROUTINTHEUNCHARTEDBACKWATERSOFTHEUNFASHIONABLEENDOFTHEWESTERNSPIRALARMOFTHEGALAXY'
)
SELECT
  *
FROM input
WHERE REGEXP_LIKE(string,'\b[[:upper:]]{2,}\b')
;

returning:

                                       string                                        
-------------------------------------------------------------------------------------
 This is TEXT message
 HYPERLOOP is a concept
 Capture these WORDS
 FAROUTINTHEUNCHARTEDBACKWATERSOFTHEUNFASHIONABLEENDOFTHEWESTERNSPIRALARMOFTHEGALAXY

SruthiA · July 2020

@shashank_p Please find it below

dbadmin=>

dbadmin=> select * from test_h1 WHERE regexp_ilike(string , '[^\x00-\x7F]+');
i | string
---+----------------------------------
1 | ÖV is public transport in German
2 | ÖÖÖÖ!!!
2 | Public Örchid
(3 rows)

mosheg · July 2020

If performance satisfies try the following example.
If not try to create a text index described here:
https://www.vertica.com/docs/10.0.x/HTML/Content/Authoring/AdministratorsGuide/Tables/TextSearch/TextSearchConceptual.htm
There is also an option to use a free text search with Vertica/Lucene++ UDx Integration package by Maurizio Felici.

CREATE TABLE public.t1 (row_id int, f1 varchar(50));
INSERT INTO t1 VALUES(1,'Vertica RTM version');
INSERT INTO t1 VALUES(2,'Vertica rtm version');
INSERT INTO t1 VALUES(3,'Vertica somthing else');
COMMIT;
SELECT * from t1 WHERE f1 ilike '%rtm%';
 row_id |         f1
--------+---------------------
      1 | Vertica RTM version
      2 | Vertica rtm version
(2 rows)

SELECT * from t1 WHERE f1 like '%rtm%';
 row_id |         f1
--------+---------------------
      2 | Vertica rtm version
(1 row)

shashank_p · July 2020

abbreviation is not static, it is dynamic for each row.

mosheg · July 2020

To better understand your need, kindly share few rows as sample data and a query example.

marcothesane · July 2020

Do you want to find:

find substrings that consist of , say, 2 to 5 capital alphabetic letters? and include accents and such, like in 'Ä' 'Ñ' ?
find these substrings, and then find out if they are really existing abbreviations, by looking up a dictionary ?

For the first , try:

WITH
input(string) AS (
          SELECT 'Vertica RTM Version'
UNION ALL SELECT 'Vertica rtm version'
UNION ALL SELECT 'Vertica MC version'
UNION ALL SELECT 'Vertica mc version'
UNION ALL SELECT 'This goes to the ROS'
UNION ALL SELECT 'This goes to the ros'
UNION ALL SELECT 'The wos is deprecated'
UNION ALL SELECT 'The WOS is deprecated'
)
SELECT
  *
FROM input
WHERE REGEXP_LIKE(string,'[A-Z]{2,4}') 
;

which would return:

        string         
-----------------------
 Vertica RTM Version
 Vertica MC version
 This goes to the ROS
 The WOS is deprecated

marcothesane · July 2020

To include strange letters: This one also gets the last row:

WITH
input(string) AS (
          SELECT 'Vertica RTM Version'
UNION ALL SELECT 'Vertica rtm version'
UNION ALL SELECT 'Vertica MC version'
UNION ALL SELECT 'Vertica mc version'
UNION ALL SELECT 'This goes to the ROS'
UNION ALL SELECT 'This goes to the ros'
UNION ALL SELECT 'The wos is deprecated'
UNION ALL SELECT 'The WOS is deprecated'
UNION ALL SELECT 'ÖV is public transport in German'
)
SELECT
  *
FROM input
WHERE REGEXP_LIKE(string,'[[:upper:]]{2,4}')        
;

shashank_p · July 2020

The abbreviations are dynamic it means it can be 2 letters / 3 letters etc. all characters are in English language.
Giving you some data for better understanding. Suppose column contains 5 records.

This is TEXT message
HYPERLOOP is a concept
stock prices soar today
Laptop is great device
Capture these WORDS

Only record 1 , 2 and 5 should be picked. Let me know if you need more information.

mosheg · July 2020

Why record 1 , 2 and 5 should be picked?
Does those records include a match string to one of the strings in a dynamic list?

shashank_p · July 2020

@marcothesane You got it and this is what I needed!
@mosheg I hope you got my question, there is no separate dynamic list, but there are dynamic abbreviations in the records.
Just a follow up question since it is regarding same concept. How to find records that contain Non - English characters.
For eg.
Record 1 - Hi, How are you doing!
Record 2 - Welcome@#***
Record 3 - ÖV is public transport in German
I need condition where it should pick only Record 2. Again I tried this with multiple data, the problem is that I'm not getting right condition which can ignore all symbols accepted in English.

marcothesane · July 2020

They contain several consecutive uppercase alpha characters between word boundaries, @mosheg .
Now, @shashank_p , if FAROUTINTHEUNCHARTEDBACKWATERSOFTHEUNFASHIONABLEENDOFTHEWESTERNSPIRALARMOFTHEGALAXY (and even longer than that) is also a valid pattern you're looking for, I'll need to modify my WHERE condition from above, to : a word boundary, two or more uppercase alpha characters, a word boundary. And I use the posix character class identifer [:upper:] as above, to also accommodate "dirty" Latin letters:

WITH
input(string) AS (
          SELECT 'This is TEXT message'
UNION ALL SELECT 'HYPERLOOP is a concept'
UNION ALL SELECT 'stock prices soar today'
UNION ALL SELECT 'Laptop is great device'
UNION ALL SELECT 'Capture these WORDS'
UNION ALL SELECT 'FAROUTINTHEUNCHARTEDBACKWATERSOFTHEUNFASHIONABLEENDOFTHEWESTERNSPIRALARMOFTHEGALAXY'
)
SELECT
  *
FROM input
WHERE REGEXP_LIKE(string,'\b[[:upper:]]{2,}\b')
;

returning:

                                       string                                        
-------------------------------------------------------------------------------------
 This is TEXT message
 HYPERLOOP is a concept
 Capture these WORDS
 FAROUTINTHEUNCHARTEDBACKWATERSOFTHEUNFASHIONABLEENDOFTHEWESTERNSPIRALARMOFTHEGALAXY

marcothesane · July 2020

To find your welcome string, it's the below.
You might want to study the Internet for "Unix Regular Expressions in Perl", and especially the Posix Character class names, like here:
https://www.gnu.org/software/grep/manual/html_node/Character-Classes-and-Bracket-Expressions.html
That said: [:punct:] is a Posix character class for punctuation - which seems to work if you put it into double left and right square brackets:

WITH
input(id,string) AS (
          SELECT 1,'Hi, How are you doing!'
UNION ALL SELECT 2,'Welcome@#***'
UNION ALL SELECT 3,'ÖV is public transport in German'
UNION ALL SELECT 4,'HYPERLOOP is a concept'
UNION ALL SELECT 5,'stock prices soar today'
UNION ALL SELECT 6,'Laptop is great device'
UNION ALL SELECT 7,'Capture these WORDS'              
)
SELECT
  *
FROM input
WHERE REGEXP_LIKE(string,'[[:punct:]]{2,}')
;

Which returns:

id |    string    
---+--------------
 2 | Welcome@#***

shashank_p · July 2020

There is small mistake in question, I need condition to pick record -3 i.e. ÖV is public transport in German
Record 3 is having non-English characters. Condition should accept all English characters and symbols but it should pick foreign language records.

shashank_p · July 2020

@marcothesane Thanks for the link, I will refer the link as well.

marcothesane · July 2020

They contain several consecutive uppercase alpha characters between word boundaries, @mosheg .
Now, @shashank_p , if FAROUTINTHEUNCHARTEDBACKWATERSOFTHEUNFASHIONABLEENDOFTHEWESTERNSPIRALARMOFTHEGALAXY (and even longer than that) is also a valid pattern you're looking for, I'll need to modify my WHERE condition from above, to : a word boundary, two or more uppercase alpha characters, a word boundary. And I use the posix character class identifer [:upper:] as above, to also accommodate "dirty" Latin letters:

WITH
input(string) AS (
          SELECT 'This is TEXT message'
UNION ALL SELECT 'HYPERLOOP is a concept'
UNION ALL SELECT 'stock prices soar today'
UNION ALL SELECT 'Laptop is great device'
UNION ALL SELECT 'Capture these WORDS'
UNION ALL SELECT 'FAROUTINTHEUNCHARTEDBACKWATERSOFTHEUNFASHIONABLEENDOFTHEWESTERNSPIRALARMOFTHEGALAXY'
)
SELECT
  *
FROM input
WHERE REGEXP_LIKE(string,'\b[[:upper:]]{2,}\b')
;

returning:

                                       string                                        
-------------------------------------------------------------------------------------
 This is TEXT message
 HYPERLOOP is a concept
 Capture these WORDS
 FAROUTINTHEUNCHARTEDBACKWATERSOFTHEUNFASHIONABLEENDOFTHEWESTERNSPIRALARMOFTHEGALAXY

shashank_p · July 2020

@marcothesane your posts are repeating, can you check once.

marcothesane · July 2020

Every time I get into the Vertica Forum, my previous post is displayed as pending and not posted yet. I helped myself by just hitting the only button available - post.
This time, I just delete the whole entry, and start anew.

I went through the possibilities. I notice that, as Vertica's strings are UTF-8, anything [:alpha:] or [:alnum:] or even [:print:] includes non-ASCII single-byte characters. So I forced it, to say:

Look for anything that contains at least one character that is either punctuation - [:punct:] or has an ASCII value between 128 and 255. In octal notation in a regular expression, you can express the latter as: [\200-\377] (where 200 octal is 128 and 377 octal is 255).

So, I use: WHERE REGEXP_LIKE(string,'[[:punct:]\200-\377]') and seem to get what I want. The furst [ starts the character class I'm working with; [:punct:] means punctuation, and \200-\377 means anything between ASCII 128 and ASCII 255. The final ] ends the character class I'm searching for. I search for a string containing just one of those in the character class, otherwise I would add + to say "one or more", or {2,} to say "at least two", or {2,5} to say "between two and five".

WITH
input(id,string) AS (
          SELECT 1,'Hi, How are you doing!'
UNION ALL SELECT 2,'Welcome@#***'
UNION ALL SELECT 3,'ÖV is public transport in German'
UNION ALL SELECT 4,'HYPERLOOP is a concept'
UNION ALL SELECT 5,'stock prices soar today'
UNION ALL SELECT 6,'Laptop is great device'
UNION ALL SELECT 7,'Capture these WORDS'
)
SELECT
  *
FROM input
WHERE REGEXP_LIKE(string,'[[:punct:]\200-\377]')
;

This returns, as we had hoped:

 id |              string              
----+----------------------------------
  1 | Hi, How are you doing!
  2 | Welcome@#***
  3 | ÖV is public transport in German

That's because the ! "bang" or exclamation mark, and the comma, are also "punctuation".
If you want to exclude a specific set of punctuation characters, you will have to put the collection between brackets, for example:

[@#*\200-\377], to say, "at-sigh, or hash-sign or asterisk or anything between ASCII 128 and ASCII 255" .

shashank_p · July 2020

Try logout and login, it should resolve.
I'm clear with the punctuation part but as I mentioned above, I need records containing non-English language characters. Let me show you with some examples.

id | string

1 | ÖV is public transport in German
2 | ÖÖÖÖ!!!
3 | *!@sam
4 | welcome
5 | 123fg
6 | Public Örchid

Record - 1, 2 , 6 should only be the output. Again I need generic formula to identify non-English character data. My data contains some characters which are not English, I need to pull those records and make necessary corrections.
How I did is, I found ASCII value for that non-English character and used LIKE condition, but draw back of this method is you need to know all characters which are non-English and then get ASCII value for those. Instead, is there any other way to identify those characters?

shashank_p · July 2020

@marcothesane Did you get solution for above requirement?
Vertica forum team, someone kindly suggest.

shashank_p · July 2020

@SruthiA Thanks!

marcothesane · July 2020

You were looking for an answer yesterday, @SruthiA

Did you see my post above, from July 5th, beginning with

"Every time I get into the Vertica Forum, ...."

?

Does that one answer your question?
Cheers -

We're Moving!

Create My New Community Account Now

Find abbreviated words / Non - English letters in text

Best Answers

Answers

id | string

Leave a Comment