Find abbreviated words / Non - English letters in text
Hi,
Text1: 'Vertica RTM version'
Text2: 'Vertica rtm version'
My requirement is to find Text1 cases from a column data. i.e. Text which contains abbreviated data.
Kindly assist.
Best Answers
-
marcothesane - Select Field - Administrator
They contain several consecutive uppercase alpha characters between word boundaries, @mosheg .
Now, @shashank_p , if FAROUTINTHEUNCHARTEDBACKWATERSOFTHEUNFASHIONABLEENDOFTHEWESTERNSPIRALARMOFTHEGALAXY (and even longer than that) is also a valid pattern you're looking for, I'll need to modify my WHERE condition from above, to : a word boundary, two or more uppercase alpha characters, a word boundary. And I use the posix character class identifer [:upper:] as above, to also accommodate "dirty" Latin letters:WITH input(string) AS ( SELECT 'This is TEXT message' UNION ALL SELECT 'HYPERLOOP is a concept' UNION ALL SELECT 'stock prices soar today' UNION ALL SELECT 'Laptop is great device' UNION ALL SELECT 'Capture these WORDS' UNION ALL SELECT 'FAROUTINTHEUNCHARTEDBACKWATERSOFTHEUNFASHIONABLEENDOFTHEWESTERNSPIRALARMOFTHEGALAXY' ) SELECT * FROM input WHERE REGEXP_LIKE(string,'\b[[:upper:]]{2,}\b') ;
returning:
string ------------------------------------------------------------------------------------- This is TEXT message HYPERLOOP is a concept Capture these WORDS FAROUTINTHEUNCHARTEDBACKWATERSOFTHEUNFASHIONABLEENDOFTHEWESTERNSPIRALARMOFTHEGALAXY
4 -
SruthiA Vertica Employee Administrator
@shashank_p Please find it below
dbadmin=> select * from test_h1;
i | string
---+----------------------------------
1 | ÖV is public transport in German
2 | ÖÖÖÖ!!!
2 | *!@sam
2 | 123fg
2 | Public Örchid
(5 rows)dbadmin=>
dbadmin=> select * from test_h1 WHERE regexp_ilike(string , '[^\x00-\x7F]+');
i | string
---+----------------------------------
1 | ÖV is public transport in German
2 | ÖÖÖÖ!!!
2 | Public Örchid
(3 rows)0
Answers
If performance satisfies try the following example.
If not try to create a text index described here:
https://www.vertica.com/docs/10.0.x/HTML/Content/Authoring/AdministratorsGuide/Tables/TextSearch/TextSearchConceptual.htm
There is also an option to use a free text search with Vertica/Lucene++ UDx Integration package by Maurizio Felici.
abbreviation is not static, it is dynamic for each row.
To better understand your need, kindly share few rows as sample data and a query example.
Do you want to find:
For the first , try:
which would return:
To include strange letters: This one also gets the last row:
The abbreviations are dynamic it means it can be 2 letters / 3 letters etc. all characters are in English language.
Giving you some data for better understanding. Suppose column contains 5 records.
This is TEXT message
HYPERLOOP is a concept
stock prices soar today
Laptop is great device
Capture these WORDS
Only record 1 , 2 and 5 should be picked. Let me know if you need more information.
Why record 1 , 2 and 5 should be picked?
Does those records include a match string to one of the strings in a dynamic list?
@marcothesane You got it and this is what I needed!
@mosheg I hope you got my question, there is no separate dynamic list, but there are dynamic abbreviations in the records.
Just a follow up question since it is regarding same concept. How to find records that contain Non - English characters.
For eg.
Record 1 - Hi, How are you doing!
Record 2 - Welcome@#***
Record 3 - ÖV is public transport in German
I need condition where it should pick only Record 2. Again I tried this with multiple data, the problem is that I'm not getting right condition which can ignore all symbols accepted in English.
They contain several consecutive uppercase alpha characters between word boundaries, @mosheg .
Now, @shashank_p , if FAROUTINTHEUNCHARTEDBACKWATERSOFTHEUNFASHIONABLEENDOFTHEWESTERNSPIRALARMOFTHEGALAXY (and even longer than that) is also a valid pattern you're looking for, I'll need to modify my WHERE condition from above, to : a word boundary, two or more uppercase alpha characters, a word boundary. And I use the posix character class identifer [:upper:] as above, to also accommodate "dirty" Latin letters:
returning:
To find your welcome string, it's the below.
You might want to study the Internet for "Unix Regular Expressions in Perl", and especially the Posix Character class names, like here:
https://www.gnu.org/software/grep/manual/html_node/Character-Classes-and-Bracket-Expressions.html
That said:
[:punct:]
is a Posix character class for punctuation - which seems to work if you put it into double left and right square brackets:Which returns:
There is small mistake in question, I need condition to pick record -3 i.e. ÖV is public transport in German
Record 3 is having non-English characters. Condition should accept all English characters and symbols but it should pick foreign language records.
@marcothesane Thanks for the link, I will refer the link as well.
They contain several consecutive uppercase alpha characters between word boundaries, @mosheg .
Now, @shashank_p , if FAROUTINTHEUNCHARTEDBACKWATERSOFTHEUNFASHIONABLEENDOFTHEWESTERNSPIRALARMOFTHEGALAXY (and even longer than that) is also a valid pattern you're looking for, I'll need to modify my WHERE condition from above, to : a word boundary, two or more uppercase alpha characters, a word boundary. And I use the posix character class identifer [:upper:] as above, to also accommodate "dirty" Latin letters:
returning:
@marcothesane your posts are repeating, can you check once.
Every time I get into the Vertica Forum, my previous post is displayed as pending and not posted yet. I helped myself by just hitting the only button available -
post
.This time, I just delete the whole entry, and start anew.
I went through the possibilities. I notice that, as Vertica's strings are UTF-8, anything
[:alpha:]
or[:alnum:]
or even[:print:]
includes non-ASCII single-byte characters. So I forced it, to say:Look for anything that contains at least one character that is either punctuation -
[:punct:]
or has an ASCII value between 128 and 255. In octal notation in a regular expression, you can express the latter as:[\200-\377]
(where 200 octal is 128 and 377 octal is 255).So, I use:
WHERE REGEXP_LIKE(string,'[[:punct:]\200-\377]')
and seem to get what I want. The furst[
starts the character class I'm working with;[:punct:]
means punctuation, and\200-\377
means anything between ASCII 128 and ASCII 255. The final]
ends the character class I'm searching for. I search for a string containing just one of those in the character class, otherwise I would add+
to say "one or more", or{2,}
to say "at least two", or{2,5}
to say "between two and five".This returns, as we had hoped:
That's because the
!
"bang" or exclamation mark, and the comma, are also "punctuation".If you want to exclude a specific set of punctuation characters, you will have to put the collection between brackets, for example:
[@#*\200-\377]
, to say, "at-sigh, or hash-sign or asterisk or anything between ASCII 128 and ASCII 255" .Try logout and login, it should resolve.
I'm clear with the punctuation part but as I mentioned above, I need records containing non-English language characters. Let me show you with some examples.
id | string
1 | ÖV is public transport in German
2 | ÖÖÖÖ!!!
3 | *!@sam
4 | welcome
5 | 123fg
6 | Public Örchid
Record - 1, 2 , 6 should only be the output. Again I need generic formula to identify non-English character data. My data contains some characters which are not English, I need to pull those records and make necessary corrections.
How I did is, I found ASCII value for that non-English character and used LIKE condition, but draw back of this method is you need to know all characters which are non-English and then get ASCII value for those. Instead, is there any other way to identify those characters?
@marcothesane Did you get solution for above requirement?
Vertica forum team, someone kindly suggest.
@SruthiA Thanks!
You were looking for an answer yesterday, @SruthiA
Did you see my post above, from July 5th, beginning with
"Every time I get into the Vertica Forum, ...."
?
Does that one answer your question?
Cheers -