Errors in Text index with FlexTokenizer

ChiaraB · January 2016

I'm using the HPE Vertica 7.2.1 Virtual Machine as distributed.

For demo purposes I'm playing with the text search feature.

I built a flex table "actors" by loading the following JSON file:

---- actors.json --------

{"firstName":"Stefano", "lastName":"Accorsi", "age": 44}
{"firstName":"George", "lastName":"Clooney", "age": 55}
{"firstName":"Robert", "lastName":"Redford", "age": 79}
{"firstName":"Jennifer", "lastName":"Lawrence", "age": 26}
{"firstName":"Nicole", "lastName":"Kidman", "age": 49}
{"firstName":"Cate", "lastName":"Blanchett", "age": 48}
{"firstName":"Marion", "lastName":"Cotillard", "age": 41}

----------

create flex table actors();
copy actors from 'actors.json' parser fjsonparser();

select compute_flextable_keys('actors');
select build_flextable_view('actors');
update actors_keys SET data_type_guess = 'integer' where key_name = 'age';
commit;
select build_flextable_view('actors');

The data in the table (and in the view) are correct when I run a select.

I then create a text index on this flex table with the FlexTokenizer tokenizer, and some of the resulting tokens are truncated or modified:

ALTER TABLE actors ADD PRIMARY KEY (__identity__);
CREATE TEXT INDEX actors_index ON actors(__identity__, __raw__) TOKENIZER public.FlexTokenizer(long varbinary);

Have I found a bug, or did I get one of the steps wrong?

Thanks,

Chiara

eli_revach · January 2016

Hi,

It's not a bug. Vertica, like other text engines (e.g. Elastic), by default stores the stemmed version of each word in order to broaden the query results. You can find more details on stemming at https://en.wikipedia.org/wiki/Stemming

For a Vertica example that uses its built-in Stemmer function, see https://my.vertica.com/docs/7.1.x/HTML/Content/Authoring/AdministratorsGuide/Tables/SearchingaTextIndex.htm
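As a minimal sketch of what this means for the index above (not verified on the 7.2.1 VM): because the index stores stemmed tokens, a search term should go through the same stemmer before it is compared with the token column. The query assumes the v_txtindex.Stemmer function described on the linked page and the default token and doc_id columns of a text index. The search term 'Clooney' and the index name actors_index_nostem are only illustrations, and the STEMMER NONE clause is an assumption to check against the CREATE TEXT INDEX documentation for your version.

```sql
-- Look up a document by a stemmed search term.
-- Assumption: v_txtindex.Stemmer is the stemmer that was applied when the index was built.
SELECT __identity__, MAPTOSTRING(__raw__)
FROM actors
WHERE __identity__ IN (
    SELECT doc_id
    FROM actors_index
    WHERE token = v_txtindex.Stemmer(LOWER('Clooney'))
);

-- Assumption: if unstemmed tokens are wanted, the index can be built without a stemmer.
-- actors_index_nostem is a hypothetical name used only for this sketch.
CREATE TEXT INDEX actors_index_nostem ON actors(__identity__, __raw__)
    STEMMER NONE
    TOKENIZER public.FlexTokenizer(long varbinary);
```

If the second statement is accepted, comparing the token columns of the two indexes should show which differences come from stemming rather than from the tokenizer.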

I hope this answers your question.

Thanks
