Errors in Text index with FlexTokenizer
I'm using the distributed HPE Vertica 7.2.1 Virtual Machine.
For demo purposes I'm playing with the text search feature.
I built a flex table "actors" by loading the following JSON file:
---- actors.json --------
{"firstName":"Stefano", "lastName":"Accorsi", "age": 44}
{"firstName":"George", "lastName":"Clooney", "age": 55}
{"firstName":"Robert", "lastName":"Redford", "age": 79}
{"firstName":"Jennifer", "lastName":"Lawrence", "age": 26}
{"firstName":"Nicole", "lastName":"Kidman", "age": 49}
{"firstName":"Cate", "lastName":"Blanchett", "age": 48}
{"firstName":"Marion", "lastName":"Cotillard", "age": 41}
----------
create flex table actors();
copy actors from 'actors.json' parser fjsonparser();
select compute_flextable_keys('actors');
select build_flextable_view('actors');
update actors_keys SET data_type_guess = 'integer' where key_name = 'age';
commit;
select build_flextable_view('actors');
The data in the table (and view) are correct; a select returns:
select * from actors_view;
 age | firstname | lastname
-----+-----------+-----------
  44 | Stefano   | Accorsi
  55 | George    | Clooney
  79 | Robert    | Redford
  26 | Jennifer  | Lawrence
  49 | Nicole    | Kidman
  48 | Cate      | Blanchett
  41 | Marion    | Cotillard
(7 rows)
When I create a text index on this flex table with the FlexTokenizer tokenizer, some tokens are truncated or modified (for example cloonei, georg, jennif, lawrenc and nicol):
ALTER TABLE actors ADD PRIMARY KEY (__identity__);
CREATE TEXT INDEX actors_index ON actors(__identity__, __raw__) TOKENIZER public.FlexTokenizer(long varbinary);
SELECT * FROM actors_index;
   token   | doc_id
-----------+--------
 26        |      4
 41        |      7
 44        |      1
 48        |      6
 49        |      5
 55        |      2
 79        |      3
 accorsi   |      1
 blanchett |      6
 cate      |      6
 cloonei   |      2
 cotillard |      7
 georg     |      2
 jennif    |      4
 kidman    |      5
 lawrenc   |      4
 marion    |      7
 nicol     |      5
 redford   |      3
 robert    |      3
 stefano   |      1
(21 rows)
Have I found a bug, or did I make a mistake in one of the steps?
Thanks,
Chiara
Comments
Hi
It's not a bug. Vertica, like other text search engines (e.g. Elastic), by default stores the stemmed version of each word in order to broaden the query results. You can find more details on stemming at https://en.wikipedia.org/wiki/Stemming
You can find a Vertica example that uses its built-in Stemmer function here: https://my.vertica.com/docs/7.1.x/HTML/Content/Authoring/AdministratorsGuide/Tables/SearchingaTextIndex.htm
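Because the index holds stemmed tokens, a search word generally has to be stemmed the same way before it is compared with the token column. Here is a minimal sketch against the actors_index from the question. It assumes the v_txtindex.Stemmer function from the linked documentation page is available in your version and applies the same stemming as the index; the search word 'Clooney' is only an illustration.

```sql
-- Stem the lowercased search word, then look it up among the stored tokens.
-- Assumption: v_txtindex.Stemmer applies the same stemming rules as the index.
SELECT doc_id
FROM actors_index
WHERE token = v_txtindex.Stemmer(LOWER('Clooney'));

-- Fetch the matching rows from the flex table through the indexed key column.
SELECT firstname, lastname
FROM actors
WHERE __identity__ IN (
    SELECT doc_id
    FROM actors_index
    WHERE token = v_txtindex.Stemmer(LOWER('Clooney'))
);
```

Searching for the literal string 'clooney' without stemming would probably return nothing here, since the index stores it as cloonei.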
I hope this answers your question.
Thanks