Errors in Text index with FlexTokenizer — Vertica Forum

I'm using the distributed HPE Vertica 7.2.1 Virtual Machine.

For demo purposes I'm playing with the text search feature.

 

I built a flex table "actors" by loading the following JSON file:

---- actors.json --------

{"firstName":"Stefano", "lastName":"Accorsi", "age": 44}
{"firstName":"George", "lastName":"Clooney", "age": 55}
{"firstName":"Robert", "lastName":"Redford", "age": 79}
{"firstName":"Jennifer", "lastName":"Lawrence", "age": 26}
{"firstName":"Nicole", "lastName":"Kidman", "age": 49}
{"firstName":"Cate", "lastName":"Blanchett", "age": 48}
{"firstName":"Marion", "lastName":"Cotillard", "age": 41}

----------

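-- Create the flex table and load the JSON file
create flex table actors();
copy actors from 'actors.json' parser fjsonparser();

-- Compute the keys, mark "age" as an integer, then build the view
select compute_flextable_keys('actors');
select build_flextable_view('actors');
update actors_keys SET data_type_guess = 'integer' where key_name = 'age';
commit;
-- Rebuild the view so it picks up the new type for "age"
select build_flextable_view('actors');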

 

The data in the table (and in the view) is correct; a select returns:

 

select * from actors_view;
 age | firstname | lastname 
-----+-----------+-----------
  44 | Stefano   | Accorsi
  55 | George    | Clooney
  79 | Robert    | Redford
  26 | Jennifer  | Lawrence
  49 | Nicole    | Kidman
  48 | Cate      | Blanchett
  41 | Marion    | Cotillard
(7 rows)

 

When I create a text index on this flex table with the FlexTokenizer tokenizer, some tokens are truncated or modified (see cloonei, georg, jennif, lawrenc and nicol below):

 

ALTER TABLE actors ADD PRIMARY KEY (__identity__);
CREATE TEXT INDEX actors_index ON actors(__identity__, __raw__) TOKENIZER public.FlexTokenizer(long varbinary);

SELECT * FROM actors_index;
   token   | doc_id
-----------+--------
 26        |      4
 41        |      7
 44        |      1
 48        |      6
 49        |      5
 55        |      2
 79        |      3
 accorsi   |      1
 blanchett |      6
 cate      |      6
 cloonei   |      2
 cotillard |      7
 georg     |      2
 jennif    |      4
 kidman    |      5
 lawrenc   |      4
 marion    |      7
 nicol     |      5
 redford   |      3
 robert    |      3
 stefano   |      1
(21 rows)
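
To pick out the affected tokens without reading the whole list, the index can be compared against the lowercased values in the view. This is only a sketch: it assumes the view columns are named firstname, lastname and age as in the output above, and that none of them contain NULLs (NOT IN would return no rows otherwise).

```sql
-- List index tokens that do not match any lowercased source value.
-- Assumes actors_view and actors_index exist as created above.
SELECT token, doc_id
FROM actors_index
WHERE token NOT IN (
    SELECT lower(firstname) FROM actors_view
    UNION ALL
    SELECT lower(lastname) FROM actors_view
    UNION ALL
    SELECT age::varchar FROM actors_view  -- ages are indexed as text tokens
)
ORDER BY token;
```

Any row returned by this query is a token that was changed somewhere between the source data and the index.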

 

Have I found a bug, or did I get one of the steps wrong?

 

Thanks,

 

Chiara
