Errors in Text index with FlexTokenizer — Vertica Forum

I'm using the distributed HPE Vertica 7.2.1 Virtual Machine.

For demo purposes I'm playing with the text search feature.


I built a flex table "actors" by loading the following JSON file:

---- actors.json --------

{"firstName":"Stefano", "lastName":"Accorsi", "age": 44}
{"firstName":"George", "lastName":"Clooney", "age": 55}
{"firstName":"Robert", "lastName":"Redford", "age": 79}
{"firstName":"Jennifer", "lastName":"Lawrence", "age": 26}
{"firstName":"Nicole", "lastName":"Kidman", "age": 49}
{"firstName":"Cate", "lastName":"Blanchett", "age": 48}
{"firstName":"Marion", "lastName":"Cotillard", "age": 41}


create flex table actors();
copy actors from 'actors.json' parser fjsonparser();

select compute_flextable_keys('actors');
select build_flextable_view('actors');
update actors_keys SET data_type_guess = 'integer' where key_name = 'age';
select build_flextable_view('actors');


The data in the table (and in the view) is correct. A select returns:


select * from actors_view;
 age | firstname | lastname 
  44 | Stefano   | Accorsi
  55 | George    | Clooney
  79 | Robert    | Redford
  26 | Jennifer  | Lawrence
  49 | Nicole    | Kidman
  48 | Cate      | Blanchett
  41 | Marion    | Cotillard
(7 rows)


When I create a text index on this flex table with the FlexTokenizer tokenizer, some tokens are truncated or modified (cloonei, georg, jennif, lawrenc, and nicol in the output below):


ALTER TABLE actors ADD PRIMARY KEY (__identity__);
CREATE TEXT INDEX actors_index ON actors(__identity__, __raw__) TOKENIZER public.FlexTokenizer(long varbinary);

SELECT * FROM actors_index;
   token   | doc_id
 26        |      4
 41        |      7
 44        |      1
 48        |      6
 49        |      5
 55        |      2
 79        |      3
 accorsi   |      1
 blanchett |      6
 cate      |      6
 cloonei   |      2
 cotillard |      7
 georg     |      2
 jennif    |      4
 kidman    |      5
 lawrenc   |      4
 marion    |      7
 nicol     |      5
 redford   |      3
 robert    |      3
 stefano   |      1
(21 rows)


Have I found a bug, or did I make a mistake in one of the steps?
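
A possible explanation, offered as a sketch rather than a confirmed diagnosis: the altered tokens (cloonei, georg, jennif, lawrenc, nicol) are exactly what a Porter-style stemmer produces for clooney, george, jennifer, lawrence, and nicole. The tokenizer may therefore be working correctly while a default stemmer rewrites its output. Assuming this Vertica release accepts the STEMMER NONE clause of CREATE TEXT INDEX (an assumption, not verified on the 7.2.1 VM), and using the hypothetical index name actors_index_nostem, a second index could be built without stemming and compared with the first:

```sql
-- Sketch only: assumes CREATE TEXT INDEX supports STEMMER NONE in this release.
-- actors_index_nostem is a hypothetical name chosen for illustration.
CREATE TEXT INDEX actors_index_nostem ON actors(__identity__, __raw__)
    STEMMER NONE
    TOKENIZER public.FlexTokenizer(long varbinary);

-- If stemming was the cause, the name tokens should now appear unstemmed
-- (their letter case may also differ from the first index).
SELECT * FROM actors_index_nostem ORDER BY token;
```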





