The Vertica Forum recently got a makeover! Let us know what you think by filling out this short, anonymous survey.
Please take this survey to help us learn more about how you use third party tools. Your input is greatly appreciated!

Performance problem with string functions

We identified that the performance of a query using some string functions, such as: strpos, position, ...  is really bad, even compared to equivalent using regular expressions, such as regexp_like that is in our case 4 times faster than strpos.

Tests have been done on one table (no join), or to specific projections (projection done only on one field with different encoding); profile queries ran from the MC.
Statistics up to date, Database designer: OK

The field (varchar2) has 570 different values; this is a 25M rows table. 600MB compressed.

Select count(distinct *) is running very fast (<1s).
select distinct fieldA is running very fast (<1s).
select distinct  strpos(fieldA , 'xxx') is very slow (between 40 to 400 seconds if mixed with other string functions).
select distinct regexp_like (fieldA , 'xxx') is faster (10 seconds)
select fieldA like 'xxx' from...  is even faster (7 seconds)

It is as if the function strpos was ran on each record, instead of applying the formula on each distinct.

Is it expected?

NB.
We are using Vertica 7.0.0, but we tried also with the last Vertica SP.
We tested it on a 3-nodes cluster; but also on a 1-node.
Projection we were using: AUTO
Size of the compressed data of the column: few KB.


-------- queries I used and some performance results:

-- original query
SELECTSTRPOS("sqlproxy"."PARTNERPROG_WAS",'XXX') > 0
"none_calculation_8970312171248507_nk" FROM 
 "t_bok_all_2v"
 "sqlproxy" GROUP BY 1 ORDER BY 1 ASC
294 seconds to complete

-- query I was expecting the optimizer to run based on statistics
SELECTSTRPOS("sqlproxy"."PARTNERPROG_WAS",'XXX') > 0
"none_calculation_8970312171248507_nk" FROM 
(
select distinct PARTNERPROG_WAS
from "t_bok_all_2v"
) "sqlproxy" GROUP BY 1 ORDER BY 1 ASC
18 seconds to complete

-- query I have to rewrite to:
SELECT("sqlproxy"."PARTNERPROG_WAS" like 'XXX') 
"none_calculation_8970312171248507_nk" FROM 
 "t_bok_all_2v"
 "sqlproxy" GROUP BY 1 ORDER BY 1 ASC
4 seconds to complete



This really sounds like a bug/regression, or am I missing anything?

Comments

  • Hi
    I face similar issue like this , it seems its related to RLE scan , which seems like slow when the data is being manupulate via string function .

    You do not post execution plan , however i was  able to improve the performence using the same method you use . 

    The idea is to avoid RLE scan using string functions , the inline query read the data  and decompress it , the out querty  run the sring function on top of uncompress data . 


  • The queries you are running are semantically different.
    strpos is asking for string containment
    like is actually equivalent to = in this case -- there are no %s in the comparison.
    The latter is much less expensive to compute.

    Vertica does have some specific optimizations for like, especially like with an equality (like 'XXX') or prefix (like 'XXX%') comparison.  Mostly these are for predicate evaluation performance.


  • 1. Would you know how to avoid RLE scan? did you use a specific encoding for it?
    2. I understand the functions are a bit different; ultimately, we are only interested by a flag (string is there: true, or not: false), but we are using Tableau to report the data, and we have no control on how the query is generated. Tableau is using strpos with a case statement and performance is horrible.
    3. Is there any workaround we could implement in the database side to to force vertica to apply the functions strpos on the encoded set, not on the uncompressed ones?

  • Hi

    The query you post are not readable ,Assuming this is your query :

    select f1,f2,,f3 from some_table
    Where  STRPOS("some_tables"."PARTNERPROG_WAS",'XXX') > 0   
    and f2=34234

    try rewrite the query to the below :

    select f1,f2,,f3 from (
    select f1,f2,,f3 ,PARTNERPROG_WAS,row_number() over (partition by null ) as rn 
     from some_table
    Where f2=34234 ) as inline
    where STRPOS(PARTNERPROG_WAS,'XXX') > 0 

    The use of the row_number function is for disable query rewriter by the optimizer , using this method  the data in the inline is materialize and the then you will be able to run the strpos on top of it .

    I hope it will help , i was very helpful on my use case.




  • This idea looks good, but I am not writing the query myself (Tableau is doing that); maybe we can have a way for Tableau to write it that way, or Vertica / Tableau can work on a (long term) fix?
  • Try to imlement the idea in a view and set tableau to use the view .

Leave a Comment

BoldItalicStrikethroughOrdered listUnordered list
Emoji
Image
Align leftAlign centerAlign rightToggle HTML viewToggle full pageToggle lights
Drop image/file