
parsing vertica sql dialect

Hello,

I have a daily datamart calculation routine. There are 300+ datamarts, and they depend on each other. Right now, to calculate the dependencies of a new datamart, I have to run it and then query projection_usage. But the datamarts have become too wide and too interdependent, so I came up with the idea of column-level dependencies to separate out some of the calculations. There is no column_usage table, though. So I either need to build and maintain a model of the SQL statements to analyze column-level dependencies, or split each statement into a DDL-only and a DML-only part, execute all the DDL, and explain the DML. It looks like something Vertica already does under the hood.
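
For example (hypothetical table and column names), a statement like

    CREATE LOCAL TEMP TABLE dm_sales_daily AS
      SELECT store_id, SUM(amount) AS total_amount FROM sales GROUP BY store_id;

would be split into a DDL-only shell plus a DML part that can be EXPLAINed without being executed:

    CREATE LOCAL TEMP TABLE dm_sales_daily (store_id INT, total_amount NUMERIC);
    EXPLAIN SELECT store_id, SUM(amount) AS total_amount FROM sales GROUP BY store_id;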

So can you point me to a projection_usage-like table for columns, or provide the SQL grammar for the Vertica dialect? Or maybe there is a parser for Vertica SQL that I did not find?

Comments

  • Jim_Knicely - Administrator

    What are you looking for? A way to check which columns from a projection were queried?

  • Yep, I'm looking for a way to check which columns are needed to execute a query. I can do this by splitting statements like CREATE LOCAL TEMP TABLE ... AS SELECT into DDL and DML parts and explaining the DML part, or by parsing. And with parsing I can do much more: for example, I can derive how the resulting columns depend on the input columns.

    So I'm wondering whether there is a projection_usage-like table I can use to get the columns for recent queries, or whether I can somehow get the grammar for Vertica SQL so I can build my own parser and parse queries before running them.

    Honestly, I have another task as well: pointing out to BI analysts what is bad in a query before it is executed. So a parser for the Vertica SQL dialect would solve both issues.

  • Jim_Knicely - Administrator

    Maybe the data collector table DC_EXPLAIN_PLANS can help?

    Quick example:

    dbadmin=> SELECT a, b FROM test;
     a | b
    ---+---
     1 | 2
    (1 row)
    
    dbadmin=> SELECT TRIM(BOTH SPLIT_PART(path_line, 'Materialize:', 2)) query_columns
    dbadmin->   FROM dc_explain_plans
    dbadmin->  WHERE INSTR(path_line, 'Filter') = 0
    dbadmin->    AND INSTR(path_line, 'Materialize:') > 0
    dbadmin->    AND transaction_id = current_trans_id()
    dbadmin->  ORDER BY time DESC;
                        query_columns
    -----------------------------------------------------
     dc_explain_plans."time", dc_explain_plans.path_line
     test.a, test.b
    (2 rows)
    
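    If you need the columns for a past query rather than the current transaction, one option (just a sketch; it assumes the request text in DC_REQUESTS_ISSUED is distinctive enough to match on, and it may return one row per node) is to look up the query's transaction_id and statement_id first:

    SELECT TRIM(BOTH SPLIT_PART(p.path_line, 'Materialize:', 2)) query_columns
    FROM dc_explain_plans p
    JOIN dc_requests_issued r USING (transaction_id, statement_id)
    WHERE r.request ILIKE '%FROM test%'  -- match the query of interest
      AND INSTR(p.path_line, 'Materialize:') > 0
      AND INSTR(p.path_line, 'Filter') = 0;
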
  • Well, I can try that. So you're suggesting that I parse explain plans instead of the queries themselves, and correlate all the early and late materializations with the corresponding join/storage-access stages to decode the aliases.
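
    Something like this (a quick sketch based on your example) should pull the raw lines I would need to correlate:

    SELECT path_id, path_line
    FROM dc_explain_plans
    WHERE transaction_id = current_trans_id()
      AND (path_line LIKE '%Materialize%' OR path_line LIKE '%STORAGE ACCESS%' OR path_line LIKE '%Projection%')
    ORDER BY path_id;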

  • phil2
    edited April 2019

    Here's what I got:

  -- Map each statement to the projections it read and the columns it
  -- materialized, based on the Materialize / STORAGE ACCESS / Projection
  -- lines that the statement's plan leaves in DC_EXPLAIN_PLANS
  create or replace view column_usage as 
  select time, user_name, session_id, transaction_id, statement_id, request_id,  
    coalesce(proj_name, name_lookup) as proj_name, 
    coalesce(early_materialization, late_materialization) as columns
  from (
    select *, 
      -- name_lookup: late-materialized columns carry only a table alias, so
      -- take the projection name from the nearest following path with the
      -- same alias (path_id is zero-padded so min() picks the nearest path)
      substr(min(lpad(path_id::varchar, 3, '0') || proj_name)
             over(partition by user_name, session_id, transaction_id, statement_id, request_id, coalesce(alias_lookup, proj_alias)
                  order by path_id rows between 1 FOLLOWING and UNBOUNDED FOLLOWING),
             4) as name_lookup
    from (
      -- alias_lookup: the table alias prefix of the late-materialized columns
      select *, regexp_substr(late_materialization, '^[^.]+') as alias_lookup
      from (
        -- collapse each plan path to one row, extracting the projection alias,
        -- the projection name, and the early/late materialized column lists
        select user_name, session_id, transaction_id, statement_id, request_id, path_id,
          max(length(regexp_substr(path_line, '^([|] )+'))) as path_depth,
          max(regexp_substr(path_line, 'STORAGE ACCESS for (\S+)', 1, 1, '', 1)) as proj_alias,
          max(regexp_substr(path_line, 'Projection: (\S+)', 1, 1, '', 1)) as proj_name,
          max(regexp_substr(path_line, 'Materialize: (([^.]+\.[^, ]+)*)', 1, 1, '', 1)) as early_materialization,
          max(regexp_substr(path_line, 'Materialize at Output: (([^.]+\.[^, ]+)*)', 1, 1, '', 1)) as late_materialization,
          min(time) as time
        from DC_EXPLAIN_PLANS  
        where (path_line like '%Materialize%' or path_line like '%STORAGE ACCESS%' or path_line like '%Projection%') 
          and path_id > -1
        group by 1,2,3,4,5,6
      ) t 
    ) t 
  ) t 
    

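    For example, to see the columns touched by the current transaction:

      SELECT proj_name, columns
      FROM column_usage
      WHERE transaction_id = current_trans_id()
      ORDER BY time DESC;
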
    And now I know the input columns. But I still cannot trace the correlation between input and output columns. Any ideas how I can do this?
