parsing vertica sql dialect

phil2 · April 2019

Hello

I have a daily datamart calculation routine. There are 300+ datamarts and they depend on each other. And as of now I need to run datamart and query projection_usage to calculate dependencies for a new datamart. But datamarts became too wide and overdependent. So I came up with an idea of column-level dependencies to separate some calculations. And there is no column_usage. So I need to build and maintain sql statement model to analyze column-level dependencies. Or split statements on ddl-only and dml-only, execute all ddl and run analyze on dml. It looks like something vertica already does under the hood.

So can you advise me with projection_usage-like table for columns or can you provide me with sql grammar for vertica dialect? Or maybe there is a parser for vertica sql i did not find

Jim_Knicely · April 2019

What are you looking for? A way to check which columns from a projection were queried?

phil2 · April 2019

Yeap, I'm looking for a way to check columns needed to execute query. I can do this by splitting statements like create local temp table as select on ddl and dml and explaining dml part or by parsing. And with parsing I can do much more. I can get how resulting columns depend on input columns for example.

So I'm wondering if there is projection_usage-like table to get columns for recent queries or if I can somehow get grammar for vertica sql so I can make my own parser and parse queries before running them.

Honestly I have another task to point out to bi analysts what is bad in a query before executing it. So parser for vertica sql dialect is something that would solve both issues.

Jim_Knicely · April 2019

Maybe the data collector table DC_EXPLAIN_PLANS can help?

Quick example:

dbadmin=> SELECT a, b FROM test;
 a | b
---+---
 1 | 2
(1 row)

dbadmin=> SELECT TRIM(BOTH SPLIT_PART(path_line, 'Materialize:', 2)) query_columns FROM dc_explain_plans WHERE INSTR(path_line, 'Filter') = 0 AND INSTR(path_line, 'Materialize:') > 0 AND transaction_id = current_trans_id() ORDER BY time DESC;
                    query_columns
-----------------------------------------------------
 dc_explain_plans."time", dc_explain_plans.path_line
 test.a, test.b
(2 rows)

phil2 · April 2019

Well I can try it. So you suggest me to parse explains instead of queries and correlate all early&late materializations with corresponding join/storage access stages to decode aliases.

phil2 · April 2019

Here's what I got:

  create or replace view column_usage as 
  select time, user_name, session_id, transaction_id, statement_id, request_id,  
    coalesce(proj_name, name_lookup) as proj_name, 
    coalesce(early_materialization, late_materialization) as columns
  from (
    select *, 
      substr(min(lpad(path_id::varchar, 3, '0') || proj_name)
             over(partition by user_name, session_id, transaction_id, statement_id, request_id, coalesce(alias_lookup, proj_alias)
                  order by path_id rows between 1 FOLLOWING and UNBOUNDED FOLLOWING),
             4) as name_lookup
    from (
      select *, regexp_substr(late_materialization, '^[^.]+') as alias_lookup
      from (
        select user_name, session_id, transaction_id, statement_id, request_id, path_id,
          max(length(regexp_substr(path_line, '^([|] )+'))) as path_depth,
          max(regexp_substr(path_line, 'STORAGE ACCESS for (\S+)', 1, 1, '', 1)) as proj_alias,
          max(regexp_substr(path_line, 'Projection: (\S+)', 1, 1, '', 1)) as proj_name,
          max(regexp_substr(path_line, 'Materialize: (([^.]+\.[^, ]+)*)', 1, 1, '', 1)) as early_materialization,
          max(regexp_substr(path_line, 'Materialize at Output: (([^.]+\.[^, ]+)*)', 1, 1, '', 1)) as late_materialization,
          min(time) as time
        from DC_EXPLAIN_PLANS  
        where (path_line like '%Materialize%' or path_line like '%STORAGE ACCESS%' or path_line like '%Projection%') 
          and path_id > -1
        group by 1,2,3,4,5,6
      ) t 
    ) t 
  ) t

And now I know input columns. But I still cannot trace correlation between input and output columns. Any ideas how can I do this?

parsing vertica sql dialect

Comments

Leave a Comment