parsing vertica sql dialect

phil2phil2 Registered User


I have a daily datamart calculation routine. There are 300+ datamarts and they depend on each other. And as of now I need to run datamart and query projection_usage to calculate dependencies for a new datamart. But datamarts became too wide and overdependent. So I came up with an idea of column-level dependencies to separate some calculations. And there is no column_usage. So I need to build and maintain sql statement model to analyze column-level dependencies. Or split statements on ddl-only and dml-only, execute all ddl and run analyze on dml. It looks like something vertica already does under the hood.

So can you advise me with projection_usage-like table for columns or can you provide me with sql grammar for vertica dialect? Or maybe there is a parser for vertica sql i did not find


  • Jim_KnicelyJim_Knicely Administrator, Moderator, Employee, Registered User, VerticaExpert

    What are you looking for? A way to check which columns from a projection were queried?

  • phil2phil2 Registered User

    Yeap, I'm looking for a way to check columns needed to execute query. I can do this by splitting statements like create local temp table as select on ddl and dml and explaining dml part or by parsing. And with parsing I can do much more. I can get how resulting columns depend on input columns for example.

    So I'm wondering if there is projection_usage-like table to get columns for recent queries or if I can somehow get grammar for vertica sql so I can make my own parser and parse queries before running them.

    Honestly I have another task to point out to bi analysts what is bad in a query before executing it. So parser for vertica sql dialect is something that would solve both issues.

  • Jim_KnicelyJim_Knicely Administrator, Moderator, Employee, Registered User, VerticaExpert

    Maybe the data collector table DC_EXPLAIN_PLANS can help?

    Quick example:

    dbadmin=> SELECT a, b FROM test;
     a | b
     1 | 2
    (1 row)
    dbadmin=> SELECT TRIM(BOTH SPLIT_PART(path_line, 'Materialize:', 2)) query_columns FROM dc_explain_plans WHERE INSTR(path_line, 'Filter') = 0 AND INSTR(path_line, 'Materialize:') > 0 AND transaction_id = current_trans_id() ORDER BY time DESC;
     dc_explain_plans."time", dc_explain_plans.path_line
     test.a, test.b
    (2 rows)
  • phil2phil2 Registered User

    Well I can try it. So you suggest me to parse explains instead of queries and correlate all early&late materializations with corresponding join/storage access stages to decode aliases.

  • phil2phil2 Registered User
    edited April 24

    Here's what I got:

      create or replace view column_usage as 
      select time, user_name, session_id, transaction_id, statement_id, request_id,  
        coalesce(proj_name, name_lookup) as proj_name, 
        coalesce(early_materialization, late_materialization) as columns
      from (
        select *, 
          substr(min(lpad(path_id::varchar, 3, '0') || proj_name)
                 over(partition by user_name, session_id, transaction_id, statement_id, request_id, coalesce(alias_lookup, proj_alias)
                      order by path_id rows between 1 FOLLOWING and UNBOUNDED FOLLOWING),
                 4) as name_lookup
        from (
          select *, regexp_substr(late_materialization, '^[^.]+') as alias_lookup
          from (
            select user_name, session_id, transaction_id, statement_id, request_id, path_id,
              max(length(regexp_substr(path_line, '^([|] )+'))) as path_depth,
              max(regexp_substr(path_line, 'STORAGE ACCESS for (\S+)', 1, 1, '', 1)) as proj_alias,
              max(regexp_substr(path_line, 'Projection: (\S+)', 1, 1, '', 1)) as proj_name,
              max(regexp_substr(path_line, 'Materialize: (([^.]+\.[^, ]+)*)', 1, 1, '', 1)) as early_materialization,
              max(regexp_substr(path_line, 'Materialize at Output: (([^.]+\.[^, ]+)*)', 1, 1, '', 1)) as late_materialization,
              min(time) as time
            from DC_EXPLAIN_PLANS  
            where (path_line like '%Materialize%' or path_line like '%STORAGE ACCESS%' or path_line like '%Projection%') 
              and path_id > -1
            group by 1,2,3,4,5,6
          ) t 
        ) t 
      ) t 

    And now I know input columns. But I still cannot trace correlation between input and output columns. Any ideas how can I do this?

Leave a Comment

BoldItalicStrikethroughOrdered listUnordered list
Align leftAlign centerAlign rightToggle HTML viewToggle full pageToggle lights
Drop image/file