parsing vertica sql dialect
Hello
I have a daily datamart calculation routine. There are 300+ datamarts and they depend on each other. And as of now I need to run datamart and query projection_usage to calculate dependencies for a new datamart. But datamarts became too wide and overdependent. So I came up with an idea of column-level dependencies to separate some calculations. And there is no column_usage. So I need to build and maintain sql statement model to analyze column-level dependencies. Or split statements on ddl-only and dml-only, execute all ddl and run analyze on dml. It looks like something vertica already does under the hood.
So can you advise me with projection_usage-like table for columns or can you provide me with sql grammar for vertica dialect? Or maybe there is a parser for vertica sql i did not find
Comments
What are you looking for? A way to check which columns from a projection were queried?
Yeap, I'm looking for a way to check columns needed to execute query. I can do this by splitting statements like create local temp table as select on ddl and dml and explaining dml part or by parsing. And with parsing I can do much more. I can get how resulting columns depend on input columns for example.
So I'm wondering if there is projection_usage-like table to get columns for recent queries or if I can somehow get grammar for vertica sql so I can make my own parser and parse queries before running them.
Honestly I have another task to point out to bi analysts what is bad in a query before executing it. So parser for vertica sql dialect is something that would solve both issues.
Maybe the data collector table DC_EXPLAIN_PLANS can help?
Quick example:
dbadmin=> SELECT a, b FROM test; a | b ---+--- 1 | 2 (1 row) dbadmin=> SELECT TRIM(BOTH SPLIT_PART(path_line, 'Materialize:', 2)) query_columns FROM dc_explain_plans WHERE INSTR(path_line, 'Filter') = 0 AND INSTR(path_line, 'Materialize:') > 0 AND transaction_id = current_trans_id() ORDER BY time DESC; query_columns ----------------------------------------------------- dc_explain_plans."time", dc_explain_plans.path_line test.a, test.b (2 rows)Well I can try it. So you suggest me to parse explains instead of queries and correlate all early&late materializations with corresponding join/storage access stages to decode aliases.
Here's what I got:
create or replace view column_usage as select time, user_name, session_id, transaction_id, statement_id, request_id, coalesce(proj_name, name_lookup) as proj_name, coalesce(early_materialization, late_materialization) as columns from ( select *, substr(min(lpad(path_id::varchar, 3, '0') || proj_name) over(partition by user_name, session_id, transaction_id, statement_id, request_id, coalesce(alias_lookup, proj_alias) order by path_id rows between 1 FOLLOWING and UNBOUNDED FOLLOWING), 4) as name_lookup from ( select *, regexp_substr(late_materialization, '^[^.]+') as alias_lookup from ( select user_name, session_id, transaction_id, statement_id, request_id, path_id, max(length(regexp_substr(path_line, '^([|] )+'))) as path_depth, max(regexp_substr(path_line, 'STORAGE ACCESS for (\S+)', 1, 1, '', 1)) as proj_alias, max(regexp_substr(path_line, 'Projection: (\S+)', 1, 1, '', 1)) as proj_name, max(regexp_substr(path_line, 'Materialize: (([^.]+\.[^, ]+)*)', 1, 1, '', 1)) as early_materialization, max(regexp_substr(path_line, 'Materialize at Output: (([^.]+\.[^, ]+)*)', 1, 1, '', 1)) as late_materialization, min(time) as time from DC_EXPLAIN_PLANS where (path_line like '%Materialize%' or path_line like '%STORAGE ACCESS%' or path_line like '%Projection%') and path_id > -1 group by 1,2,3,4,5,6 ) t ) t ) tAnd now I know input columns. But I still cannot trace correlation between input and output columns. Any ideas how can I do this?