Optimizing Vertica Schemas for a Low-Latency Application
What best practices are recommended to optimize Vertica for a UI dashboard? We use Vertica to populate a Reporting UI for our clients; it runs about 1 million queries per day against tables with hundreds of billions of rows.
Currently we use 3 fact tables. A common use case is unioning a subset of columns from the 3 tables into a common view.
Ideally, we would like all of our UI-based queries to take <500 ms for a month of data (we store the data at hourly granularity).
What approaches and schemas should I consider at a high level? Do flattened tables fit this type of use case? Are there other schema-level optimizations I should consider? An obvious one is a new table that eliminates the need for unioning. Any other approaches that folks here use, or is Vertica not well suited to this type of use case? Thanks!
Comments
1) Flattened tables make joins faster and save you from coding the joins, but they do not help with unions.
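For example, a minimal sketch of a flattened table, assuming hypothetical fact_flat and dim_client tables:

```sql
-- Flattened table: the dimension lookup is materialized into the fact
-- table via SET USING, so queries avoid the join at run time.
CREATE TABLE fact_flat (
    client_id   INT,
    client_name VARCHAR(64)
        SET USING (SELECT name FROM dim_client WHERE dim_client.id = fact_flat.client_id),
    metric      FLOAT
);

-- Repopulate the SET USING column after new data is loaded:
SELECT REFRESH_COLUMNS('fact_flat', 'client_name', 'REBUILD');
```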
2) If the tables you need to union have partitions and share the same schema structure, you can use COPY_PARTITIONS_TO_TABLE to copy partitions from one table to another. This lightweight partition copy improves performance by initially sharing the same storage between the two tables.
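A sketch of the call, with hypothetical table names and date-based partition keys:

```sql
-- Copy the January partitions from fact_a into fact_combined.
-- Both tables must have identical column and partition definitions.
SELECT COPY_PARTITIONS_TO_TABLE(
    'fact_a',        -- source table
    '2024-01-01',    -- minimum partition key
    '2024-01-31',    -- maximum partition key
    'fact_combined'  -- target table
);
```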
3) If possible, prefer to run calculations once at load time, by adding aggregations to the partitioned base projections or by using live aggregate projections (LAPs), to avoid repeated calculations or full scans at query run time.
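For example, a minimal live aggregate projection, assuming a hypothetical fact_a table with hourly rows:

```sql
-- Aggregation happens once, at load time; dashboard queries then read
-- the rolled-up projection instead of scanning the raw rows.
CREATE PROJECTION fact_a_hourly_agg (client_id, event_hour, metric_sum)
AS SELECT client_id, event_hour, SUM(metric)
   FROM fact_a
   GROUP BY client_id, event_hour
   KSAFE 1;
```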
4) UNION ALL performs better than UNION, since the database server does not have to eliminate duplicate rows, which can be costly.
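For example, with two hypothetical fact tables:

```sql
-- UNION ALL streams both inputs as-is; plain UNION would add a costly
-- duplicate-elimination step on the combined result set.
SELECT client_id, metric FROM fact_a
UNION ALL
SELECT client_id, metric FROM fact_b;
```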
5) If your query works faster without UNION, prefer using OR instead of UNION or UNION ALL where possible.
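A sketch of that rewrite, assuming both branches hit the same hypothetical table:

```sql
-- Instead of:
--   SELECT client_id, metric FROM fact_a WHERE region = 'EU'
--   UNION ALL
--   SELECT client_id, metric FROM fact_a WHERE region = 'US';
-- prefer a single scan with OR:
SELECT client_id, metric
FROM fact_a
WHERE region = 'EU' OR region = 'US';
```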
6) Queries with very large result sets drawn from multiple nodes can be slow.
Even when output is redirected to /dev/null, all data is still routed through one channel (the query initiator).
A possible alternative, which "leaves" the data on the nodes where it originated, is to use a q_result table like the following.
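A sketch of what that could look like (segmentation column and names are hypothetical):

```sql
-- Scratch table whose rows stay on the nodes that produced them.
CREATE TABLE q_result (
    client_id  INT,
    event_hour TIMESTAMP,
    metric_sum FLOAT
)
SEGMENTED BY HASH(client_id) ALL NODES KSAFE 0;

-- DIRECT load: each node writes its own rows locally instead of
-- funneling everything through the query initiator.
INSERT /*+ DIRECT */ INTO q_result
SELECT client_id, event_hour, SUM(metric)
FROM fact_a
GROUP BY client_id, event_hour;
```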
The purpose of the KSAFE 0 at the end is to avoid buddy projection creation.
In addition, if your UNION queries run faster as separate individual queries, create a temp table as shown below to hold the result sets of the individual queries, and then select from the temp table.
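A sketch of that pattern, again with hypothetical names:

```sql
-- Run the individual queries separately into a temp table, then read
-- the combined result once from the temp table.
CREATE LOCAL TEMPORARY TABLE union_result
    ON COMMIT PRESERVE ROWS
    AS SELECT client_id, event_hour, metric
       FROM fact_a
       WHERE event_hour >= '2024-01-01';

INSERT INTO union_result
SELECT client_id, event_hour, metric
FROM fact_b
WHERE event_hour >= '2024-01-01';

SELECT * FROM union_result;
```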
7) Check that you have enough infrastructure so that your concurrent queries will not be queued.
If you have one million queries per 24 hours and the load is spread evenly across the hours, you need enough memory to satisfy about 11.6 queries per second, plus the concurrent ETL activity.
To confirm the above, and to see on average how many queries run per hour and what the average elapsed time is, run a check like the following in vsql:
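Something along these lines, using the v_monitor.QUERY_REQUESTS system table (a sketch; adjust the filters to your needs):

```sql
-- Queries per hour and average elapsed time over the last day.
SELECT DATE_TRUNC('hour', start_timestamp) AS query_hour,
       COUNT(*)                            AS queries,
       AVG(request_duration_ms)            AS avg_elapsed_ms
FROM v_monitor.query_requests
WHERE request_type = 'QUERY'
  AND start_timestamp > NOW() - INTERVAL '1 day'
GROUP BY 1
ORDER BY 1;
```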
8) Then check how much memory your query needs per node, using something like the following:
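For instance, profile the query and then read RESOURCE_ACQUISITIONS (the IDs below are placeholders):

```sql
-- 1) Profile the query; vsql prints a hint containing its
--    transaction_id and statement_id.
PROFILE SELECT client_id, SUM(metric) FROM fact_a GROUP BY client_id;

-- 2) Check per-node memory acquired for that run.
SELECT node_name, pool_name, memory_inuse_kb
FROM v_monitor.resource_acquisitions
WHERE transaction_id = 45035996273705983
  AND statement_id = 1;
```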
Hi @mosheg, this is very helpful! I think I understand all of these except for point 6. Are you saying that for some query, I can materialize its result set into an intermediate table like q_result (or a temp table) and then have the UI select from that table instead?
Do you know of any documentation about this approach, either from Vertica or from other users? Or am I misunderstanding what you mean?
Yes.
Another performance advantage of temp tables is that they require fewer locks.
The approach in (6) is not mentioned in the documentation; I learned it from Maurizio.
OK, we have a training session with him on Tuesday, so I'll ask for more details then. Thank you!