SELECT statement performance improvements?
rjs_docs
Employee
Do you plan to improve the performance of SELECT statements on large datasets (rows sent in batches)?
@twall
From Extending Vertica with the Latest Vertica Ecosystem and Open Source Initiatives.
Answers
This is a complicated question with many different answers. The details will depend a lot on the specific problem you are trying to solve.
In terms of specific clients like vertica-python, the priority has been getting the full protocol implemented. There are certainly some performance gains to be had, but getting it functionally complete has been our recent focus.
For lots of use cases there isn't much data movement to and from the client; it is usually best to keep the data in the database when possible. Running COPY commands, querying for summaries and aggregates, etc. doesn't require a lot of data to be processed on the client side. Python and the DB-API standard are not well suited to very large-scale data exports, but if you need to do one, using a runtime like PyPy can really help.
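If you do have to pull a large result set through the client, one common DB-API pattern is to read it in fixed-size batches with `fetchmany` rather than calling `fetchall`. Here is a minimal sketch of that pattern. The helper name `iter_batches`, the table `t`, and the batch size are made up for illustration, and `sqlite3` is used only as a stand-in so the snippet runs without a Vertica cluster.

```python
import sqlite3
from typing import Any, Iterator, List, Sequence


def iter_batches(cursor: Any, sql: str, batch_size: int = 10_000) -> Iterator[List[Sequence[Any]]]:
    """Yield query results as lists of at most batch_size rows.

    Hypothetical helper: works with any DB-API 2.0 cursor that
    supports execute() and fetchmany().
    """
    cursor.execute(sql)
    while True:
        rows = cursor.fetchmany(batch_size)
        if not rows:
            # An empty result means the result set is exhausted.
            break
        yield rows


def demo_batch_sizes() -> List[int]:
    """Run the helper against an in-memory SQLite table (stand-in for Vertica)."""
    conn = sqlite3.connect(":memory:")
    try:
        cur = conn.cursor()
        cur.execute("CREATE TABLE t (id INTEGER)")
        cur.executemany("INSERT INTO t VALUES (?)", [(i,) for i in range(25)])
        # 25 rows read 10 at a time: two full batches and one partial batch.
        return [len(batch) for batch in iter_batches(cur, "SELECT id FROM t ORDER BY id", 10)]
    finally:
        conn.close()
```

With vertica-python, the same helper should accept a cursor obtained from `vertica_python.connect(...)`, since that cursor also exposes `execute` and `fetchmany`. This only bounds how many rows your code handles at once; it does not change the fact that everything still travels over one client connection.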
There are better options, such as exporting to S3, Kafka, or HDFS, that leverage parallelism across the cluster to write to parallel systems, often in column-oriented formats like Parquet. You can drive the execution of those kinds of commands with vertica-python, but with those options data flows out of band at a much greater scale than a single client TCP connection to a single node ever could.
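As a rough sketch of what driving an export from the client can look like: the client only sends an `EXPORT TO PARQUET` statement, and the cluster writes the files itself. The helper names, the bucket path, and the `sales` table below are hypothetical, and the statement uses only the basic `directory` parameter, so check the documentation for your Vertica version before relying on it.

```python
from typing import Any


def build_parquet_export(directory: str, query: str) -> str:
    """Build an EXPORT TO PARQUET statement (hypothetical helper).

    `directory` is the target location, for example an S3 or HDFS URL.
    `query` is the SELECT whose results the cluster should write out;
    it is inserted as-is, so it must come from trusted code.
    """
    # Double any single quotes so the path stays a valid SQL string literal.
    escaped = directory.replace("'", "''")
    return f"EXPORT TO PARQUET(directory = '{escaped}') AS {query}"


def run_export(cursor: Any, directory: str, query: str) -> None:
    """Send the export statement through an open DB-API cursor.

    Only the statement crosses the client connection; the exported
    rows are written by the database nodes, not returned to Python.
    """
    cursor.execute(build_parquet_export(directory, query))


# Example statement for a made-up bucket and table.
statement = build_parquet_export(
    "s3://example-bucket/sales_export", "SELECT * FROM sales"
)
```

Here `run_export` would be called with a cursor from `vertica_python.connect(...)`. The Python process stays small because it never sees the exported rows.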