Sql on Hadoop

Hi there!


How the runtime performance of Vertica SQL on Hadoop can be compared with Vertica using native storage, given the same amount of nodes and HDD performance in both cases?


Has anyone compared Vertica SQL on Hadoop with Facebook Presto on Hadoop, given the Presto-Workers are running on the same hardware as Vertica nodes, and the same Hadoop cluster?





  • Options

    Hadoop vs native-Vertica performance depends hugely on the queries that you're running, and on the size and performance of your network:  The disadvantage of HDFS is that it doesn't (yet) have strong support for data locality.  If you have to read a table, that's a network operation.  That adds latency (which slows short queries); it also consumes bandwidth (which slows read-intensive queries on networks with limited cross-sectional bandwidth).


    Vertica's internal storage engine works very hard to collocate data within and between different tables such that, given a typical query, all nodes already have all the data that they need locally.  We've studied this a bunch and have various tricks to do it well, many of which are outlined further in our documentation.


    Basically, you can do the math -- how much data has to be moved to answer this query?; how much of a strain is that going to be on your network?  That's a reasonably accurate predictor of the performance difference of "Vertica storing data internally" vs "Vertica storing data in HDFS."  You can make Vertica's internal storage mechanism look arbitrarily good simply by setting up a really slow network.


    I unfortunately don't have numbers comparing with Presto.



Leave a Comment

BoldItalicStrikethroughOrdered listUnordered list
Align leftAlign centerAlign rightToggle HTML viewToggle full pageToggle lights
Drop image/file