
Performance at Scale

Navin_C Vertica Customer
Hi All,

The topic of this question is actually a Vertica tag: "Performance at Scale".

My question is regarding the behavior of Vertica in two different scenarios.

Scenario 1
We have a 5-node cluster, each node with 6 TB of disk space and 256 GB of RAM.
1 database
10.5 TB of raw data (from compliance status)
2.5 TB of compressed data

We perform a benchmark test on this database and find our queries take 89 minutes (which is fine, as there are many queries hitting Vertica).

Now we increase the cluster size to 10 nodes and double the data size:
1 database
20 TB of raw data (from compliance status)
5 TB of compressed data

Now when we execute the same queries on Vertica, they take the same 89 minutes.

Note: we replicated the data under different schema names to get to 20 TB.
Previously there were 10 schemas; now there are 20.

All our queries hit the schemas they are meant to hit (both replicated and original).

It is good for us that even with 20 TB the test takes the same time.
Now the question is: how do we explain this behavior of Vertica?

Is it that there is some breaking point after which, once every node is loaded with enough data, performance will be affected?
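
(To put rough numbers on the per-node load in Scenario 1 -- a quick sketch, nothing Vertica-specific, and assuming the data is spread evenly:)

    # Rough per-node arithmetic for Scenario 1 (assumes perfectly even segmentation).
    before = {"nodes": 5, "compressed_tb": 2.5}
    after = {"nodes": 10, "compressed_tb": 5.0}

    for label, c in [("5-node cluster", before), ("10-node cluster", after)]:
        per_node = c["compressed_tb"] / c["nodes"]
        print(f"{label}: {per_node:.2f} TB of compressed data per node")

    # Both print 0.50 TB per node: doubling the nodes and doubling the data keeps
    # the amount of data each node has to scan roughly constant.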

Scenario 2:
We have 10 TB of data on 5 nodes, and our test takes 120 minutes to execute all queries on the database.
Now we increase the size of the cluster from 5 nodes to 10 nodes.
The data is distributed equally among all 10 nodes.

We run the test on the database and it completes in 89 minutes, which is really good.

The run time dropped by about 25% (120 to 89 minutes).
I know this is expected behavior in Vertica because it's MPP, but is there any good reason (explanation) why we are not seeing a 50% reduction, i.e. the run time cut in half, given that we have doubled the nodes?
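
(The arithmetic behind those figures, as a quick sketch -- "ideal" here just means the naive assumption that doubling the nodes halves the run time:)

    # Observed vs. ideal speedup for Scenario 2 (toy arithmetic, not a Vertica model).
    t_5_nodes = 120.0   # minutes on 5 nodes
    t_10_nodes = 89.0   # minutes on 10 nodes

    speedup = t_5_nodes / t_10_nodes          # ~1.35x observed
    time_saved = 1 - t_10_nodes / t_5_nodes   # ~25.8% less wall-clock time
    ideal = t_5_nodes / 2                     # 60 minutes if doubling nodes halved run time

    print(f"speedup: {speedup:.2f}x, time saved: {time_saved:.1%}, ideal run time: {ideal:.0f} min")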

Thanks in Advance

Comments

  • Hi Navin,

    Another round of good questions :-)  Thanks!

    Regarding scaling -- 10 or 20 or 40 or more nodes are certainly doable with essentially linear scale-out.  (I think 10 nodes is actually on the small side for enterprise Vertica deployments today; though I should let actual customers here speak to their own experiences and setups.)

    Vertica aims to provide linear scale-out to arbitrarily large numbers of nodes, defined very specifically as "if you double your cluster size and double your data size, queries take the same amount of time to execute."

    There's actually an important detail about your cluster that you haven't included, and that's the network design.  Vertica can usually use local data when executing queries, so network tends not to be a huge bottleneck.  But in order to have nicely-segmented data for queries, we have to distribute/segment data while loading it.  Clusters with hundreds of servers across many racks and with very large data-ingest rates can hit cross-sectional network-bandwidth limits unless they have a fast network core.  This is the scaling limit that I see most often in practice.
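
    A minimal sketch of the general idea behind hash segmentation (illustrative only -- this is not Vertica's actual hash function or segmentation logic, and the key name and node count are made up):

        import hashlib

        NODES = 10  # hypothetical cluster size

        def node_for(segmentation_key: str) -> int:
            """Map a row to a node by hashing its segmentation key (conceptual sketch)."""
            digest = hashlib.md5(segmentation_key.encode()).hexdigest()
            return int(digest, 16) % NODES

        # With a reasonably unique key, rows spread roughly evenly across nodes,
        # so each node holds (and scans) about 1/NODES of the data.
        rows = [f"customer_{i}" for i in range(100_000)]
        counts = [0] * NODES
        for key in rows:
            counts[node_for(key)] += 1
        print(counts)  # roughly 10,000 rows per node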

    Vertica's theoretical scaling limit is how fast it can send around control messages to get queries running.  We typically use Broadcast UDP for these messages, for efficiency.  In the past, larger clusters with slow or lossy UDP traffic on their network (happens in some cloud or hosted environments) did have serious scaling limitations.  Vertica's Large Cluster functionality can generally be used to clear up these issues.  One more thing to configure, though.


    That's your first question.  To answer your second -- why not linear performance improvements?  There can be lots of reasons for this; it depends on the query, data, and projection design.  Conceptually, though, some algorithms aren't linear.  For example, we store our data in sorted order, so we can find specific values by doing a binary search.  If you're familiar with binary search, you'll know that it's algorithmically very fast (O(log n)); it only gets a little bit slower if you double your data-set size.  Of course, on the flip side, if you cut the data size in half, it doesn't take half the time; it only gets a little bit faster.
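
    To make the binary-search point concrete, here's a small sketch (plain Python, nothing Vertica-specific) counting comparisons as the data size doubles:

        def binary_search_steps(sorted_values, target):
            """Return (found, number_of_comparisons) for a standard binary search."""
            lo, hi, steps = 0, len(sorted_values) - 1, 0
            while lo <= hi:
                steps += 1
                mid = (lo + hi) // 2
                if sorted_values[mid] == target:
                    return True, steps
                elif sorted_values[mid] < target:
                    lo = mid + 1
                else:
                    hi = mid - 1
            return False, steps

        for n in (1_000_000, 2_000_000, 4_000_000):
            data = list(range(n))
            _, steps = binary_search_steps(data, n - 1)  # search for the last element
            print(f"n = {n:>9,}: {steps} comparisons")

        # Doubling n adds roughly one comparison (O(log n)); halving it removes only one.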

    Anyway, hope that helps,

    Adam
  • Navin_C Vertica Customer
    Thanks Adam,

    I really appreciate your answer on this.

    For the first question:

    The data is segmented evenly across all nodes. Also, we have just changed our partitioning from weekly to monthly.

    So can we say this is the rule of thumb in Vertica: "if you double your cluster size and double your data size, queries take the same amount of time to execute"?
    For the second question:
    Does Vertica always do a binary search on its data, or are there other algorithms it uses, depending on the conditions?
  • Hi Navin,

    Glad my answer was helpful!

    For the first question, certainly for big queries on properly-segmented data, that's what we aim for.  If you ever find that it's not the case, let us know / file a bug / etc.

    (It'd be great to hear from actual users on this -- we can run all the tests that we want in the lab; but how do we perform in the field?)

    For the second question, Vertica has all kinds of different algorithms.  (We're a database; that's our job, take your queries and know the best algorithms to run them with.)  Because of our whole data-is-always-sorted thing, binary search is one of the most common algorithms that we use that other databases tend to not use as often.  If you want to get a sense of the various algorithms that are commonly used in databases in general, I'd recommend a good course or textbook covering database internals.  (As distinct from how to use databases -- the two are typically taught very differently.)

    One more example:  Binary search scales better than linearly (O(log n), where O(n) is linear) as data grows (so worse than linearly as data-per-node shrinks / cluster size grows).  Sort goes the other way:  Simple sorting can theoretically never be better than O(n*log n).  Ours is more complicated, but it can still take more than twice as long to sort twice the data.
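
    The same point as back-of-the-envelope arithmetic (using the textbook n*log(n) cost model for comparison sorts, not Vertica's actual sort):

        import math

        def nlogn(n: float) -> float:
            """Textbook comparison-sort cost model: n * log2(n) 'work units'."""
            return n * math.log2(n)

        n = 1_000_000
        ratio_sort = nlogn(2 * n) / nlogn(n)            # ~2.10: doubling data costs a bit more than 2x
        ratio_search = math.log2(2 * n) / math.log2(n)  # ~1.05: binary search barely notices
        print(f"sort cost ratio: {ratio_sort:.2f}, binary-search cost ratio: {ratio_search:.2f}")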

    Adam
