Different Disk Size (utilized) in cluster

rahul_dhar · July 2013

I have a data repository installed as a cluster with 3 nodes. If I check the output of df -h on each node, the disk usage figures are different: node 1 shows 49 GB, node 2 shows 47 GB, and node 3 shows 48 GB. Is this normal? If yes, what's the reason for the different disk usage figures?

[Deleted User] · July 2013

Hi Rahul,

Yes, it is normal to have a small amount of variation in disk usage between nodes. Vertica doesn't typically keep exactly the same data on each node. (It can't -- part of the purpose of Vertica as a cluster database is to store data that's too big to fit on any one computer.)

You can customize how your tables are split up, but the default and most common split is hash segmentation, which hashes each value and uses the result to figure out what node to store it on. Hash segmentation makes two useful guarantees:

- Equal values are guaranteed to be on the same node, and on a node that can be predicted if you know the value
- Distinct values will tend to be mostly uniformly distributed among nodes

(You can't totally guarantee both predictable distribution and uniform distribution. If we guarantee predictable distribution, then I could take a bunch of values and give Vertica just the ones that belong to node 1; then the storage would be skewed. Hashing just makes it really hard to do this accidentally. To understand how it does this, read up on the math behind hashing.)

Note that hashing doesn't distribute duplicate values. So if, for example, you're segmenting on a column that has a whole bunch of 0's, all of those 0's end up on the same node, and that node may use more disk space. This is called "data skew"; if you know your data has a lot of it, you should think about modifying the segmentation clause for your table to avoid the skew.

Hashing also doesn't take into account Vertica's compression: some sequences of values compress a little better than others, so they may use a little less disk space.

Over millions or billions or more rows, all of these effects average out. But there can still be small differences.

Also: if one node is in the middle of a mergeout (i.e., updating its storage), it can temporarily be using significantly more space. It first merges a bunch of ROS files into a single file, then deletes the original bunch of files. So that particular partition of a table briefly needs twice the storage. (Or less -- part of the reason for merges is that merged containers should be smaller.)

Adam
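To make the hash segmentation and data skew points above concrete, here is a rough Python sketch. It is only an illustration: the function names are made up for this example, and MD5 is used as a stand-in for a stable hash, not as Vertica's actual hash function. It counts how many rows would land on each of 3 nodes, first for all-distinct values and then for a column with a large number of 0's.

```python
import hashlib
from collections import Counter

NUM_NODES = 3  # same node count as the cluster in the question


def node_for(value, num_nodes=NUM_NODES):
    """Pick a node index from a stable hash of the value.

    Hypothetical helper: MD5 is an illustrative stand-in,
    NOT the hash function Vertica uses internally.
    """
    digest = hashlib.md5(str(value).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_nodes


def rows_per_node(values, num_nodes=NUM_NODES):
    """Count how many of the given values hash to each node."""
    counts = Counter(node_for(v, num_nodes) for v in values)
    return [counts.get(n, 0) for n in range(num_nodes)]


# 300,000 distinct values: counts come out close to, but not exactly, equal.
distinct = rows_per_node(range(300_000))

# 200,000 copies of 0 plus 100,000 distinct values:
# every 0 hashes to the same node, so that node holds far more rows.
skewed = rows_per_node([0] * 200_000 + list(range(1, 100_001)))

print("distinct values:", distinct)
print("many zeros:     ", skewed)
```

The first list should show three similar but unequal counts, which is the same kind of small difference as 49 GB / 47 GB / 48 GB. The second should show one node holding most of the rows, which is the data skew case.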
