What should my depot size be?
If I have a cluster with 10 nodes and a data size of 50TB, what should the depot size be?
Best Answer
dsprogis Employee
You should size your depot to the actively queried data set (aka "hot data"). So, while you may have a 50T database, you might be actively querying 2T or 5T.
Imagine your 50T represents 2 years of IoT log data. Imagine further that your application is watching IoT sensors, looking for bad patterns or optimizing controls for good patterns. Then you might be looking at only the last 7 days of data. Thus, assuming data arrives at a roughly steady rate, your active data set is 7 days * (50T / (2 years * 365 days/year)) ≈ 480G. Regardless of the number of nodes, this is the portion of the database you would want to cache in depot. 480G is easily managed by a subcluster of 3 nodes, each node having 160G dedicated to depot.
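To make the arithmetic above easy to rerun with your own numbers, here is a minimal Python sketch. The function names are hypothetical (they are not part of any Vertica tooling), and it assumes data volume is spread evenly over the retention period and that 1T = 1000G.

```python
def hot_data_gb(total_tb: float, retention_days: float, active_days: float) -> float:
    """Estimate the actively queried ("hot") data in GB.

    Assumes data volume is spread evenly across the retention period
    and uses decimal units (1 TB = 1000 GB).
    """
    tb_per_day = total_tb / retention_days
    return tb_per_day * active_days * 1000


def depot_per_node_gb(hot_gb: float, nodes: int) -> float:
    """Split the hot data set evenly across the nodes of one subcluster."""
    return hot_gb / nodes


# Numbers from the example: 50T covering 2 years, last 7 days queried, 3 nodes.
hot = hot_data_gb(total_tb=50, retention_days=2 * 365, active_days=7)
per_node = depot_per_node_gb(hot, nodes=3)
print(f"hot data: {hot:.0f} GB, depot per node: {per_node:.0f} GB")
```

This lands on roughly 480G of hot data and 160G of depot per node. Changing active_days gives the figure for a longer lookback window. In practice you would probably add some headroom on top, since this is only the raw data estimate.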
With your other 7 nodes, you might consider dedicating 3 nodes to ETL. These might need more or less depot depending on the complexity of your transformations and on any dimensional or history data needed in the process.
You've got four nodes left, which you might use for ML training to predict the bad patterns or optimizations mentioned earlier. ML training will probably look back in time for patterns over weeks, if not months, so you would want to size the depot for that longer window. You would likely want to use beefier nodes with more memory, and be sure to allocate sufficient TEMP space too. And, because you are not likely training continuously, you can shut these nodes down when not in use.
Lastly, you might be wondering about a shard count that could support subclusters of 3 nodes and 4 nodes. A shard count of 12 could do this: each node in a 3-node subcluster would subscribe to 4 shards, and each node in a 4-node subcluster would subscribe to 3 shards. If you are worried about growth and supporting larger clusters, you might double the shard count to 24.
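If you want to check which subcluster sizes a given shard count divides into evenly, a small Python sketch like the following can help. The helper name is hypothetical, and it only models the simple even split described above (shards divided equally among the nodes of a subcluster), nothing more.

```python
from typing import Optional


def shards_per_node(shard_count: int, subcluster_nodes: int) -> Optional[int]:
    """Return the number of shards each node gets when the shard count
    divides evenly across the subcluster, or None when it does not."""
    if subcluster_nodes <= 0 or shard_count % subcluster_nodes != 0:
        return None
    return shard_count // subcluster_nodes


# Compare the two shard counts discussed above across a few subcluster sizes.
for shard_count in (12, 24):
    for nodes in (3, 4, 6, 8, 12):
        per_node = shards_per_node(shard_count, nodes)
        if per_node is None:
            print(f"{shard_count} shards, {nodes} nodes: uneven split")
        else:
            print(f"{shard_count} shards, {nodes} nodes: {per_node} shards per node")
```

With 12 shards, the 3-node and 4-node subclusters split evenly but an 8-node subcluster does not. With 24 shards, all of the sizes listed here split evenly, which is the sort of room for growth the larger count buys you.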