Sharding in Eon Mode
Many people have asked how sharding works in Eon mode. Let's see if I can make it more clear.
First, if you don't want to know about sharding, you don't need to understand it to effectively use the database. Create tables, projections, load data, run queries - all of this you can do without understanding sharding. You can even apply all your regular Vertica knowledge about when to use segmented vs unsegmented projections, how to co-segment projections for improved join performance, and so on.
Sharding becomes important for understanding and planning for fault tolerance and cluster elasticity. Shards come in two flavors: a replica shard for unsegmented data, and a set of S segment shards for segmented data. Each node subscribes to a set of shards, identifying the subset of the data that the node can serve for queries or write during loads. Node subscriptions can change when nodes come back up after failure, the cluster changes size, or you manually run 'select rebalance_shards();'. Subscription changes are fully online - they can happen with nodes down. The key tables for shards and subscriptions are:
select * from shards;
select * from node_subscriptions;
How does a row get mapped to a given segment shard? Well, one way to think of it is that Vertica has always had sharding, but it was specific to the projection being loaded. Each row would get hashed according to the projection segmentation clause, and the output hash value looked up in the projection's segmentation layout. Each node has a contiguous region of the hash space, so the result is which node the row should end up on. Each projection can have it's own segmentation layout (think mid-rebalance_cluster(), 1/2 the projections could be on 3 nodes and the other on 12 nodes), but in practice almost every projection has exactly the same layout.
The big difference with Eon mode is there is a global segmentation layout for all segmented projections as described by the segment shards. Each shard is responsible for a contiguous region of the hash space. To run a query, we need a node to serve each shard, we can determine eligible nodes by looking at the node subscriptions. For a given session, we pick a node to serve each shard for queries - we call these "participating nodes" for the session. Loads use the same mechanism to determine which nodes will have "DataTarget" nodes to do writes for the tables. The key table here is:
select * from session_subscriptions where is_participating;
Another difference is that in enterprise mode, each node only knew about its own ROS file metadata, whereas in Eon mode, any node subscribed to a shard knows about all the ROS file metadata for the shard. You can see this by looking at the storage_containers, and noting that multiple nodes will show the same storage_id, whereas in enterprise you will never see this. Also note the new column "shard_name" that tells you which shard the container is associated with.
Hope this helps you understand sharding in Vertica's Eon mode.
-Ben
Vertica CTO
Here's a link to our documentation on the topic:https://my.vertica.com/docs/9.0.x/HTML/index.htm#Authoring/Eon/EonConcepts/ShardsAndSubscriptions.htm%3FTocPath%3DUsing%2520Eon%2520Mode%2520Beta%7CEon%2520Database%2520Overview%7C_____2