Schema Design Considerations
Hi All,
I'm tasked with migrating our MongoDB database into Vertica for our website analytics.
Presently we have one table (collection, in Mongo parlance) per site: around 40,000 sites and approximately 3 billion rows in total, with the largest site at about 300 million rows. The total row count is growing at about 10% per month. We store a timestamp (int), a timeslice (rounded timestamp, int) and 7 fields for each record (document).
Is it optimal to stick with the one-table-per-site setup?
Is there a more optimal way to set this up?
(Just thinking about 40,000 tables with the same RLE data in them... 5 of the fields have a cardinality of around 200, the other two have cardinalities of roughly 20,000 and 65,000.)
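In Vertica terms, each per-site table currently looks roughly like this (field names and types are placeholders for illustration, not our real ones):

CREATE TABLE site_12345_events (
    event_ts   INTEGER,      -- raw unix timestamp
    timeslice  INTEGER,      -- timestamp rounded down to the slice boundary
    field_a    VARCHAR(64),  -- ~200 distinct values
    field_b    VARCHAR(64),  -- ~200 distinct values
    field_c    VARCHAR(64),  -- ~200 distinct values
    field_d    VARCHAR(64),  -- ~200 distinct values
    field_e    VARCHAR(64),  -- ~200 distinct values
    field_f    VARCHAR(64),  -- ~20,000 distinct values
    field_g    VARCHAR(64)   -- ~65,000 distinct values
);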
I'm not looking for hard and fast advice that I can take to the bank (if you've got it, that's excellent), more "I'd take route {x} for reason {y} and set things up like {z}"... :-)
Primary use case: customers querying their own data (on a per-site basis); a typical query is sketched below.
Secondary use case: on occasion we query across the whole dataset for trends and stats.
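For what it's worth, a typical per-site query is something along these lines (placeholder names as above, purely illustrative):

SELECT timeslice, field_a, COUNT(*) AS hits
FROM site_12345_events
WHERE timeslice BETWEEN 1388534400 AND 1391212800
GROUP BY timeslice, field_a
ORDER BY timeslice;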
Thanks.
Cheers
Richard
Comments
40K tables is not what I would suggest.
Vertica would have to maintain a very large catalog for them, and you would have to maintain 40K tables and, more than that, their projections.
I would go with buckets:
Your biggest customers (say the top 20%) deserve a separate schema/table.
All the rest share one table.
Partition by date for performance and retention.
If you don't have a huge number of distinct values, Vertica will compress the table pretty well.
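A rough sketch of the shared-table route, assuming an added site_id column, an event_date derived from the timestamp at load time, and the placeholder field names from the question (names, types and segmentation are mine, adjust to your data):

CREATE TABLE web_events (
    site_id    INTEGER NOT NULL,
    event_ts   INTEGER NOT NULL,
    event_date DATE NOT NULL,     -- derived from event_ts at load time, drives partitioning
    timeslice  INTEGER NOT NULL,
    field_a    VARCHAR(64),
    field_b    VARCHAR(64),
    field_c    VARCHAR(64),
    field_d    VARCHAR(64),
    field_e    VARCHAR(64),
    field_f    VARCHAR(64),
    field_g    VARCHAR(64)
)
PARTITION BY EXTRACT(year FROM event_date) * 100 + EXTRACT(month FROM event_date);

-- Projection sorted so each site's rows sit together on disk; per-site,
-- time-range queries then scan a narrow slice, and the sorted
-- low-cardinality columns RLE-compress very well.
CREATE PROJECTION web_events_by_site (
    site_id    ENCODING RLE,
    event_date ENCODING RLE,
    timeslice  ENCODING RLE,
    event_ts,
    field_a, field_b, field_c, field_d, field_e, field_f, field_g
) AS
SELECT site_id, event_date, timeslice, event_ts,
       field_a, field_b, field_c, field_d, field_e, field_f, field_g
FROM web_events
ORDER BY site_id, timeslice
SEGMENTED BY HASH(site_id) ALL NODES;

Monthly partitions give you cheap retention (drop old months with DROP_PARTITION), and the ORDER BY site_id, timeslice means a customer's query only touches their own blocks, which gets you most of the isolation of per-site tables without the 40K-table catalog.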
Hope that helps.