UNSEGMENTED FACT tables

Hi, I would like to know about cluster best practices when my cluster includes only 2 nodes and every projection is fully mirrored on both nodes -- that is, doing for the FACT tables what we already do for dimension tables (UNSEGMENTED). What are the risks? Are there any customers that use such an implementation? The idea behind this is to minimize the number of servers and the maintenance.

Comments

  • Nimmi_gupta - Employee
    Hi Eli, It depends on the size of the tables you have. What's the size of the tables? Per the documentation, Database Designer does not segment projections for small tables; rather, it replicates them, creating and storing duplicates of these projections on all nodes within the database. Replication ensures distributed query execution across multiple nodes, as well as high availability and recovery. In a K-safe database, replicated projections serve as buddy projections, meaning a replicated projection on any node can be used for recovery. Thanks, Nimmi
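    A replicated (unsegmented) projection of this kind might be declared as follows; the `dim_region` table and its columns are illustrative, not from the thread:

    ```sql
    -- Unsegmented (replicated) projection: a full copy of the data
    -- is created and stored on every node in the cluster.
    CREATE PROJECTION dim_region_rep AS
    SELECT region_id, region_name
    FROM dim_region
    ORDER BY region_id
    UNSEGMENTED ALL NODES;
    ```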
  • Thanks for the reply. I am asking about FACT tables; in most cases they contain a lot of data (very big tables). I would like to know the best practice for FACT table replication (I would like to reduce the number of servers in my cluster to 2 instead of 3). Did anyone implement such a scenario, and what was the impact? As for Database Designer, if it will not do the job I can create the projections manually. Thanks
  • Nimmi_gupta - Employee
    Hi Eli, To make a database K-safe, HP Vertica recommends having at least a 3-node cluster. Since you say the table has a lot of data, you may need to re-think the projection type (unsegmented or segmented). Per the documentation, unsegmented projections are intended only for small tables. The impact would be on performance. Yes, you can create it manually too. Please let me know if you have any questions. Thanks, Nimmi
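    For contrast, a segmented fact projection (the usual choice for large tables) would look roughly like this; `fact_sales` and its columns are hypothetical:

    ```sql
    -- Segmented projection: rows are hash-distributed across nodes,
    -- with KSAFE 1 creating buddy copies for fault tolerance.
    CREATE PROJECTION fact_sales_seg AS
    SELECT sale_id, region_id, amount
    FROM fact_sales
    ORDER BY region_id, sale_id
    SEGMENTED BY HASH(sale_id) ALL NODES
    KSAFE 1;
    ```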
  • Hi, which types of performance issues? In addition, do you know of any customers that use this type of implementation?
  • Hi eli revach, I don't think anyone would want to implement an architecture like that, but you can try it and let us know the result. The performance concern comes from the MPP architecture: with more nodes, performance improves for the same volume of data. Say you have a SQL statement that takes ~10 minutes to execute on one node. It will take ~5 minutes on 2 nodes (assuming the data is segmented across all nodes), and ~2.5 minutes on 4 nodes.
  • SQL response time will never scale exactly linearly with the number of servers. My point is that small implementations can use a minimum number of servers to reduce maintenance and cost; small shops with less traffic can probably benefit from this, and I think such an implementation could be very useful for them. But I am looking for feedback from anyone who has considered or used it.
  • Thanks all for the thoughts! Just to highlight one of them: I do think Nimmi's K-safety point is important to consider. Even if you replicate the data between the two servers, if one node goes down, the whole cluster goes down, and you risk losing data.

    This is because of network partitions: how can one Vertica node tell the difference between "the other node went down" and "the network between the two nodes was cut off"? In the latter case, the outside world might still see both servers, and people might try to keep using (and updating) them independently... Basically, bad things happen. So we don't allow that -- a cluster must always be able to communicate with more than half of its nodes; otherwise it shuts down as a safety precaution.

    I'd think about it this way: could you use 3 cheaper computers instead, for the same price? We probably load-balance better in that case anyway, and make better use of the hardware. Or could you consolidate to one node? That might actually be more reliable than 2 nodes -- either we're up or we're down. You can always pay a little more or a little less for each computer, so the exact number of nodes is fungible.

    We do have people who run 2-node clusters. (Hopefully they'll speak up here and comment on their experiences.) I know they have trouble recovering from server crashes.
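    If you do experiment with a small cluster, one way to see the designed versus actual fault tolerance is to query Vertica's SYSTEM table (shown here only as a sketch):

    ```sql
    -- On a 2-node cluster the achievable K-safety is 0:
    -- losing either node takes the whole cluster down.
    SELECT designed_fault_tolerance, current_fault_tolerance
    FROM system;
    ```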
  • Adam, in the case of a 2-node cluster with UNSEGMENTED fact tables, will the optimizer create a parallel query plan and execution, or should I expect serial execution, so that my cluster mostly serves to scale connections and protect against media failure?
  • Hi Eli, Generally speaking, Vertica uses segmentation to parallelize. So if you don't have segmentation, you don't typically get parallel execution. There are some exceptions and some ways around this. (For example, if you're running lots of queries, you could explicitly have some run against node 1 and others against node 2.) But it's not something that we currently optimize for. Adam
  • (I should note -- in some cases this is an advantage: planning and setting up a distributed query takes time. With "serial execution", you will typically see that Vertica avoids running a distributed query when possible and instead just fetches the answer locally. For very short queries, you can sometimes save more in setup time than you lose in execution time.)
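  • One way to see whether a given query is planned for local or distributed execution is to inspect its plan with EXPLAIN; the query below is illustrative, assuming a hypothetical `fact_sales` table:

    ```sql
    -- The plan output indicates which nodes participate; against purely
    -- unsegmented projections the plan typically executes locally.
    EXPLAIN SELECT COUNT(*) FROM fact_sales;
    ```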
