How to fend off EC2 failures / EBS corruptions
For folks running on AWS/EC2: from time to time a node fails due to a transient communication error with EBS. This can also leave behind corrupt data files, and the Vertica process then requires a forced-cleanup restart.
(1) Is it advisable to change the Vertica daemon script that starts Vertica after a host reboot so that it always passes the force flag ("--force"), i.e. so a forced restart becomes the default? Can a --force node restart be harmful?
(2) Is it possible to detect that a node is down and trigger a host reboot (plus a Vertica process restart with forced data-file cleanup, as per the point above)? Something like the sketch below is what I have in mind.
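Roughly, I imagine a watchdog along these lines. This is only a hypothetical sketch: the database name, credentials, node-to-instance mapping, sleep times, and the exact vsql/admintools options are my own placeholders and would need to be verified against the Vertica and AWS documentation.

```python
import subprocess
import time

import boto3  # assumes AWS credentials with permission for ec2:RebootInstances

DB_NAME = "mydb"            # placeholder database name
DB_USER = "dbadmin"         # placeholder admin user
DB_PASSWORD = "secret"      # placeholder; better pulled from a secrets store
NODE_TO_INSTANCE = {        # placeholder map: Vertica node name -> EC2 instance id
    "v_mydb_node0001": "i-0123456789abcdef0",
}
NODE_TO_HOST = {            # placeholder map: Vertica node name -> private IP
    "v_mydb_node0001": "10.0.0.11",
}

ec2 = boto3.client("ec2")


def down_nodes():
    """Return node names whose state is not UP, per v_catalog.nodes."""
    out = subprocess.run(
        ["vsql", "-U", DB_USER, "-w", DB_PASSWORD, "-A", "-t",
         "-c", "SELECT node_name, node_state FROM nodes;"],
        capture_output=True, text=True, check=True,
    ).stdout
    return [row.split("|")[0] for row in out.splitlines()
            if row and row.split("|")[1] != "UP"]


def recover(node):
    """Reboot the node's EC2 host, wait, then force-restart the Vertica process."""
    ec2.reboot_instances(InstanceIds=[NODE_TO_INSTANCE[node]])
    time.sleep(300)  # crude wait for the host to come back; tune for your setup
    subprocess.run(
        ["admintools", "-t", "restart_node",
         "-d", DB_NAME, "-p", DB_PASSWORD,
         "-s", NODE_TO_HOST[node], "--force"],
        check=True,
    )


if __name__ == "__main__":
    # run on a surviving cluster host (needs admintools, vsql and DB access)
    while True:
        for node in down_nodes():
            recover(node)
        time.sleep(60)
```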
Any recommendations on how to build a more resilient setup on AWS?
Thank you.
Answers
Running Vertica Enterprise on AWS/EC2 with EBS for 5 years, I haven't seen any corruption.
Sometimes a node leaves the cluster due to AWS interruptions like very slow EBS volumes.
Those show up in the Linux dmesg output (e.g., a 120-second timeout on /dev/xvda4).
You can gain resilience by using a hot standby node that automatically kicks in if a node stays down for, say, 1 hour.
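In Enterprise mode, Vertica's active standby nodes can handle that takeover for you once a standby has been added to the cluster. A minimal sketch of setting the failover delay follows; it assumes the SET_CONFIG_PARAMETER call and the FailoverToStandbyAfter parameter apply to your Vertica version (check the docs for your release), and the credentials are placeholders.

```python
import subprocess

# Assumed: an active standby node has already been added to the cluster
# (see the Vertica docs for your version). This only sets how long a node
# must stay DOWN before the standby takes its place.
subprocess.run(
    ["vsql", "-U", "dbadmin", "-w", "secret",  # placeholder credentials
     "-c", "SELECT SET_CONFIG_PARAMETER('FailoverToStandbyAfter', '1 hour');"],
    check=True,
)
```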