How to guard against EC2 failures / EBS corruption

For folks running on AWS/EC2: from time to time a node fails due to a transient communication error with EBS. This can also leave some data files corrupt, and the Vertica process then requests a restart with forced cleanup.

(1) Is it advisable to change the Vertica daemon script that starts Vertica on host reboot so that it always passes the force flag ("--force"), making a forced restart the default? Can a --force node restart be harmful?

(2) Is it possible to detect that a node is down and trigger a host restart (and a Vertica process restart with forced data file cleanup, as per the point above)?

Any recommendations on how to build a more resilient system on AWS?

Thank you.


    baron42bba (Vertica Customer)
    edited July 2021

    I have been running Vertica Enterprise on AWS/EC2 with EBS for 5 years and haven't seen any corruption.
    Sometimes a node leaves the cluster because of AWS interruptions such as very slow EBS volumes.
    Those show up in the Linux dmesg output (e.g. a 120-second timeout on /dev/xvda4).
    You can gain resilience by using a hot standby node that automatically kicks in if a node has been lost for, e.g., 1 hour.
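
As a rough illustration of the two checks mentioned above, here is a minimal Python sketch. It assumes the slow-EBS events appear as the standard kernel hung-task warning ("blocked for more than 120 seconds") and that your monitoring already records when a node went down. The function names, the sample log text, and the one-hour grace period are hypothetical choices for this example, not part of Vertica or AWS tooling.

```python
import re
from datetime import datetime, timedelta

# Kernel hung-task warning, e.g.
# "INFO: task jbd2/xvda4-8:1234 blocked for more than 120 seconds."
HUNG_TASK = re.compile(
    r"INFO: task (\S+):(\d+) blocked for more than (\d+) seconds"
)


def find_hung_tasks(dmesg_text):
    """Return (task, pid, seconds) for each hung-task warning found."""
    return [
        (m.group(1), int(m.group(2)), int(m.group(3)))
        for m in HUNG_TASK.finditer(dmesg_text)
    ]


def should_fail_over(down_since, now, grace=timedelta(hours=1)):
    """True once a node has been down for at least the grace period.

    down_since is None while the node is up (hypothetical convention).
    """
    return down_since is not None and now - down_since >= grace


# Illustrative input only; not captured from a real system.
SAMPLE = """\
[1234.5] INFO: task jbd2/xvda4-8:1234 blocked for more than 120 seconds.
[1234.6] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
"""

if __name__ == "__main__":
    print(find_hung_tasks(SAMPLE))
    t0 = datetime(2021, 7, 1, 12, 0)
    print(should_fail_over(t0, t0 + timedelta(minutes=90)))
```

In practice the text could come from `subprocess.run(["dmesg"], capture_output=True, text=True).stdout` (this may need root on some systems), and the fail-over decision would feed whatever mechanism you use to bring the standby node in.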
