Sporadic cases when a node is down
Hi all,
I've noticed that occasionally one node of my 3-node cluster goes down, and I can't find the reason (there are no hints in the logs or anywhere else). I've also noticed that once the node is down, I can't restart it from the Management Console; I have to restart the vertica service to get it to start.
Do you know how I can find the reason for this sporadic problem?
Thanks
Konrad
Comments
Can you check the DatabaseHeartbeatInterval parameter?
select get_config_parameter('DatabaseHeartbeatInterval');
Determines the interval (in seconds) at which each node performs a health check and communicates a heartbeat. If a node does not receive a message within five times the specified interval, the node is evicted from the cluster. Setting the interval to 0 disables the feature.
Try disabling it and see if that helps.
select set_config_parameter('DatabaseHeartbeatInterval', 0);
Or you can also try increasing the interval.
select set_config_parameter('DatabaseHeartbeatInterval', 600);
Hi,
Take a look at the file called ErrorReport.txt; it sits in the catalog directory alongside vertica.log.
Linux overcommits memory, so when Vertica uses too much of it, the Linux OOM (out-of-memory) killer will kill the Vertica process.
The error report will capture the state and the query that was the villain in this case.
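To confirm whether the OOM killer was involved, the kernel log can be filtered for its typical messages. This is only a sketch: the function name `filter_oom_events` is made up for illustration, and the exact wording of kernel messages varies between kernel versions, so the pattern may need adjusting.

```shell
#!/usr/bin/env bash
# Keep only lines from stdin that look like OOM-killer activity.
# The three phrases are common in kernel logs but are not guaranteed
# to match every kernel version (assumption).
filter_oom_events() {
    grep -iE 'out of memory|oom-killer|killed process'
}

# Typical use (reading the kernel ring buffer may require root):
#   dmesg | filter_oom_events
```

If a matching line names the vertica process around the time the node went down, memory pressure is a likely cause.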
Also search vertica.log for the strings critical, error, and fatal.
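A quick way to do that search is a case-insensitive grep. This is a minimal sketch; `scan_vertica_log` is a hypothetical helper name, and the log path in the usage comment is a placeholder for your own catalog directory.

```shell
#!/usr/bin/env bash
# Print every line (with its line number) that mentions one of the
# severity strings, ignoring case.
scan_vertica_log() {
    local log_file="$1"
    grep -inE 'critical|error|fatal' "$log_file"
}

# Usage (placeholder path, substitute your catalog directory):
#   scan_vertica_log /path/to/catalog/vertica.log | tail -n 50
```

Piping through `tail` keeps only the most recent matches, which are usually the ones closest to the moment the node went down.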
When you need to bring the node back, do it with the admintools option, and remember that it can sometimes take a while because it has to rsync with all nodes before it comes up.
Thanks guys,
this is something I can start with.
In the meantime I've created a simple script that checks every 2 minutes whether the vertica service is running (technically, I check whether the spread process is present and, if not, I start the verticad service). Maybe going through admintools is better, and I need to check that, but for now my solution is working, so I'll test it for a few days.
Greetings,
Konrad
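For anyone wanting to try the same approach, a watchdog along the lines Konrad describes could look roughly like this. It is a sketch, not his actual script: the process name `spread`, the start command `/etc/init.d/verticad start`, and the function names are all assumptions that may differ on your installation.

```shell
#!/usr/bin/env bash
# Watchdog sketch: start verticad when no spread process is found.
# Both defaults below are assumptions; override them via the environment.
SPREAD_NAME="${SPREAD_NAME:-spread}"
START_CMD="${START_CMD:-/etc/init.d/verticad start}"

# Succeeds if a process with exactly this name exists.
spread_running() {
    pgrep -x "$SPREAD_NAME" > /dev/null
}

check_and_start() {
    if spread_running; then
        echo "spread is running"
    else
        echo "spread not found, starting verticad"
        # Left unquoted on purpose so the command and its arguments split.
        $START_CMD
    fi
}

# Call check_and_start at the end of the script when running it for real.
```

With a call to `check_and_start` added at the end, a crontab entry such as `*/2 * * * * /path/to/watchdog.sh` (placeholder path) would run it every 2 minutes. Note that a watchdog only masks the symptom, so it is still worth checking the logs for the root cause.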