Vertica node shutting down for no apparent reason
Hello,
I have a 3-node Vertica (7.1.2) cluster. It keeps stopping after a few hours. I noticed that in most cases one of the nodes just stops unexpectedly. The last log messages on that node are:
...
2015-10-03 00:32:41.356 DistCall Dispatch:0x7f03e80021e0-a00000000a98d5 [Txn] <INFO> Starting Commit: Txn: a00000000a98d5 'SELECT ref, campaign_id, creative_id_ext, segment_id, SUM(imps), SUM(clicks), SUM(dwelltime), SUM(v_imps), SUM(start), SUM(midpoint), SUM(first_quartile), SUM(third_quartile), SUM(complete), COUNT(DISTINCT cookie) FROM CAMPAIGN_STAT WHERE campaign_id IN (8446) AND segment_id != '' GROUP BY ref, campaign_id, creative_id_ext, segment_id'
2015-10-03 00:32:41.356 DistCall Dispatch:0x7f03e80021e0 [Txn] <INFO> Commit Complete: Txn: a00000000a98d5 at epoch 0x6135
2015-10-03 00:33:00.074 SystemMonitor:0x7f03dc0153b0 <LOG> @v_admp_node0003: 00000/5075: Total Memory free + cache: 746979328
On the other nodes I see the following at that time:
...
2015-10-03 00:33:29.355 Init Session:0x7fe7d0011fe0-a00000000a98df [Session] <INFO> [Query] TX:a00000000a98df(ip-10-0-0-77.eu-wes-7531:0x16e78) COPY CAMPAIGN_STAT FROM LOCAL '/tmp/suff4096722844163351883pref' ENCLOSED BY '''' NO COMMIT
2015-10-03 00:33:34.436 EEThread:0x7fe7948ad110-a00000000a98c4 [EE] <WARNING> Recv: Message receipt from v_admp_node0003 failed [canceled] handle=MultiplexedRecvHandle (0x7fe82c007ed0) (10.0.0.149:5434)tag 1001 cancelId 0xa00000000a98c5 CANCELED
2015-10-03 00:33:34.444 Spread Client:0x90683a0 [Comms] <INFO> Saw membership message 5120 on V:admp
2015-10-03 00:33:34.444 Spread Client:0x90683a0 [Comms] <INFO> DB Group changed
2015-10-03 00:33:34.448 EEThread:0x7fe7943d8dd0-a00000000a98c4 [EE] <WARNING> Recv: Message receipt from v_admp_node0002 failed [canceled] handle=MultiplexedRecvHandle (0x7fe82c008540) (10.0.0.247:5434)tag 1001 cancelId 0xa00000000a98c5 CANCELED
2015-10-03 00:33:34.448 Spread Client:0x90683a0 [VMPI] <INFO> DistCall: Set current group members called with 2 members
2015-10-03 00:33:34.448 Spread Client:0x90683a0 [Dist] <INFO> Dist::cancelPlan: canceling state NO EXECUTION(0): 0
2015-10-03 00:33:34.449 Spread Client:0x90683a0 [Dist] <INFO> Dist::cancelPlan: canceled state NO EXECUTION(0): 0
2015-10-03 00:33:34.449 Spread Client:0x90683a0 [VMPI] <INFO> Removing 45035996273838762 from list of initialized nodes for session ip-10-0-0-77.eu-wes-7531:0x16d2d
2015-10-03 00:33:34.449 Spread Client:0x90683a0 [Dist] <INFO> Dist::cancelPlan: canceling state EXECUTING(3): a00000000a98c5
2015-10-03 00:33:34.449 Spread Client:0x90683a0 [Dist] <INFO> Dist::cancelPlan: canceled state EXECUTING(3): a00000000a98c5
2015-10-03 00:33:34.449 Spread Client:0x90683a0 [VMPI] <INFO> Removing 45035996273838762 from list of initialized nodes for session ip-10-0-0-77.eu-wes-7531:0x16e3d
2015-10-03 00:33:34.449 Spread Client:0x90683a0 [Dist] <INFO> Dist::cancelPlan: canceling state NO EXECUTION(0): 0
2015-10-03 00:33:34.449 Spread Client:0x90683a0 [Dist] <INFO> Dist::cancelPlan: canceled state NO EXECUTION(0): 0
2015-10-03 00:33:34.449 Spread Client:0x90683a0 [VMPI] <INFO> Removing 45035996273838762 from list of initialized nodes for session ip-10-0-0-77.eu-wes-7531:0x16e42
2015-10-03 00:33:34.449 Spread Client:0x90683a0 [VMPI] <INFO> Removing 45035996273838762 from list of initialized nodes for session ip-10-0-0-77.eu-wes-7531:0x16e48
2015-10-03 00:33:34.449 Spread Client:0x90683a0 [VMPI] <INFO> Removing 45035996273838762 from list of initialized nodes for session ip-10-0-0-77.eu-wes-7531:0x16e57
2015-10-03 00:33:34.449 Spread Client:0x90683a0 [VMPI] <INFO> Removing 45035996273838762 from list of initialized nodes for session ip-10-0-0-77.eu-wes-7531:0x16e78
2015-10-03 00:33:34.449 Spread Client:0x90683a0 [Comms] <INFO> nodeSetNotifier: node v_admp_node0003 left the cluster
2015-10-03 00:33:34.607 Spread Client:0x90683a0 [Recover] <INFO> Node left cluster, reassessing k-safety...
2015-10-03 00:33:34.668 Spread Client:0x90683a0 [Recover] <INFO> Checking Deps:Down bits: 100 Deps:
001 - cnt: 2
010 - cnt: 2
100 - cnt: 2
2015-10-03 00:33:34.668 Spread Client:0x90683a0 [Recover] <INFO> Setting node v_admp_node0001 to UNSAFE
...
Could you please help me resolve this issue?
Comments
Hi
At first glance this looks like a low-memory issue that may have caused the crash. Could you please provide the "/var/log/messages" output along with the complete vertica.log file?
I would also like to know how much RAM you have on each node in the cluster.
Regards
Rahul Choudhary
Hi
You can find the logs (both messages and vertica.log) here:
https://dl.dropboxusercontent.com/u/17529815/vertica/node1.tar.gz
https://dl.dropboxusercontent.com/u/17529815/vertica/node2.tar.gz
https://dl.dropboxusercontent.com/u/17529815/vertica/node3.tar.gz
Regarding memory: I am monitoring free memory, and this is what the memory info looked like 3-4 minutes before the failure:
node1:
node3 (the one that failed):
As you can see, every Vertica node has a little less than 15 GB of RAM.
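For completeness, the check itself is nothing fancy; it is essentially reading /proc/meminfo, along these lines (a minimal sketch assuming a standard Linux /proc/meminfo, values reported in kB):

# Minimal sketch of the free-memory check described above.
# Assumes a standard Linux /proc/meminfo; values there are in kB.
def meminfo():
    info = {}
    with open("/proc/meminfo") as f:
        for line in f:
            key, value = line.split(":", 1)
            info[key.strip()] = int(value.split()[0])  # drop the trailing "kB"
    return info

m = meminfo()
print("MemFree + Cached: %d kB" % (m["MemFree"] + m["Cached"]))
print("SwapFree:         %d kB" % m["SwapFree"])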
Go over your /var/log/messages on the node/host where Vertica keeps going down.
If you find any such output there, then your system ran out of memory and decided that Vertica should be sacrificed to save the day.
Normally the log will contain a report on the status of all processes just before the OOM killer is invoked to kill the process with the highest OOM score (as adjusted by oom_score_adj).
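For example, something along these lines (a minimal sketch; the log path and the search phrases are assumptions, adjust them for your distro) will pull out any OOM-killer activity:

# Minimal sketch: scan /var/log/messages for kernel OOM-killer activity.
# The log path and search phrases are assumptions; adjust for your system.
import re

oom_pattern = re.compile(r"out of memory|oom-killer|killed process", re.IGNORECASE)

with open("/var/log/messages", errors="replace") as log:
    for line in log:
        if oom_pattern.search(line):
            print(line.rstrip())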
At this point looking in your vertica.log file will not help you much.
You need to make sure that you don't run out of resources on the node: either limit Vertica's memory usage to, let's say, 95%, or at least make sure you have plenty of swap.
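If you go the memory-cap route, the cap is usually applied on the GENERAL resource pool. A rough sketch using the vertica-python driver (the connection details are placeholders, and the 95% figure is only the example above, not a recommendation for your cluster):

# Hedged sketch: inspect and cap the GENERAL resource pool so Vertica leaves
# headroom for the OS. Connection details are placeholders to adapt.
# Requires the vertica-python driver (pip install vertica-python).
import vertica_python

conn_info = {
    "host": "10.0.0.149",   # any node of the cluster (placeholder address)
    "port": 5433,
    "user": "dbadmin",
    "password": "...",      # placeholder
    "database": "admp",
}

conn = vertica_python.connect(**conn_info)
try:
    cur = conn.cursor()
    # Current cap on the GENERAL pool (often expressed as a percentage of RAM).
    cur.execute("SELECT name, maxmemorysize FROM resource_pools WHERE name = 'general'")
    print(cur.fetchall())
    # Cap it at, say, 95% of physical memory so the OS keeps some room.
    cur.execute("ALTER RESOURCE POOL general MAXMEMORYSIZE '95%'")
finally:
    conn.close()

Even if you do not change anything, checking the current MAXMEMORYSIZE tells you how much memory the database is allowed to claim compared to what the OS and other processes need.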
Hi,
I've checked the messages log and indeed found "Out of memory". So now I know what the reason was. Thank you for the hint about swap.