vertica node shutting down with no particular reason

Hello,

 

I have 3 nodes vertica (7.1.2) cluster. It stops after few hours constantly. I noticed that in most cases one of the node just stops unexpectedly. Last log messages on that node are:

 

...
2015-10-03 00:32:41.356 DistCall Dispatch:0x7f03e80021e0-a00000000a98d5 [Txn] <INFO> Starting Commit: Txn: a00000000a98d5 'SELECT ref, campaign_id, creative_id_ext, segment_id, SUM(imps), SUM(clicks), SUM(dwelltime), SUM(v_imps), SUM(start), SUM(midpoint), SUM(first_quartile), SUM(third_quartile), SUM(complete), COUNT(DISTINCT cookie) FROM CAMPAIGN_STAT WHERE campaign_id IN (8446) AND segment_id != '' GROUP BY ref, campaign_id, creative_id_ext, segment_id'
2015-10-03 00:32:41.356 DistCall Dispatch:0x7f03e80021e0 [Txn] <INFO> Commit Complete: Txn: a00000000a98d5 at epoch 0x6135
2015-10-03 00:33:00.074 SystemMonitor:0x7f03dc0153b0 <LOG> @v_admp_node0003: 00000/5075: Total Memory free + cache: 746979328

On other nodes I see following at that time:

 

...
2015-10-03 00:33:29.355 Init Session:0x7fe7d0011fe0-a00000000a98df [Session] <INFO> [Query] TX:a00000000a98df(ip-10-0-0-77.eu-wes-7531:0x16e78) COPY CAMPAIGN_STAT FROM LOCAL '/tmp/suff4096722844163351883pref' ENCLOSED BY '''' NO COMMIT
2015-10-03 00:33:34.436 EEThread:0x7fe7948ad110-a00000000a98c4 [EE] <WARNING> Recv: Message receipt from v_admp_node0003 failed [canceled] handle=MultiplexedRecvHandle (0x7fe82c007ed0) (10.0.0.149:5434)tag 1001 cancelId 0xa00000000a98c5 CANCELED
2015-10-03 00:33:34.444 Spread Client:0x90683a0 [Comms] <INFO> Saw membership message 5120 on V:admp
2015-10-03 00:33:34.444 Spread Client:0x90683a0 [Comms] <INFO> DB Group changed
2015-10-03 00:33:34.448 EEThread:0x7fe7943d8dd0-a00000000a98c4 [EE] <WARNING> Recv: Message receipt from v_admp_node0002 failed [canceled] handle=MultiplexedRecvHandle (0x7fe82c008540) (10.0.0.247:5434)tag 1001 cancelId 0xa00000000a98c5 CANCELED
2015-10-03 00:33:34.448 Spread Client:0x90683a0 [VMPI] <INFO> DistCall: Set current group members called with 2 members
2015-10-03 00:33:34.448 Spread Client:0x90683a0 [Dist] <INFO> Dist::cancelPlan: canceling state NO EXECUTION(0): 0
2015-10-03 00:33:34.449 Spread Client:0x90683a0 [Dist] <INFO> Dist::cancelPlan: canceled state NO EXECUTION(0): 0
2015-10-03 00:33:34.449 Spread Client:0x90683a0 [VMPI] <INFO> Removing 45035996273838762 from list of initialized nodes for session ip-10-0-0-77.eu-wes-7531:0x16d2d
2015-10-03 00:33:34.449 Spread Client:0x90683a0 [Dist] <INFO> Dist::cancelPlan: canceling state EXECUTING(3): a00000000a98c5
2015-10-03 00:33:34.449 Spread Client:0x90683a0 [Dist] <INFO> Dist::cancelPlan: canceled state EXECUTING(3): a00000000a98c5
2015-10-03 00:33:34.449 Spread Client:0x90683a0 [VMPI] <INFO> Removing 45035996273838762 from list of initialized nodes for session ip-10-0-0-77.eu-wes-7531:0x16e3d
2015-10-03 00:33:34.449 Spread Client:0x90683a0 [Dist] <INFO> Dist::cancelPlan: canceling state NO EXECUTION(0): 0
2015-10-03 00:33:34.449 Spread Client:0x90683a0 [Dist] <INFO> Dist::cancelPlan: canceled state NO EXECUTION(0): 0
2015-10-03 00:33:34.449 Spread Client:0x90683a0 [VMPI] <INFO> Removing 45035996273838762 from list of initialized nodes for session ip-10-0-0-77.eu-wes-7531:0x16e42
2015-10-03 00:33:34.449 Spread Client:0x90683a0 [VMPI] <INFO> Removing 45035996273838762 from list of initialized nodes for session ip-10-0-0-77.eu-wes-7531:0x16e48
2015-10-03 00:33:34.449 Spread Client:0x90683a0 [VMPI] <INFO> Removing 45035996273838762 from list of initialized nodes for session ip-10-0-0-77.eu-wes-7531:0x16e57
2015-10-03 00:33:34.449 Spread Client:0x90683a0 [VMPI] <INFO> Removing 45035996273838762 from list of initialized nodes for session ip-10-0-0-77.eu-wes-7531:0x16e78
2015-10-03 00:33:34.449 Spread Client:0x90683a0 [Comms] <INFO> nodeSetNotifier: node v_admp_node0003 left the cluster
2015-10-03 00:33:34.607 Spread Client:0x90683a0 [Recover] <INFO> Node left cluster, reassessing k-safety...
2015-10-03 00:33:34.668 Spread Client:0x90683a0 [Recover] <INFO> Checking Deps:Down bits: 100 Deps:
001 - cnt: 2
010 - cnt: 2
100 - cnt: 2
2015-10-03 00:33:34.668 Spread Client:0x90683a0 [Recover] <INFO> Setting node v_admp_node0001 to UNSAFE
...

Could you please help me with resolving this issue.

Comments

  • Hi

     

    From the first look it seems to be low memory related issue that might have caused this crash.Request you to kindly provide the "/var/log/messages" output along with the complete vertica.log file.

     

    Also wanted to know how much RAM you have on each node in the cluster.

     

     

    Regards

    Rahul Choudhary

  • Hi

     

    You can find logs (messages and vertica.log both) here:

    https://dl.dropboxusercontent.com/u/17529815/vertica/node1.tar.gz

    https://dl.dropboxusercontent.com/u/17529815/vertica/node2.tar.gz

    https://dl.dropboxusercontent.com/u/17529815/vertica/node3.tar.gz

     

    Regarding memory. I am monitoring free memory and see what was memory info 3-4 minutes before failure:

    node1:

                  total        used        free      shared  buff/cache   available
    Mem: 14622 7040 5544 26 2037 7320
    Swap: 2047 251 1796

    node3 (the one that failed):

                  total        used        free      shared  buff/cache   available
    Mem: 14622 7082 7411 9 128 7379
    Swap: 2047 244 1803

    As you see every vertica node has a little bit less than 15Gb of RAM

  •  Go over your /var/log/messages on the node/host where Vertica keeps going down.

    cat /var/log/messages| grep "Out of memory"

     If you get any output then your system run out of memory and decided that Vertica should be sacrificed :), do save the day.

     Normally in the log you will get a report on the stauts of all process before the OOM killer is invoked to kill the proccess based on the highest oom_score_adj score.

     At this point looking in your vertica.log file will not help you much. 

    You need ot make sure that you don`t run out of resources in the node or limit the memory ussage to a 95 % lets say or make sure you at least have plenty swap.

     

     

  • Hi,

     

    I've checked messages log and found "Out of memory" indeed. So now I know what was the reason. Thank you for the hint about swap.

Leave a Comment

BoldItalicStrikethroughOrdered listUnordered list
Emoji
Image
Align leftAlign centerAlign rightToggle HTML viewToggle full pageToggle lights
Drop image/file