
Vertica node shutting down for no particular reason

Hello,

 

I have a 3-node Vertica (7.1.2) cluster. It keeps stopping after a few hours. I noticed that in most cases one of the nodes just stops unexpectedly. The last log messages on that node are:

 

...
2015-10-03 00:32:41.356 DistCall Dispatch:0x7f03e80021e0-a00000000a98d5 [Txn] <INFO> Starting Commit: Txn: a00000000a98d5 'SELECT ref, campaign_id, creative_id_ext, segment_id, SUM(imps), SUM(clicks), SUM(dwelltime), SUM(v_imps), SUM(start), SUM(midpoint), SUM(first_quartile), SUM(third_quartile), SUM(complete), COUNT(DISTINCT cookie) FROM CAMPAIGN_STAT WHERE campaign_id IN (8446) AND segment_id != '' GROUP BY ref, campaign_id, creative_id_ext, segment_id'
2015-10-03 00:32:41.356 DistCall Dispatch:0x7f03e80021e0 [Txn] <INFO> Commit Complete: Txn: a00000000a98d5 at epoch 0x6135
2015-10-03 00:33:00.074 SystemMonitor:0x7f03dc0153b0 <LOG> @v_admp_node0003: 00000/5075: Total Memory free + cache: 746979328

On the other nodes I see the following at that time:

 

...
2015-10-03 00:33:29.355 Init Session:0x7fe7d0011fe0-a00000000a98df [Session] <INFO> [Query] TX:a00000000a98df(ip-10-0-0-77.eu-wes-7531:0x16e78) COPY CAMPAIGN_STAT FROM LOCAL '/tmp/suff4096722844163351883pref' ENCLOSED BY '''' NO COMMIT
2015-10-03 00:33:34.436 EEThread:0x7fe7948ad110-a00000000a98c4 [EE] <WARNING> Recv: Message receipt from v_admp_node0003 failed [canceled] handle=MultiplexedRecvHandle (0x7fe82c007ed0) (10.0.0.149:5434)tag 1001 cancelId 0xa00000000a98c5 CANCELED
2015-10-03 00:33:34.444 Spread Client:0x90683a0 [Comms] <INFO> Saw membership message 5120 on V:admp
2015-10-03 00:33:34.444 Spread Client:0x90683a0 [Comms] <INFO> DB Group changed
2015-10-03 00:33:34.448 EEThread:0x7fe7943d8dd0-a00000000a98c4 [EE] <WARNING> Recv: Message receipt from v_admp_node0002 failed [canceled] handle=MultiplexedRecvHandle (0x7fe82c008540) (10.0.0.247:5434)tag 1001 cancelId 0xa00000000a98c5 CANCELED
2015-10-03 00:33:34.448 Spread Client:0x90683a0 [VMPI] <INFO> DistCall: Set current group members called with 2 members
2015-10-03 00:33:34.448 Spread Client:0x90683a0 [Dist] <INFO> Dist::cancelPlan: canceling state NO EXECUTION(0): 0
2015-10-03 00:33:34.449 Spread Client:0x90683a0 [Dist] <INFO> Dist::cancelPlan: canceled state NO EXECUTION(0): 0
2015-10-03 00:33:34.449 Spread Client:0x90683a0 [VMPI] <INFO> Removing 45035996273838762 from list of initialized nodes for session ip-10-0-0-77.eu-wes-7531:0x16d2d
2015-10-03 00:33:34.449 Spread Client:0x90683a0 [Dist] <INFO> Dist::cancelPlan: canceling state EXECUTING(3): a00000000a98c5
2015-10-03 00:33:34.449 Spread Client:0x90683a0 [Dist] <INFO> Dist::cancelPlan: canceled state EXECUTING(3): a00000000a98c5
2015-10-03 00:33:34.449 Spread Client:0x90683a0 [VMPI] <INFO> Removing 45035996273838762 from list of initialized nodes for session ip-10-0-0-77.eu-wes-7531:0x16e3d
2015-10-03 00:33:34.449 Spread Client:0x90683a0 [Dist] <INFO> Dist::cancelPlan: canceling state NO EXECUTION(0): 0
2015-10-03 00:33:34.449 Spread Client:0x90683a0 [Dist] <INFO> Dist::cancelPlan: canceled state NO EXECUTION(0): 0
2015-10-03 00:33:34.449 Spread Client:0x90683a0 [VMPI] <INFO> Removing 45035996273838762 from list of initialized nodes for session ip-10-0-0-77.eu-wes-7531:0x16e42
2015-10-03 00:33:34.449 Spread Client:0x90683a0 [VMPI] <INFO> Removing 45035996273838762 from list of initialized nodes for session ip-10-0-0-77.eu-wes-7531:0x16e48
2015-10-03 00:33:34.449 Spread Client:0x90683a0 [VMPI] <INFO> Removing 45035996273838762 from list of initialized nodes for session ip-10-0-0-77.eu-wes-7531:0x16e57
2015-10-03 00:33:34.449 Spread Client:0x90683a0 [VMPI] <INFO> Removing 45035996273838762 from list of initialized nodes for session ip-10-0-0-77.eu-wes-7531:0x16e78
2015-10-03 00:33:34.449 Spread Client:0x90683a0 [Comms] <INFO> nodeSetNotifier: node v_admp_node0003 left the cluster
2015-10-03 00:33:34.607 Spread Client:0x90683a0 [Recover] <INFO> Node left cluster, reassessing k-safety...
2015-10-03 00:33:34.668 Spread Client:0x90683a0 [Recover] <INFO> Checking Deps:Down bits: 100 Deps:
001 - cnt: 2
010 - cnt: 2
100 - cnt: 2
2015-10-03 00:33:34.668 Spread Client:0x90683a0 [Recover] <INFO> Setting node v_admp_node0001 to UNSAFE
...

Could you please help me resolve this issue?

Comments

  • Hi

     

    At first glance this seems to be a low-memory issue that might have caused the crash. Could you please provide the "/var/log/messages" output along with the complete vertica.log file?

     

    Also, how much RAM do you have on each node in the cluster?
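
    A quick way to gather both files plus the RAM figure might look like this (a sketch; the vertica.log path depends on your catalog location, so adjust it):

    # Hypothetical catalog path -- adjust to wherever your database keeps vertica.log
    tar czf node_logs.tar.gz /var/log/messages /home/dbadmin/admp/v_admp_node0001_catalog/vertica.log
    # Report RAM and swap in MB
    free -m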

     

     

    Regards

    Rahul Choudhary

  • Hi

     

    You can find the logs (both messages and vertica.log) here:

    https://dl.dropboxusercontent.com/u/17529815/vertica/node1.tar.gz

    https://dl.dropboxusercontent.com/u/17529815/vertica/node2.tar.gz

    https://dl.dropboxusercontent.com/u/17529815/vertica/node3.tar.gz

     

    Regarding memory: I am monitoring free memory, and this is what the memory info looked like 3-4 minutes before the failure:

    node1:

                  total        used        free      shared  buff/cache   available
    Mem:          14622        7040        5544          26        2037        7320
    Swap:          2047         251        1796

    node3 (the one that failed):

                  total        used        free      shared  buff/cache   available
    Mem:          14622        7082        7411           9         128        7379
    Swap:          2047         244        1803

    As you can see, every Vertica node has a little less than 15 GB of RAM.
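
    (A minimal sketch of the kind of monitoring loop used for this -- the interval and log path here are arbitrary choices:)

    # Log a timestamped free -m snapshot once a minute
    while true; do
        date >> /tmp/mem_monitor.log
        free -m >> /tmp/mem_monitor.log
        sleep 60
    done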

  •  Go over your /var/log/messages on the node/host where Vertica keeps going down.

    cat /var/log/messages | grep "Out of memory"

    If you get any output, then your system ran out of memory and decided that Vertica should be sacrificed :) to save the day.

    Normally in the log you will get a report on the status of all processes before the OOM killer is invoked to kill the process with the highest oom_score_adj score.

    At this point, looking in your vertica.log file will not help you much.

    You need to make sure that you don't run out of resources on the node: limit Vertica's memory usage to, say, 95%, or make sure you at least have plenty of swap.
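
    For example, something along these lines (a sketch; the 75% cap is just an illustration -- size it to your workload):

    # Check whether the OOM killer targeted the Vertica process on the failed node
    grep -i "out of memory" /var/log/messages
    dmesg | grep -i "killed process"

    # Optionally cap how much RAM Vertica's GENERAL resource pool may claim,
    # leaving headroom for the OS and other processes
    vsql -c "ALTER RESOURCE POOL general MAXMEMORYSIZE '75%';"
    vsql -c "SELECT name, maxmemorysize FROM resource_pools WHERE name = 'general';"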

     

     

  • Hi,

     

    I've checked the messages log and indeed found "Out of memory". So now I know what the reason was. Thank you for the hint about swap.
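
    (In case it helps anyone else, adding extra swap is a quick stopgap -- a minimal sketch, with the size and path being arbitrary examples:)

    # Create and enable a 4 GB swap file
    sudo fallocate -l 4G /swapfile
    sudo chmod 600 /swapfile
    sudo mkswap /swapfile
    sudo swapon /swapfile
    # Persist it across reboots
    echo '/swapfile none swap sw 0 0' | sudo tee -a /etc/fstab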
