Vertica node shutting down for no apparent reason
Hello,
I have a 3-node Vertica (7.1.2) cluster. It keeps stopping after a few hours. I noticed that in most cases one of the nodes just stops unexpectedly. The last log messages on that node are:
...
2015-10-03 00:32:41.356 DistCall Dispatch:0x7f03e80021e0-a00000000a98d5 [Txn] <INFO> Starting Commit: Txn: a00000000a98d5 'SELECT ref, campaign_id, creative_id_ext, segment_id, SUM(imps), SUM(clicks), SUM(dwelltime), SUM(v_imps), SUM(start), SUM(midpoint), SUM(first_quartile), SUM(third_quartile), SUM(complete), COUNT(DISTINCT cookie) FROM CAMPAIGN_STAT WHERE campaign_id IN (8446) AND segment_id != '' GROUP BY ref, campaign_id, creative_id_ext, segment_id'
2015-10-03 00:32:41.356 DistCall Dispatch:0x7f03e80021e0 [Txn] <INFO> Commit Complete: Txn: a00000000a98d5 at epoch 0x6135
2015-10-03 00:33:00.074 SystemMonitor:0x7f03dc0153b0 <LOG> @v_admp_node0003: 00000/5075: Total Memory free + cache: 746979328
On the other nodes I see the following at that time:
...
2015-10-03 00:33:29.355 Init Session:0x7fe7d0011fe0-a00000000a98df [Session] <INFO> [Query] TX:a00000000a98df(ip-10-0-0-77.eu-wes-7531:0x16e78) COPY CAMPAIGN_STAT FROM LOCAL '/tmp/suff4096722844163351883pref' ENCLOSED BY '''' NO COMMIT
2015-10-03 00:33:34.436 EEThread:0x7fe7948ad110-a00000000a98c4 [EE] <WARNING> Recv: Message receipt from v_admp_node0003 failed [canceled] handle=MultiplexedRecvHandle (0x7fe82c007ed0) (10.0.0.149:5434)tag 1001 cancelId 0xa00000000a98c5 CANCELED
2015-10-03 00:33:34.444 Spread Client:0x90683a0 [Comms] <INFO> Saw membership message 5120 on V:admp
2015-10-03 00:33:34.444 Spread Client:0x90683a0 [Comms] <INFO> DB Group changed
2015-10-03 00:33:34.448 EEThread:0x7fe7943d8dd0-a00000000a98c4 [EE] <WARNING> Recv: Message receipt from v_admp_node0002 failed [canceled] handle=MultiplexedRecvHandle (0x7fe82c008540) (10.0.0.247:5434)tag 1001 cancelId 0xa00000000a98c5 CANCELED
2015-10-03 00:33:34.448 Spread Client:0x90683a0 [VMPI] <INFO> DistCall: Set current group members called with 2 members
2015-10-03 00:33:34.448 Spread Client:0x90683a0 [Dist] <INFO> Dist::cancelPlan: canceling state NO EXECUTION(0): 0
2015-10-03 00:33:34.449 Spread Client:0x90683a0 [Dist] <INFO> Dist::cancelPlan: canceled state NO EXECUTION(0): 0
2015-10-03 00:33:34.449 Spread Client:0x90683a0 [VMPI] <INFO> Removing 45035996273838762 from list of initialized nodes for session ip-10-0-0-77.eu-wes-7531:0x16d2d
2015-10-03 00:33:34.449 Spread Client:0x90683a0 [Dist] <INFO> Dist::cancelPlan: canceling state EXECUTING(3): a00000000a98c5
2015-10-03 00:33:34.449 Spread Client:0x90683a0 [Dist] <INFO> Dist::cancelPlan: canceled state EXECUTING(3): a00000000a98c5
2015-10-03 00:33:34.449 Spread Client:0x90683a0 [VMPI] <INFO> Removing 45035996273838762 from list of initialized nodes for session ip-10-0-0-77.eu-wes-7531:0x16e3d
2015-10-03 00:33:34.449 Spread Client:0x90683a0 [Dist] <INFO> Dist::cancelPlan: canceling state NO EXECUTION(0): 0
2015-10-03 00:33:34.449 Spread Client:0x90683a0 [Dist] <INFO> Dist::cancelPlan: canceled state NO EXECUTION(0): 0
2015-10-03 00:33:34.449 Spread Client:0x90683a0 [VMPI] <INFO> Removing 45035996273838762 from list of initialized nodes for session ip-10-0-0-77.eu-wes-7531:0x16e42
2015-10-03 00:33:34.449 Spread Client:0x90683a0 [VMPI] <INFO> Removing 45035996273838762 from list of initialized nodes for session ip-10-0-0-77.eu-wes-7531:0x16e48
2015-10-03 00:33:34.449 Spread Client:0x90683a0 [VMPI] <INFO> Removing 45035996273838762 from list of initialized nodes for session ip-10-0-0-77.eu-wes-7531:0x16e57
2015-10-03 00:33:34.449 Spread Client:0x90683a0 [VMPI] <INFO> Removing 45035996273838762 from list of initialized nodes for session ip-10-0-0-77.eu-wes-7531:0x16e78
2015-10-03 00:33:34.449 Spread Client:0x90683a0 [Comms] <INFO> nodeSetNotifier: node v_admp_node0003 left the cluster
2015-10-03 00:33:34.607 Spread Client:0x90683a0 [Recover] <INFO> Node left cluster, reassessing k-safety...
2015-10-03 00:33:34.668 Spread Client:0x90683a0 [Recover] <INFO> Checking Deps:Down bits: 100 Deps:
001 - cnt: 2
010 - cnt: 2
100 - cnt: 2
2015-10-03 00:33:34.668 Spread Client:0x90683a0 [Recover] <INFO> Setting node v_admp_node0001 to UNSAFE
...
Could you please help me resolve this issue?
Comments
Hi
At first glance this looks like a low-memory issue that may have caused the crash. Could you please provide the "/var/log/messages" output along with the complete vertica.log file?
I would also like to know how much RAM you have on each node in the cluster.
Regards
Rahul Choudhary
Hi
You can find the logs (both messages and vertica.log) here:
https://dl.dropboxusercontent.com/u/17529815/vertica/node1.tar.gz
https://dl.dropboxusercontent.com/u/17529815/vertica/node2.tar.gz
https://dl.dropboxusercontent.com/u/17529815/vertica/node3.tar.gz
Regarding memory: I am monitoring free memory, and this is what the memory info looked like 3-4 minutes before the failure:
node1:
node3 (the one that failed):
As you can see, every Vertica node has a little less than 15 GB of RAM.
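For completeness, the check itself is nothing fancy; it is essentially reading /proc/meminfo, along these lines (a minimal sketch assuming a standard Linux /proc/meminfo, values reported in kB):

# Minimal sketch of the free-memory check described above.
# Assumes a standard Linux /proc/meminfo; values there are in kB.
def meminfo():
    info = {}
    with open("/proc/meminfo") as f:
        for line in f:
            key, value = line.split(":", 1)
            info[key.strip()] = int(value.split()[0])  # drop the trailing "kB"
    return info

m = meminfo()
print("MemFree + Cached: %d kB" % (m["MemFree"] + m["Cached"]))
print("SwapFree:         %d kB" % m["SwapFree"])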
Go over your /var/log/messages on the node/host where Vertica keeps going down.
If you find any such output there, then your system ran out of memory and decided that Vertica should be sacrificed to save the day.
Normally the log will contain a report on the status of all processes just before the OOM killer is invoked to kill the process with the highest OOM score (as adjusted by oom_score_adj).
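For example, something along these lines (a minimal sketch; the log path and the search phrases are assumptions, adjust them for your distro) will pull out any OOM-killer activity:

# Minimal sketch: scan /var/log/messages for kernel OOM-killer activity.
# The log path and search phrases are assumptions; adjust for your system.
import re

oom_pattern = re.compile(r"out of memory|oom-killer|killed process", re.IGNORECASE)

with open("/var/log/messages", errors="replace") as log:
    for line in log:
        if oom_pattern.search(line):
            print(line.rstrip())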
At this point looking in your vertica.log file will not help you much.
You need to make sure that you don't run out of resources on the node: either limit Vertica's memory usage to, let's say, 95%, or at least make sure you have plenty of swap.
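If you go the memory-cap route, the cap is usually applied on the GENERAL resource pool. A rough sketch using the vertica-python driver (the connection details are placeholders, and the 95% figure is only the example above, not a recommendation for your cluster):

# Hedged sketch: inspect and cap the GENERAL resource pool so Vertica leaves
# headroom for the OS. Connection details are placeholders to adapt.
# Requires the vertica-python driver (pip install vertica-python).
import vertica_python

conn_info = {
    "host": "10.0.0.149",   # any node of the cluster (placeholder address)
    "port": 5433,
    "user": "dbadmin",
    "password": "...",      # placeholder
    "database": "admp",
}

conn = vertica_python.connect(**conn_info)
try:
    cur = conn.cursor()
    # Current cap on the GENERAL pool (often expressed as a percentage of RAM).
    cur.execute("SELECT name, maxmemorysize FROM resource_pools WHERE name = 'general'")
    print(cur.fetchall())
    # Cap it at, say, 95% of physical memory so the OS keeps some room.
    cur.execute("ALTER RESOURCE POOL general MAXMEMORYSIZE '95%'")
finally:
    conn.close()

Even if you do not change anything, checking the current MAXMEMORYSIZE tells you how much memory the database is allowed to claim compared to what the OS and other processes need.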
Hi,
I've checked the messages log and indeed found "Out of memory". So now I know what the reason was. Thank you for the hint about swap.