Vertica crashes after a few days of running.
It keeps dying, day after day. What could be the problem?
How can I investigate it?
Last spread log entries before the crash:
[Thu 15 Dec 2016 01:49:48] G_handle_reg_memb: with (10.0.0.58, 1481755789) id
[Thu 15 Dec 2016 01:49:48] G_handle_reg_memb in GTRANS
[Thu 15 Dec 2016 01:49:48] Sess_enable_heartbeats: explict = 1, thresh = 0, heartbeats_on = 0
[Thu 15 Dec 2016 01:49:48] G_handle_kill: #node_a#N010000000058 is killed
[Thu 15 Dec 2016 01:49:48] G_handle_kill in GOP
[Thu 15 Dec 2016 01:49:48] Daemon idle, exiting
Exit caused by Alarm(EXIT)
[Thu 15 Dec 2016 01:49:48] Sess: unlinked domain socket file /tmp/4803; ret=0
Dblog:
Connecting to spread at 4803
Connected to spread on local domain socket /tmp/4803
auto restart closing socket
Starting UDxSideProcess for language C++
with command line: /opt/vertica/bin/vertica-udx-C++ 4 v_event_node0002-68312:0x2 debug-log-off /home/dbadmin/event/v_event_node0002_catalog/UDxLogs 5
12/13/16 12:16:36 SP_connect: unable to connect via UNIX socket to /tmp/4803 (pid=68312): Error: No such file or directory
12/13/16 12:16:36 SP_connect: DEBUG: Auth list is: NULL
12/13/16 12:16:36 SP_connect: connected with private group(21 bytes): #node_a#N010000000058; mbox=13, pid=68312
12/14/16 19:49:31 SP_disconnect: mbox=13, pid=68312, send_group=#node_a#N010000000058
Also, I am using the vkafka connector.
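One way to start investigating is to pull the spread log entries that lead up to the daemon exit after each crash and compare them across crashes. The following is only a minimal sketch, assuming the spread log keeps the `[Thu 15 Dec 2016 01:49:48] message` format shown above and an English locale for day and month names; the function names and the `window` default are hypothetical.

```python
import re
from datetime import datetime

# Matches lines such as "[Thu 15 Dec 2016 01:49:48] Daemon idle, exiting".
LINE_RE = re.compile(r"^\[(?P<ts>[^\]]+)\]\s+(?P<msg>.*)$")


def parse_spread_log(text):
    """Return (timestamp, message) pairs for every timestamped line."""
    entries = []
    for line in text.splitlines():
        match = LINE_RE.match(line.strip())
        if not match:
            # Lines without a timestamp, e.g. "Exit caused by Alarm(EXIT)".
            continue
        ts = datetime.strptime(match.group("ts"), "%a %d %b %Y %H:%M:%S")
        entries.append((ts, match.group("msg")))
    return entries


def context_before_exit(text, window=5):
    """Return up to `window` entries ending at the first daemon exit line."""
    entries = parse_spread_log(text)
    for i, (_, msg) in enumerate(entries):
        if "Daemon idle, exiting" in msg:
            return entries[max(0, i - window + 1): i + 1]
    return []  # no exit line found in this log
```

Calling `context_before_exit(open(path).read())` on the spread log of each node gives the same few lines for every crash, which makes it easier to see whether the `G_handle_kill` message always precedes the exit.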
Comments
K-safe = 1
Can you restart the DB?
Is it going up OK?
Can you connect with ssh to all nodes?
After a restart, everything is fine.
Yes, I can connect to all nodes over SSH.
So, is it solved?
No, Vertica will die again after 1 or 2 days of solid work.
I will check the logs after the next crash.
Is there maybe a way to restart the nodes automatically?
Did you check for any network interruptions during the night? From what I have seen, latency is very important for Vertica.
Will check.
Latency: OK, very good idea.
We have a 10G switch and matching network cards.
I will measure latency and check for dropped packets, if there really are any.
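For the latency and packet-loss check, a small wrapper around `ping` can be run between the nodes overnight. This is a rough sketch, assuming the Linux iputils `ping` summary format (`... 0% packet loss ...` and `rtt min/avg/max/mdev = ... ms`); the function names and the default count are hypothetical.

```python
import re
import subprocess

# Summary lines printed by Linux iputils ping, for example:
#   20 packets transmitted, 20 received, 0% packet loss, time 19018ms
#   rtt min/avg/max/mdev = 0.045/0.052/0.071/0.008 ms
LOSS_RE = re.compile(r"(\d+(?:\.\d+)?)% packet loss")
RTT_RE = re.compile(r"= ([\d.]+)/([\d.]+)/([\d.]+)/([\d.]+) ms")


def parse_ping(output):
    """Extract packet loss and round-trip times from ping summary output."""
    loss = LOSS_RE.search(output)
    rtt = RTT_RE.search(output)
    return {
        "loss_pct": float(loss.group(1)) if loss else None,
        "avg_ms": float(rtt.group(2)) if rtt else None,
        "max_ms": float(rtt.group(3)) if rtt else None,
    }


def ping_host(host, count=20):
    """Ping `host` `count` times and return the parsed summary."""
    proc = subprocess.run(
        ["ping", "-c", str(count), "-q", host],
        capture_output=True,
        text=True,
    )
    return parse_ping(proc.stdout)
```

Running `ping_host` against each private-network address every minute and logging the result with a timestamp makes it possible to compare the numbers with the time of the next crash.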
/opt/vertica/bin/adminTools -t view_cluster -x
DB | Host | State
-------+-----------+-------
mdb | 10.1.0.1 | DOWN
mdb | 10.1.0.2 | DOWN
mdb | 10.1.0.3 | DOWN
And again...
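On the earlier question about restarting automatically: a watchdog could parse the `view_cluster -x` output shown above and pick a recovery command. This is only a sketch under assumptions: that the admintools `start_db` and `restart_node` tools and their `-d` and `-s` options are available in your version, and that no database password is needed (otherwise `-p` would have to be added). The function names are hypothetical. An automatic restart also only hides the crash; it does not explain it.

```python
import subprocess

ADMINTOOLS = "/opt/vertica/bin/adminTools"  # path as used in this thread


def cluster_states(view_cluster_output):
    """Parse `view_cluster -x` output into {db: {host: state}}."""
    states = {}
    for line in view_cluster_output.splitlines():
        parts = [p.strip() for p in line.split("|")]
        # Skip the header row, the "----+----" separator and blank lines.
        if len(parts) != 3 or parts[0] == "DB":
            continue
        db, host, state = parts
        states.setdefault(db, {})[host] = state
    return states


def recovery_command(db, hosts):
    """Return an admintools command for `db`, or None if nothing is DOWN."""
    down = [h for h, s in hosts.items() if s == "DOWN"]
    if not down:
        return None
    if len(down) == len(hosts):
        # Whole database is down: start it (assumed tool: start_db).
        return [ADMINTOOLS, "-t", "start_db", "-d", db]
    # Only some nodes are down (assumed tool: restart_node).
    return [ADMINTOOLS, "-t", "restart_node", "-s", ",".join(down), "-d", db]


def check_and_recover(dry_run=True):
    """Query the cluster state and print (or run) the recovery command."""
    out = subprocess.run(
        [ADMINTOOLS, "-t", "view_cluster", "-x"],
        capture_output=True,
        text=True,
    ).stdout
    for db, hosts in cluster_states(out).items():
        cmd = recovery_command(db, hosts)
        if cmd is None:
            continue
        print("would run:" if dry_run else "running:", " ".join(cmd))
        if not dry_run:
            subprocess.run(cmd, check=False)
```

`check_and_recover()` could be called from cron as the dbadmin user, first with `dry_run=True` to see which command it would choose.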
Did you create /tmp/4803?
Files in /tmp might cause issues!
No, I didn't create the spread file in /tmp myself.
ll /tmp/
total 20
drwxrwxrwt 5 root root 4096 Dec 20 12:20 ./
drwxr-xr-x 22 root root 4096 Sep 28 00:23 ../
drwxrwxrwt 2 root root 4096 Dec 16 01:01 .ICE-unix/
drwx------ 2 root root 4096 Dec 18 02:27 mc-root/
It is empty here.
The node failed again.
This is what I have in spread.log,
and this is what I have in dblog.
Latency is stable.
Network connection is stable too.
Is the disk space OK? Maybe the swap file is taking up the whole disk.
KiB Swap: 31249404 total, 0 used, 31249404 free. 24408440 cached Mem
/dev/sda3 1.8T 23G 1.7T 2% /home/dbadmin
/dev/sda2 197G 12G 176G 7% /
So free space is not a problem.
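Since the disk is only checked by hand after a crash, the same check could be logged continuously so the values just before the next failure are on record. A minimal sketch using only the Python standard library; the function names and the threshold are hypothetical, and the percentage can differ slightly from the `Use%` column of `df`, which excludes reserved blocks.

```python
import shutil


def used_percent(path):
    """Return the used share of the filesystem holding `path`, in percent."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.used / usage.total


def low_space_paths(paths, threshold_pct=90.0):
    """Return the paths whose filesystem is at or above `threshold_pct` used."""
    return [p for p in paths if used_percent(p) >= threshold_pct]
```

For this cluster the paths to watch would be `/` and `/home/dbadmin`, for example `low_space_paths(["/", "/home/dbadmin"])`.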
OK, back to the basics:
What OS do you use?
Which Vertica version?
Do you have 2 separate networks, one for the backend and one for clients?
Can you delete the DB and recreate it? Although I think the cluster is going down for another reason.
Yes, I have a public and a private network.
I receive the Kafka stream over the public network.
The private network is used only for cluster communication.
Proposal:
If you are on a Vertica version above 7.1, use just 1 network! It will be fine!