Vertica crashes after a few days of running.

anton_io · December 2016

Vertica keeps dying day after day. What could the problem be?

How can I investigate the problem?

Last lines of the spread log before the crash:

[Thu 15 Dec 2016 01:49:48] G_handle_reg_memb: with (10.0.0.58, 1481755789) id
[Thu 15 Dec 2016 01:49:48] G_handle_reg_memb in GTRANS
[Thu 15 Dec 2016 01:49:48] Sess_enable_heartbeats: explict = 1, thresh = 0, heartbeats_on = 0
[Thu 15 Dec 2016 01:49:48] G_handle_kill: #node_a#N010000000058 is killed
[Thu 15 Dec 2016 01:49:48] G_handle_kill in GOP
[Thu 15 Dec 2016 01:49:48] Daemon idle, exiting
Exit caused by Alarm(EXIT)
[Thu 15 Dec 2016 01:49:48] Sess: unlinked domain socket file /tmp/4803; ret=0

Dblog:

Connecting to spread at 4803
Connected to spread on local domain socket /tmp/4803
auto restart closing socket
Starting UDxSideProcess for language C++
with command line: /opt/vertica/bin/vertica-udx-C++ 4 v_event_node0002-68312:0x2 debug-log-off /home/dbadmin/event/v_event_node0002_catalog/UDxLogs 5
12/13/16 12:16:36 SP_connect: unable to connect via UNIX socket to /tmp/4803 (pid=68312): Error: No such file or directory
12/13/16 12:16:36 SP_connect: DEBUG: Auth list is: NULL
12/13/16 12:16:36 SP_connect: connected with private group(21 bytes): #node_a#N010000000058; mbox=13, pid=68312
12/14/16 19:49:31 SP_disconnect: mbox=13, pid=68312, send_group=#node_a#N010000000058

Also, we are using the vkafka connector.

anton_io · December 2016

K-safe = 1

ckotsidimos · December 2016

Can you restart the DB?

Does it come up OK?

Can you connect to all nodes over SSH?

anton_io · December 2016

After a restart, everything is fine.

Yes, I can connect to all the nodes.

ckotsidimos · December 2016

So, is it solved?

anton_io · December 2016

No, Vertica will die again after 1 or 2 days of steady work.

I will check the logs after the next crash.

Is there perhaps a way to restart the nodes automatically?

ckotsidimos · December 2016

Did you check for any network interruptions during the night? In my experience, latency is very important for Vertica.

anton_io · December 2016

Will check.

anton_io · December 2016

Latency: OK, very good idea.

We have a 10G switch and the same kind of network card.

I will measure the latency and check for dropped packets, if there really are any.
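
A minimal sketch of how the latency and packet-loss check could be scripted, assuming the Linux iputils `ping` and its usual two summary lines. The function names `parse_ping_summary` and `ping_host` are made up for illustration, and the node addresses mentioned afterwards are the ones from the spread.conf output later in this thread.

```python
import re
import subprocess

# Summary lines printed by the Linux iputils `ping` (assumed format), e.g.:
#   20 packets transmitted, 19 received, 5% packet loss, time 19025ms
#   rtt min/avg/max/mdev = 0.101/0.150/0.320/0.040 ms
LOSS_RE = re.compile(
    r"(\d+) packets transmitted, (\d+) (?:packets )?received.*?([\d.]+)% packet loss"
)
RTT_RE = re.compile(r"rtt min/avg/max/mdev = ([\d.]+)/([\d.]+)/([\d.]+)/([\d.]+) ms")


def parse_ping_summary(output):
    """Return packet counts, loss percentage and RTT statistics found in ping output."""
    result = {}
    loss = LOSS_RE.search(output)
    if loss:
        result["sent"] = int(loss.group(1))
        result["received"] = int(loss.group(2))
        result["loss_pct"] = float(loss.group(3))
    rtt = RTT_RE.search(output)
    if rtt:
        keys = ("rtt_min_ms", "rtt_avg_ms", "rtt_max_ms", "rtt_mdev_ms")
        for key, value in zip(keys, rtt.groups()):
            result[key] = float(value)
    return result


def ping_host(host, count=20):
    """Hypothetical helper: ping one host `count` times and parse the summary."""
    proc = subprocess.run(
        ["ping", "-c", str(count), host], capture_output=True, text=True
    )
    return parse_ping_summary(proc.stdout)
```

Calling `ping_host("10.0.0.58")` from each node against the other two, on a schedule, would give a record to compare against the crash times. A short run like this only catches loss that happens while it is running, so it probably needs to run through the night to be useful.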

anton_io · December 2016

/opt/vertica/bin/adminTools -t view_cluster -x

 DB    | Host      | State
-------+-----------+-------
 mdb   | 10.1.0.1  | DOWN
 mdb   | 10.1.0.2  | DOWN
 mdb   | 10.1.0.3  | DOWN

And again...
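
On the earlier question about noticing a down node automatically: the view_cluster output above is regular enough to parse from a script. This is only a sketch, assuming the table always has the three pipe-separated columns shown in this post; `parse_view_cluster` and `down_hosts` are hypothetical names. What to do once a host is reported DOWN (alert, restart) is deliberately left out, since nothing in this thread settles that.

```python
def parse_view_cluster(output):
    """Parse the DB | Host | State table into (db, host, state) tuples.

    Assumes the three-column layout shown in this thread; blank lines,
    the header row and the dashed separator row are skipped.
    """
    rows = []
    for line in output.splitlines():
        parts = [part.strip() for part in line.split("|")]
        if len(parts) != 3:
            continue  # blank line or the "-----+-----" separator
        if parts == ["DB", "Host", "State"]:
            continue  # header row
        rows.append(tuple(parts))
    return rows


def down_hosts(rows):
    """Return the hosts whose state column reads DOWN."""
    return [host for _db, host, state in rows if state == "DOWN"]
```

The text to parse would come from running `/opt/vertica/bin/adminTools -t view_cluster -x` as in the post, for example from cron, and passing its standard output to `parse_view_cluster`.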

ckotsidimos · December 2016

Do you create /tmp/4803 yourself?

Files in /tmp might cause issues!

anton_io · December 2016

No, I don't create the spread file in /tmp myself.

ll /tmp/
total 20
drwxrwxrwt 5 root root 4096 Dec 20 12:20 ./
drwxr-xr-x 22 root root 4096 Sep 28 00:23 ../
drwxrwxrwt 2 root root 4096 Dec 16 01:01 .ICE-unix/
drwx------ 2 root root 4096 Dec 18 02:27 mc-root/

It is empty here.

A node failed again.

I have this in spread.log:

[Tue 20 Dec 2016 04:39:44] Sess_read: received a heartbeat on 'node_b' ( mailbox 9 )
[Tue 20 Dec 2016 04:39:44] Pushed eviction timeout back 600.000000s
[Tue 20 Dec 2016 04:40:38] Sess_read: failed receiving header on session 9: ret 0: error: No such file or directory
[Tue 20 Dec 2016 04:40:38] Sess_kill: killing session node_b ( mailbox 9 )
[Tue 20 Dec 2016 04:40:38] G_handle_kill: #node_b#N010000000059 is killed
[Tue 20 Dec 2016 04:40:38] G_handle_kill in GOP
[Tue 20 Dec 2016 04:40:38] Daemon idle, exiting
Exit caused by Alarm(EXIT)
[Tue 20 Dec 2016 04:40:38] Sess: unlinked domain socket file /tmp/4803; ret=0
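
To line the spread exits up against other events (network checks, Kafka load, cron jobs), the timestamps can be pulled out of spread.log. A rough sketch, assuming every relevant line starts with a bracketed timestamp in the format shown above and that month and weekday names are in English; `spread_events` and its default search strings are illustrative choices, not anything defined by spread.

```python
import re
from datetime import datetime

# Matches lines such as: [Tue 20 Dec 2016 04:40:38] Daemon idle, exiting
STAMP_RE = re.compile(r"^\[(?P<stamp>[^\]]+)\] (?P<msg>.*)$")


def spread_events(log_text,
                  needles=("Daemon idle, exiting", "Sess_kill", "G_handle_kill:")):
    """Return (datetime, message) pairs for log lines containing any needle."""
    events = []
    for line in log_text.splitlines():
        match = STAMP_RE.match(line.strip())
        if not match:
            continue  # lines without a timestamp, e.g. "Exit caused by Alarm(EXIT)"
        message = match["msg"]
        if any(needle in message for needle in needles):
            when = datetime.strptime(match["stamp"], "%a %d %b %Y %H:%M:%S")
            events.append((when, message))
    return events
```

Run over the spread.log of every node, this gives a list of kill and exit times that can be compared with the dblog timestamps and with whatever was happening on the network at that moment.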

anton_io · December 2016

Conf_load_conf_file: using file: /home/dbadmin/event/v_event_node0001_catalog/spread.conf
Setting active IP version to 0
Successfully configured Segment 0 [10.0.0.57]:4803 with 1 procs:
               N010000000057: 10.0.0.57
Successfully configured Segment 1 [10.0.0.58]:4803 with 1 procs:
               N010000000058: 10.0.0.58
Successfully configured Segment 2 [10.0.0.59]:4803 with 1 procs:
               N010000000059: 10.0.0.59
Connected to spread on local domain socket /tmp/4803
auto restart closing socket
Starting UDxSideProcess for language C++
   with command line:  /opt/vertica/bin/vertica-udx-C++ 4 v_event_node0001-3528:0x2 debug-log-off /home/dbadmin/event/v_event_node0001_catalog/UDxLogs 5
Starting UDxSideProcess for language C++
   with command line:  /opt/vertica/bin/vertica-udx-C++ 4 v_event_node0001-3528:0xe debug-log-off /home/dbadmin/event/v_event_node0001_catalog/UDxLogs 5
*** rdkafka_buf.h:365:rd_kafka_buf_write: assert: rkbuf->rkbuf_wof + len <= rkbuf->rkbuf_size ***
12/19/16 16:53:33 SP_connect: unable to connect via UNIX socket to /tmp/4803 (pid=3528): Error: No such file or directory
12/19/16 16:53:34 SP_connect: DEBUG: Auth list is: NULL
12/19/16 16:53:34 SP_connect: connected with private group(21 bytes): #node_b#N010000000059; mbox=13, pid=3528

This is what I have in dblog.

Latency is stable.

Network connection is stable too.
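
The dblog excerpt above contains an assertion failure from rdkafka_buf.h, which belongs to librdkafka, the Kafka client library. Whether it is related to the node failures is not established here, but it may be worth checking if such lines show up before each crash. A small sketch for finding them, assuming they always have the `*** file:line:function: assert: expression ***` shape seen in that excerpt; `find_assertions` is a hypothetical helper name.

```python
import re

# Matches lines such as:
# *** rdkafka_buf.h:365:rd_kafka_buf_write: assert: <expression> ***
ASSERT_RE = re.compile(
    r"^\*\*\* (?P<file>[^:]+):(?P<line>\d+):(?P<func>[^:]+): "
    r"assert: (?P<expr>.*?) \*\*\*$"
)


def find_assertions(log_text):
    """Return one dict per assertion line found in the given dblog text."""
    hits = []
    for number, line in enumerate(log_text.splitlines(), start=1):
        match = ASSERT_RE.match(line.strip())
        if match:
            hits.append({
                "log_line": number,                 # position inside the dblog
                "file": match["file"],              # source file that asserted
                "source_line": int(match["line"]),
                "function": match["func"],
                "expression": match["expr"],
            })
    return hits
```

If every node's dblog shows the same assertion shortly before it goes down, that would point at the Kafka loading path rather than at spread or the network.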

ckotsidimos · December 2016

Is the disk space OK? Maybe the swap file is taking up the whole disk.

anton_io · December 2016

KiB Swap: 31249404 total, 0 used, 31249404 free. 24408440 cached Mem

/dev/sda3 1.8T 23G 1.7T 2% /home/dbadmin

/dev/sda2 197G 12G 176G 7% /

So we have no problem with free space.

ckotsidimos · December 2016

OK, back to the basics.

What OS do you use?

Which Vertica Version?

Do you have 2 separate networks, one for the backend and one for clients?

Can you delete the DB and recreate it? Although I think that the cluster is going down for another reason.

anton_io · December 2016

Yes, I have a public and a private network.

I receive the Kafka stream over the public network.

The private network is used only for cluster communication.

ckotsidimos · December 2016

Proposal,

If you are on a version above Vertica 7.1, use just 1 network! It will be fine!
