Vertica crashes after a few days of running.

Vertica dies every day or two. What could be the problem?

How can I investigate it?

Last log lines from spread before the crash:

 

[Thu 15 Dec 2016 01:49:48] G_handle_reg_memb: with (10.0.0.58, 1481755789) id
[Thu 15 Dec 2016 01:49:48] G_handle_reg_memb in GTRANS
[Thu 15 Dec 2016 01:49:48] Sess_enable_heartbeats: explict = 1, thresh = 0, heartbeats_on = 0
[Thu 15 Dec 2016 01:49:48] G_handle_kill: #node_a#N010000000058 is killed
[Thu 15 Dec 2016 01:49:48] G_handle_kill in GOP
[Thu 15 Dec 2016 01:49:48] Daemon idle, exiting
Exit caused by Alarm(EXIT)
[Thu 15 Dec 2016 01:49:48] Sess: unlinked domain socket file /tmp/4803; ret=0

 

Dblog:

Connecting to spread at 4803
Connected to spread on local domain socket /tmp/4803
auto restart closing socket
Starting UDxSideProcess for language C++
with command line: /opt/vertica/bin/vertica-udx-C++ 4 v_event_node0002-68312:0x2 debug-log-off /home/dbadmin/event/v_event_node0002_catalog/UDxLogs 5
12/13/16 12:16:36 SP_connect: unable to connect via UNIX socket to /tmp/4803 (pid=68312): Error: No such file or directory
12/13/16 12:16:36 SP_connect: DEBUG: Auth list is: NULL
12/13/16 12:16:36 SP_connect: connected with private group(21 bytes): #node_a#N010000000058; mbox=13, pid=68312
12/14/16 19:49:31 SP_disconnect: mbox=13, pid=68312, send_group=#node_a#N010000000058

 

I am also using the vkafka connector.
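
One way to start investigating is to pull the shutdown-related events out of spread.log so that their timestamps can be compared with dblog and the system logs. The sketch below is only an illustration. It assumes the line layout seen in the excerpt above (`[Thu 15 Dec 2016 01:49:48] message`); the function name `find_shutdown_events` and the list of marker strings are hypothetical choices made for this example, not anything provided by Vertica or spread.

```python
import re
from datetime import datetime

# Assumed layout, taken from the excerpt above: "[Thu 15 Dec 2016 01:49:48] message"
LINE_RE = re.compile(
    r"^\[(?P<ts>[A-Za-z]{3} \d{1,2} [A-Za-z]{3} \d{4} \d{2}:\d{2}:\d{2})\] (?P<msg>.*)$"
)

# Hypothetical selection of substrings that mark a killed session or a daemon exit.
MARKERS = ("is killed", "Sess_kill", "Daemon idle, exiting", "failed receiving header")


def find_shutdown_events(lines):
    """Return (timestamp, message) pairs for lines containing one of MARKERS."""
    events = []
    for line in lines:
        match = LINE_RE.match(line.strip())
        if not match:
            continue  # lines such as "Exit caused by Alarm(EXIT)" carry no timestamp
        msg = match.group("msg")
        if any(marker in msg for marker in MARKERS):
            ts = datetime.strptime(match.group("ts"), "%a %d %b %Y %H:%M:%S")
            events.append((ts, msg))
    return events


if __name__ == "__main__":
    # Sample lines copied from the spread log in this post.
    sample = [
        "[Thu 15 Dec 2016 01:49:48] G_handle_reg_memb in GTRANS",
        "[Thu 15 Dec 2016 01:49:48] G_handle_kill: #node_a#N010000000058 is killed",
        "[Thu 15 Dec 2016 01:49:48] Daemon idle, exiting",
        "Exit caused by Alarm(EXIT)",
    ]
    for ts, msg in find_shutdown_events(sample):
        print(ts.isoformat(), msg)
```

To run it on a real log, pass the lines of the file, for example `find_shutdown_events(open(path))`, where `path` points at the spread.log in your catalog directory.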

Comments

  • K-safe = 1

  • Can you restart the DB?

    Is it going up OK?

     

    Can you connect with ssh to all nodes?

  • After a restart everything is fine.

    Yes, I can connect to all nodes over ssh.

  • So, is it solved? 

  • No, Vertica dies again after 1 or 2 days of steady work.

    I will check the logs after the next crash.

    Is there a way to restart nodes automatically?

  • Did you check for any network interruptions during the night? In my experience, latency is very important for Vertica.

  • Will check.

  • Latency: OK, very good idea.

    We have a 10G switch and network cards of the same speed.

    I will measure latency and check whether packets are really being dropped.

  • /opt/vertica/bin/adminTools -t view_cluster -x

     DB  | Host     | State
    -----+----------+-------
     mdb | 10.1.0.1 | DOWN
     mdb | 10.1.0.2 | DOWN
     mdb | 10.1.0.3 | DOWN

    And again...

  • Did you create /tmp/4803 yourself?

    Files in /tmp might cause issues!

  • No, I don't create the spread file in /tmp myself.

    ll /tmp/
    total 20
    drwxrwxrwt 5 root root 4096 Dec 20 12:20 ./
    drwxr-xr-x 22 root root 4096 Sep 28 00:23 ../
    drwxrwxrwt 2 root root 4096 Dec 16 01:01 .ICE-unix/
    drwx------ 2 root root 4096 Dec 18 02:27 mc-root/

    There is no spread file here.

    A node failed again.

    This is what I have in spread.log:

     

    [Tue 20 Dec 2016 04:39:44] Sess_read: received a heartbeat on 'node_b' ( mailbox 9 )
    [Tue 20 Dec 2016 04:39:44] Pushed eviction timeout back 600.000000s
    [Tue 20 Dec 2016 04:40:38] Sess_read: failed receiving header on session 9: ret 0: error: No such file or directory
    [Tue 20 Dec 2016 04:40:38] Sess_kill: killing session node_b ( mailbox 9 )
    [Tue 20 Dec 2016 04:40:38] G_handle_kill: #node_b#N010000000059 is killed
    [Tue 20 Dec 2016 04:40:38] G_handle_kill in GOP
    [Tue 20 Dec 2016 04:40:38] Daemon idle, exiting
    Exit caused by Alarm(EXIT)
    [Tue 20 Dec 2016 04:40:38] Sess: unlinked domain socket file /tmp/4803; ret=0

     

  • Conf_load_conf_file: using file: /home/dbadmin/event/v_event_node0001_catalog/spread.conf
    Setting active IP version to 0
    Successfully configured Segment 0 [10.0.0.57]:4803 with 1 procs:
    N010000000057: 10.0.0.57
    Successfully configured Segment 1 [10.0.0.58]:4803 with 1 procs:
    N010000000058: 10.0.0.58
    Successfully configured Segment 2 [10.0.0.59]:4803 with 1 procs:
    N010000000059: 10.0.0.59
    Connected to spread on local domain socket /tmp/4803
    auto restart closing socket
    Starting UDxSideProcess for language C++
    with command line: /opt/vertica/bin/vertica-udx-C++ 4 v_event_node0001-3528:0x2 debug-log-off /home/dbadmin/event/v_event_node0001_catalog/UDxLogs 5
    Starting UDxSideProcess for language C++
    with command line: /opt/vertica/bin/vertica-udx-C++ 4 v_event_node0001-3528:0xe debug-log-off /home/dbadmin/event/v_event_node0001_catalog/UDxLogs 5
    *** rdkafka_buf.h:365:rd_kafka_buf_write: assert: rkbuf->rkbuf_wof + len <= rkbuf->rkbuf_size ***
    12/19/16 16:53:33 SP_connect: unable to connect via UNIX socket to /tmp/4803 (pid=3528): Error: No such file or directory
    12/19/16 16:53:34 SP_connect: DEBUG: Auth list is: NULL
    12/19/16 16:53:34 SP_connect: connected with private group(21 bytes): #node_b#N010000000059; mbox=13, pid=3528

    That is what I have in dblog.

     

    Latency is stable.

    Network connection is stable too.

  • Is the disk space OK? Maybe the swap file is taking up the whole disk.

  • KiB Swap: 31249404 total,        0 used, 31249404 free. 24408440 cached Mem

    /dev/sda3       1.8T   23G  1.7T   2% /home/dbadmin

    /dev/sda2       197G   12G  176G   7% /

    So free space is not a problem.

  • OK, back to the basics.

    What OS do you use?

    Which Vertica version?

    Do you have 2 separate networks, one for the backend and one for clients?

     

     

     

    Can you delete the db and recreate it? Although I think that the cluster is going down for another reason

     

  • Yes, I have a public and a private network.

    I receive the Kafka stream over the public network.

    The private network is used only for cluster communication.

  • A proposal:

    If you are on a Vertica version above 7.1, use just 1 network! It will be fine!
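
On the earlier question in this thread about reacting to downed nodes automatically: a first step is to detect the DOWN state from a script. The sketch below only parses text in the `DB | Host | State` layout pasted above. The function names `parse_cluster_state`, `down_hosts` and `read_cluster_state` are hypothetical, and what to do once a host is reported DOWN is deliberately left out, since the right restart command depends on your Vertica version and setup.

```python
import subprocess

# Command copied verbatim from this thread; adjust the path if your install differs.
VIEW_CLUSTER_CMD = ["/opt/vertica/bin/adminTools", "-t", "view_cluster", "-x"]


def parse_cluster_state(output):
    """Parse 'DB | Host | State' rows into (db, host, state) tuples."""
    rows = []
    for line in output.splitlines():
        parts = [part.strip() for part in line.split("|")]
        if len(parts) != 3 or not all(parts):
            continue  # skips blank lines and the '----+----' separator
        if parts[0] == "DB" and parts[2] == "State":
            continue  # skips the header row
        rows.append(tuple(parts))
    return rows


def down_hosts(output):
    """Return the hosts whose state column reads DOWN."""
    return [host for _db, host, state in parse_cluster_state(output)
            if state.upper() == "DOWN"]


def read_cluster_state():
    """Run the view_cluster command on this machine and return its text output."""
    result = subprocess.run(VIEW_CLUSTER_CMD, capture_output=True, text=True)
    return result.stdout


if __name__ == "__main__":
    # Sample text in the layout shown in this thread; no command is executed here.
    sample = (
        " DB  | Host     | State\n"
        "-----+----------+-------\n"
        " mdb | 10.1.0.1 | DOWN\n"
        " mdb | 10.1.0.2 | DOWN\n"
        " mdb | 10.1.0.3 | DOWN\n"
    )
    print(down_hosts(sample))
```

Run from cron as dbadmin, something like `down_hosts(read_cluster_state())` could feed an alert. Restarting blindly would hide the underlying cause, so it is probably better to alert first and collect the logs.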
