


Vertica crashed after a few days of running.

Vertica keeps dying day after day. What could be the problem?

How can I investigate it?
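For a first pass, a minimal sketch of where to look, assuming the default catalog layout that appears in the logs below (adjust the path and node name per host):

# Main per-node log: check the last entries written before the crash
tail -n 200 /home/dbadmin/event/v_event_node0002_catalog/vertica.log

# Search the log for panics, asserts, and fatal errors
grep -iE 'panic|assert|fatal' /home/dbadmin/event/v_event_node0002_catalog/vertica.log

# Collect a full diagnostics bundle for deeper analysis (ships with Vertica)
/opt/vertica/bin/scrutinize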

Last log from spread before the crash:

[Thu 15 Dec 2016 01:49:48] G_handle_reg_memb: with (10.0.0.58, 1481755789) id
[Thu 15 Dec 2016 01:49:48] G_handle_reg_memb in GTRANS
[Thu 15 Dec 2016 01:49:48] Sess_enable_heartbeats: explict = 1, thresh = 0, heartbeats_on = 0
[Thu 15 Dec 2016 01:49:48] G_handle_kill: #node_a#N010000000058 is killed
[Thu 15 Dec 2016 01:49:48] G_handle_kill in GOP
[Thu 15 Dec 2016 01:49:48] Daemon idle, exiting
Exit caused by Alarm(EXIT)
[Thu 15 Dec 2016 01:49:48] Sess: unlinked domain socket file /tmp/4803; ret=0

dbLog:

Connecting to spread at 4803
Connected to spread on local domain socket /tmp/4803
auto restart closing socket
Starting UDxSideProcess for language C++
with command line: /opt/vertica/bin/vertica-udx-C++ 4 v_event_node0002-68312:0x2 debug-log-off /home/dbadmin/event/v_event_node0002_catalog/UDxLogs 5
12/13/16 12:16:36 SP_connect: unable to connect via UNIX socket to /tmp/4803 (pid=68312): Error: No such file or directory
12/13/16 12:16:36 SP_connect: DEBUG: Auth list is: NULL
12/13/16 12:16:36 SP_connect: connected with private group(21 bytes): #node_a#N010000000058; mbox=13, pid=68312
12/14/16 19:49:31 SP_disconnect: mbox=13, pid=68312, send_group=#node_a#N010000000058

Also, we are using the vkafka connector.

Comments

  • K-safe = 1
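    As a quick sanity check, the current fault tolerance can be read from the SYSTEM table (a sketch; assumes vsql can connect locally):

    vsql -c "SELECT node_count, node_down_count, current_fault_tolerance FROM system;"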

  • Can you restart the DB?

    Does it come back up OK?

    Can you connect with ssh to all nodes?
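    A sketch of that restart, assuming the default install path and the database name mdb that appears later in the thread:

    # Restart the whole database from any node in the cluster
    /opt/vertica/bin/adminTools -t restart_db -d mdb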

  • After a restart, everything is fine.

    Yes, I can ssh to all the nodes.

  • So, is it solved? 

  • No, Vertica dies again after one or two days of solid work.

    I will check the logs after the next crash.

    Is there a way to restart the nodes automatically?
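    One option is the cluster restart policy, which lets spread bring failed nodes back up on its own. A sketch, assuming the database name mdb; the exact flags may vary by version, so check adminTools -t set_restart_policy --help:

    # Restart failed nodes automatically while the cluster stays K-safe
    /opt/vertica/bin/adminTools -t set_restart_policy -d mdb -p ksafe

    # Or restart a single down node by hand
    /opt/vertica/bin/adminTools -t restart_node -s 10.1.0.2 -d mdb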

  • Did you check for any network interruptions during the night? What I have seen is that latency is very important for Vertica.

  • Will check.

  • Latency: OK, very good idea.

    We have a 10G switch and the same network card in every node.

    I will measure latency and check for dropped packets, if they really exist.
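    A measurement sketch (the interface name eth0 is an assumption; substitute the private-network NIC):

    # Round-trip latency between private-network nodes
    ping -c 100 10.0.0.58

    # NIC-level drop and error counters
    ip -s link show eth0

    # Vertica's bundled network benchmark, if present in your version
    /opt/vertica/bin/vnetperf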

  • /opt/vertica/bin/adminTools -t view_cluster -x

     DB  | Host     | State
    -----+----------+-------
     mdb | 10.1.0.1 | DOWN
     mdb | 10.1.0.2 | DOWN
     mdb | 10.1.0.3 | DOWN

    And again...

  • Did you create /tmp/4803 yourself?

    Stale files in /tmp can cause issues!

    No, I don't create the spread file in /tmp myself.

    ll /tmp/
    total 20
    drwxrwxrwt 5 root root 4096 Dec 20 12:20 ./
    drwxr-xr-x 22 root root 4096 Sep 28 00:23 ../
    drwxrwxrwt 2 root root 4096 Dec 16 01:01 .ICE-unix/
    drwx------ 2 root root 4096 Dec 18 02:27 mc-root/

    It is empty there; no spread socket file.

    The node failed again.

    This is what I have in spread.log:

    [Tue 20 Dec 2016 04:39:44] Sess_read: received a heartbeat on 'node_b' ( mailbox 9 )
    [Tue 20 Dec 2016 04:39:44] Pushed eviction timeout back 600.000000s
    [Tue 20 Dec 2016 04:40:38] Sess_read: failed receiving header on session 9: ret 0: error: No such file or directory
    [Tue 20 Dec 2016 04:40:38] Sess_kill: killing session node_b ( mailbox 9 )
    [Tue 20 Dec 2016 04:40:38] G_handle_kill: #node_b#N010000000059 is killed
    [Tue 20 Dec 2016 04:40:38] G_handle_kill in GOP
    [Tue 20 Dec 2016 04:40:38] Daemon idle, exiting
    Exit caused by Alarm(EXIT)
    [Tue 20 Dec 2016 04:40:38] Sess: unlinked domain socket file /tmp/4803; ret=0


  • Conf_load_conf_file: using file: /home/dbadmin/event/v_event_node0001_catalog/spread.conf
    Setting active IP version to 0
    Successfully configured Segment 0 [10.0.0.57]:4803 with 1 procs:
    N010000000057: 10.0.0.57
    Successfully configured Segment 1 [10.0.0.58]:4803 with 1 procs:
    N010000000058: 10.0.0.58
    Successfully configured Segment 2 [10.0.0.59]:4803 with 1 procs:
    N010000000059: 10.0.0.59
    Connected to spread on local domain socket /tmp/4803
    auto restart closing socket
    Starting UDxSideProcess for language C++
    with command line: /opt/vertica/bin/vertica-udx-C++ 4 v_event_node0001-3528:0x2 debug-log-off /home/dbadmin/event/v_event_node0001_catalog/UDxLogs 5
    Starting UDxSideProcess for language C++
    with command line: /opt/vertica/bin/vertica-udx-C++ 4 v_event_node0001-3528:0xe debug-log-off /home/dbadmin/event/v_event_node0001_catalog/UDxLogs 5
    *** rdkafka_buf.h:365:rd_kafka_buf_write: assert: rkbuf->rkbuf_wof + len <= rkbuf->rkbuf_size ***
    12/19/16 16:53:33 SP_connect: unable to connect via UNIX socket to /tmp/4803 (pid=3528): Error: No such file or directory
    12/19/16 16:53:34 SP_connect: DEBUG: Auth list is: NULL
    12/19/16 16:53:34 SP_connect: connected with private group(21 bytes): #node_b#N010000000059; mbox=13, pid=3528

    This is what I have in the dbLog.

    Latency is stable.

    Network connection is stable too.

  • Is the disk space OK? Maybe the swap file takes all the disk.

  • KiB Swap: 31249404 total,        0 used, 31249404 free. 24408440 cached Mem

    /dev/sda3       1.8T   23G  1.7T   2% /home/dbadmin

    /dev/sda2       197G   12G  176G   7% /

    So we don't have a free-space problem.

  • OK, back to the basics.

    What OS do you use?

    Which Vertica version?

    Do you have two separate networks, one for the backend and one for clients?

    Can you delete the DB and recreate it? Although I think the cluster is going down for another reason.

  • Yes, I have a public and a private network.

    From the public network I receive the Kafka stream.

    The private network is used only for cluster communication.

  • A proposal:

    If you are on Vertica 7.1 or above, use just one network! It will be fine!
