
LOSTCONTACT node startup

Hi,

I am testing backup and restore scenarios, and here is what I did:

1) On node 1, deleted the data files and killed the Vertica process.

2) Node 2 and node 3 are still running.

3) Restored an old backup on node 1:

/opt/vertica/bin/vbr.py --task restore --config-file /opt/vertica/config/Full_Jan_2.ini --nodes=v_verlabqa01_node0001


[==================================================] 100%

All child processes terminated successfully.

restore done!


4) Tried restarting node 1, but it does not come up. I then tried a forced startup, and the status still shows LOSTCONTACT.

Why is this the case? Should the database recover and start up?

How do I bring the database on node 1 back to a working condition?


[dbadmin@genalblabdb07n1 v_verlabqa01_node0001_catalog]$ admintools -t restart_node -F -s genalblabdb07n1 -d VERLABQA01

Info: no password specified, using none

*** Restarting nodes for database VERLABQA01 ***

        restart host 15.224.232.115 with catalog v_verlabqa01_node0001_catalog and data v_verlabqa01_node0001_data

        issuing multi-node restart

        Node Status: v_verlabqa01_node0001: (DOWN)

        Node Status: v_verlabqa01_node0001: (INITIALIZING)

        Node Status: v_verlabqa01_node0001: (RECOVERING)

        Node Status: v_verlabqa01_node0001: (LOSTCONTACT)

Nodes UP: v_verlabqa01_node0002, v_verlabqa01_node0003

Nodes DOWN: v_verlabqa01_node0001 (may be still initializing).

        result of multi-node restart:  7

Restart Nodes result:  7

[dbadmin@genalblabdb07n1 v_verlabqa01_node0001_catalog]$ tail -f vertica.log
2014-01-14 16:09:41.267 unknown:0x7f53adb9e700 [SAL] <INFO> Large LRU usage: 0 free 0 in use
2014-01-14 16:09:41.267 unknown:0x7f53adb9e700 [SAL] <INFO> Typical LRU usage: 0 free 0 in use
2014-01-14 16:09:41.267 unknown:0x7f53adb9e700 [SAL] <INFO> Large LRU usage: 0 free 0 in use
2014-01-14 16:09:41.267 unknown:0x7f53adb9e700 [SAL] <INFO> Typical LRU usage: 0 free 0 in use
2014-01-14 16:09:41.267 unknown:0x7f53adb9e700 [SAL] <INFO> Large LRU usage: 0 free 0 in use
2014-01-14 16:09:41.267 unknown:0x7f53adb9e700 [Init] <INFO> Global pool memory usage: NewPool(0x4735320) 'GlobalPool': totalDtors 0 totalSize 65011712 (22546864 unused) totalChunks 5
2014-01-14 16:09:41.267 unknown:0x7f53adb9e700 [Init] <INFO> SAL global pool memory usage: NewPool(0x4725380) 'SALGlobalPool': totalDtors 0 totalSize 2097152 (2096864 unused) totalChunks 1
2014-01-14 16:09:41.267 unknown:0x7f53adb9e700 [Init] <INFO> SS::stopPoller()
2014-01-14 16:09:41.267 unknown:0x7f53adb9e700 [Init] <INFO> DC::shutDown()
2014-01-14 16:09:41.267 unknown:0x7f53adb9e700 [Init] <INFO> Shutdown complete. Exiting.




Comments

    LOSTCONTACT can have several causes.
    Below are some things to check when a node in the Vertica cluster reports LOSTCONTACT:
    1) Restart spread on all the nodes and try starting the database.
    If the spread restart fails, fix that problem first.
    2) Try starting the database from a node other than node 1.
    If it still fails, check vertica.log for any additional messages printed before the node shut down.
    3) Check that there is sufficient free space on the file systems where the CATALOG and DATA directories are mounted.
    The recommended minimum is 20-30% free space for the database to start up; for better performance, 50-60% free space is recommended.
    4) Often the catalog on one of the nodes becomes corrupted, which leads to this issue. This can be identified by looking in vertica.log.
    Once you have identified the node with the corrupted catalog, stop spread on that node and start the database.
    Once the database is up, start spread on that node and force-restart the node using admintools. This recovers all the files on that node if they are corrupt.
    If the node fails to start with the force option, advance the AHM using select MAKE_AHM_NOW('true'); and then use the same force-restart option. This recovers the node from scratch.
    5) This could also be caused by a corrupted data file on the file system; follow the same steps as in step 4.
    6) Finally, if none of the above works, the network may be at fault (for example, faulty network cards), and that needs to be fixed.
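
    As a rough sketch of the checks in steps 2 and 3, the two helper functions below may save some typing. They are hypothetical helpers, not part of Vertica or admintools. The first reports the free-space percentage of whatever file system holds a given path, and the second prints the most recent log lines that are not tagged <INFO>, on the assumption that warnings and errors carry a different tag in the same position as the <INFO> tag seen in the log excerpt above.

```shell
#!/usr/bin/env bash
# Hypothetical helpers for steps 2 and 3; the paths you pass in are your own.

# Print the free-space percentage of the file system that holds path $1.
free_pct() {
    local used
    # df -P prints one POSIX-format line per file system;
    # field 5 of the second line is the used capacity, e.g. "42%".
    used=$(df -P "$1" | awk 'NR==2 { sub(/%/, "", $5); print $5 }')
    echo $((100 - used))
}

# Print the last $2 lines (default 50) of log file $1 that are not tagged <INFO>.
# Assumption: non-informational messages use a different <LEVEL> tag.
non_info_lines() {
    grep -v '<INFO>' "$1" | tail -n "${2:-50}"
}
```

    For example, after sourcing the file, run free_pct on the catalog and data directories of the affected node and compare the result with the 20-30% guideline in step 3, then run non_info_lines vertica.log from the catalog directory. Since tail -f shows only the last few lines of the log, which here are the shutdown messages, filtering out the <INFO> lines may surface the earlier message that explains why the node went down.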
