Vertica crashes and it is impossible to restart it

Problem description:

  1- I have a 3-node Vertica cluster.

  2- The cluster crashed and cannot be restarted.

 

Description of actions done:

  1- Tried to restart using adminTools ==> Failure.

  2- Tried to recover to the last good epoch ==> Failure.

  3- Restarted each host from the command line, specifying the same epoch ==> Failure.

        - /opt/vertica/bin/vertica -D /data/<DB_NAME>/v_<DB_NAME>_node0001_catalog -C <DB_NAME> -n v_<DB_NAME>_node0001 -h xx.xx.xxx.xx -p 5433 -P 4803 -S 3238932

        - /opt/vertica/bin/vertica -D /data/<DB_NAME>/v_<DB_NAME>_node0001_catalog -C <DB_NAME> -n v_<DB_NAME>_node0001 -h yy.yy.yyy.yy -p 5433 -P 4803 -S 3238932

        - /opt/vertica/bin/vertica -D /data/<DB_NAME>/v_<DB_NAME>_node0001_catalog -C <DB_NAME> -n v_<DB_NAME>_node0001 -h zz.zz.zzz.zz -p 5433 -P 4803 -S 3238932

 

  4- Investigated the vertica.log files of the 3 nodes. These are the most relevant errors I collected:

 

2015-09-13 11:01:22.339 Spread Client:0x6fef300 [Comms] <INFO>   v_ossa_node0002 : RECOVER_ERROR

2015-09-13 11:01:22.339 Spread Client:0x6fef300 [Recover] <INFO> State change for node v_ossa_node0002: RECOVER_ERROR; catalog 3363474, recover to epoch 3238932

2015-09-11 15:22:37.015 Spread Client:0x9a64aa0 [Comms] <INFO>   v_ossa_node0001 : RECOVERED

2015-09-11 15:22:37.015 Spread Client:0x9a64aa0 [Comms] <INFO>   v_ossa_node0002 : RECOVER_ERROR

2015-09-11 15:22:37.015 Spread Client:0x9a64aa0 [Comms] <INFO>   v_ossa_node0003 : NEEDS_CATCHUP

2015-09-14 10:50:22.955 AnalyzeRowCount:0x7f0a7c0141a0 <ERROR> @v_ossa_node0001: {threadShim} 55000/4860: System is not k-safe. DDL is disallowed

        LOCATION:  upgradeToDDLTransaction, /scratch_a/release/30493/vbuild/vertica/Transaction/TransAPI.cpp:3229

2015-09-14 10:50:22.955 AnalyzeRowCount:0x7f0a7c0141a0 <ERROR> @v_ossa_node0001: {threadShim} 55000/4860: System is not k-safe. DDL is disallowed

        LOCATION:  upgradeToDDLTransaction, /scratch_a/release/30493/vbuild/vertica/Transaction/TransAPI.cpp:3229

2015-09-14 10:50:11.722 CatchUp:0x7fdf54011aa0 <ERROR> @v_ossa_node0002: {runRecover} VX001/3212: Error during recovery running Recovery: split query: (Table: OSSA_FAULT.SUMM_HOURLY_NETWORK) (Projection: OSSA_FAULT.SUMM_HOURLY_NETWORK_v1_b0) (Epoch: 0x316c15-0x8000000000000000): Index vs. Block size mismatch failed in FileColumnReader: /data/OSSA/v_ossa_node0002_data/775/49539596524455775/49539596524455775_0.fdb

Expected block size 96, actual 5 (at /scratch_a/release/30493/vbuild/vertica/SAL/FileColumnReader.cpp:227)

        LOCATION:  runQueries, /scratch_a/release/30493/vbuild/vertica/Recover/CatchUp.cpp:2356

2015-09-14 10:50:11.722 CatchUp:0x7fdf54011aa0 <ERROR> @v_ossa_node0002: {runRecover} VX001/3212: Error during recovery running Recovery: split query: (Table: OSSA_FAULT.SUMM_HOURLY_NETWORK) (Projection: OSSA_FAULT.SUMM_HOURLY_NETWORK_v1_b0) (Epoch: 0x316c15-0x8000000000000000): Index vs. Block size mismatch failed in FileColumnReader: /data/OSSA/v_ossa_node0002_data/775/49539596524455775/49539596524455775_0.fdb

Expected block size 96, actual 5 (at /scratch_a/release/30493/vbuild/vertica/SAL/FileColumnReader.cpp:227)

        LOCATION:  runQueries, /scratch_a/release/30493/vbuild/vertica/Recover/CatchUp.cpp:2356

2015-09-14 10:50:11.722 CatchUp:0x7fdf54011aa0 [Recover] <WARNING> Error during recovery for node v_ossa_node0002

2015-09-14 10:50:11.722 CatchUp:0x7fdf54011aa0 [Recover] <INFO> Checking Deps:Down bits: 100 Deps:

011 - cnt: 45

101 - cnt: 45

110 - cnt: 45

111 - cnt: 27

2015-09-14 10:50:11.722 CatchUp:0x7fdf54011aa0 <LOG> @v_ossa_node0002: 00000/3298: Event Posted: Event Code:6 Event Id:3 Event Severity: Informational [6] PostedTimestamp: 2015-09-14 10:50:11.722545 ExpirationTimestamp: 2083-10-02 14:04:18.722545 EventCodeDescription: Node State Change ProblemDescription: Changing node v_ossa_node0002 startup state to RECOVER_ERROR DatabaseName: OSSA Hostname: fasvrta2

2015-09-14 10:50:11.723 CatchUp:0x7fdf54011aa0 <LOG> @v_ossa_node0002: 00000/3293: Event Cleared: Event Code:6 Event Id:1 Event Severity: Informational [6] PostedTimestamp: 2015-09-14 10:50:11.723318 ExpirationTimestamp: 2015-09-14 10:50:11.723318 EventCodeDescription: Node State Change ProblemDescription: Changing node v_ossa_node0002 leaving startup state RECOVERING DatabaseName: OSSA Hostname: fasvrta2

2015-09-14 10:50:11.728 DistCall Dispatch:0x7fdf4c001e70 [Recover] <INFO> Changing node v_ossa_node0002 startup state from RECOVER_ERROR to RECOVERING

2015-09-14 10:50:13.185 CatchUp:0x7fdf6401a030 <ERROR> @v_ossa_node0002: {runRecover} VX001/3212: Error during recovery running Recovery: split query: (Table: OSSA_FAULT.SUMM_HOURLY_NETWORK) (Projection: OSSA_FAULT.SUMM_HOURLY_NETWORK_v1_b0) (Epoch: 0x316c15-0x8000000000000000): Index vs. Block size mismatch failed in FileColumnReader: /data/OSSA/v_ossa_node0002_data/775/49539596524455775/49539596524455775_0.fdb

Expected block size 96, actual 5 (at /scratch_a/release/30493/vbuild/vertica/SAL/FileColumnReader.cpp:227)

        LOCATION:  runQueries, /scratch_a/release/30493/vbuild/vertica/Recover/CatchUp.cpp:2356

2015-09-14 10:50:13.186 CatchUp:0x7fdf6401a030 <LOG> @v_ossa_node0002: 00000/3298: Event Posted: Event Code:6 Event Id:3 Event Severity: Informational [6] PostedTimestamp: 2015-09-14 10:50:13.185955 ExpirationTimestamp: 2083-10-02 14:04:20.185955 EventCodeDescription: Node State Change ProblemDescription: Changing node v_ossa_node0002 startup state to RECOVER_ERROR DatabaseName: OSSA Hostname: fasvrta2

2015-09-14 10:50:13.186 CatchUp:0x7fdf6401a030 <LOG> @v_ossa_node0002: 00000/3298: Event Posted: Event Code:6 Event Id:3 Event Severity: Informational [6] PostedTimestamp: 2015-09-14 10:50:13.185955 ExpirationTimestamp: 2083-10-02 14:04:20.185955 EventCodeDescription: Node State Change ProblemDescription: Changing node v_ossa_node0002 startup state to RECOVER_ERROR DatabaseName: OSSA Hostname: fasvrta2

2015-09-14 10:50:13.186 CatchUp:0x7fdf6401a030 <LOG> @v_ossa_node0002: 00000/3293: Event Cleared: Event Code:6 Event Id:1 Event Severity: Informational [6] PostedTimestamp: 2015-09-14 10:50:13.186692 ExpirationTimestamp: 2015-09-14 10:50:13.186692 EventCodeDescription: Node State Change ProblemDescription: Changing node v_ossa_node0002 leaving startup state RECOVERING DatabaseName: OSSA Hostname: fasvrta2

2015-09-14 10:50:40.489 Main:0x68fc930 <LOG> @v_ossa_node0002: 00000/3293: Event Cleared: Event Code:6 Event Id:3 Event Severity: Informational [6] PostedTimestamp: 2015-09-14 10:50:40.489366 ExpirationTimestamp: 2015-09-14 10:50:40.489366 EventCodeDescription: Node State Change ProblemDescription: Changing node v_ossa_node0002 leaving startup state RECOVER_ERROR DatabaseName: OSSA Hostname: fasvrta2

2015-09-14 10:50:40.490 Main:0x68fc930 [Recover] <INFO> Changing node v_ossa_node0002 startup state from RECOVER_ERROR to UNSAFE
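
Since the same CatchUp error repeats many times in the log, a small script can help reduce it to the facts that matter: which node, which projection, which file, and what block sizes were reported. The following is only a minimal sketch, assuming the error lines keep the exact layout shown in the excerpts above; the function name `summarize_recovery_errors` and the regular expressions are illustrative and not part of any Vertica tooling.

```python
import re

# Matches the CatchUp recovery error as it appears in the excerpts above.
# Assumption: the "(Table: ...) (Projection: ...) (Epoch: ...)" layout is stable.
RECOVERY_ERROR = re.compile(
    r"@(?P<node>\S+): \{runRecover\}.*?"
    r"\(Table: (?P<table>[^)]+)\) "
    r"\(Projection: (?P<projection>[^)]+)\) "
    r"\(Epoch: (?P<epoch_start>0x[0-9a-fA-F]+)-(?P<epoch_end>0x[0-9a-fA-F]+)\): "
    r"(?P<reason>.+?) in FileColumnReader: (?P<path>\S+)"
)

# The block sizes are logged on the line that follows the error.
BLOCK_SIZE = re.compile(
    r"Expected block size (?P<expected>\d+), actual (?P<actual>\d+)"
)


def summarize_recovery_errors(lines):
    """Return one dict per distinct (node, projection, file) recovery failure."""
    seen = {}
    last = None
    for line in lines:
        m = RECOVERY_ERROR.search(line)
        if m:
            key = (m["node"], m["projection"], m["path"])
            # Keep the first occurrence and count the repeats.
            last = seen.setdefault(key, {**m.groupdict(), "count": 0})
            last["count"] += 1
            continue
        b = BLOCK_SIZE.search(line)
        if b and last is not None:
            # Attach the block sizes to the most recent error seen.
            last["expected_block_size"] = int(b["expected"])
            last["actual_block_size"] = int(b["actual"])
    return list(seen.values())
```

Run over the lines of vertica.log (for example `summarize_recovery_errors(open(path))`, where `path` is wherever the log lives on the node), this should collapse the repeated errors above into a single entry for v_ossa_node0002 and the projection OSSA_FAULT.SUMM_HOURLY_NETWORK_v1_b0.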

Comments

  • Hi

     

    At first glance, this looks like an issue with OSSA_FAULT.SUMM_HOURLY_NETWORK_v1_b0, which may contain corrupted data and is causing recovery to fail on node 02.

    I would suggest re-creating this projection under a different name, refreshing it, and then dropping the corrupted one.

     

    1. Use export_objects to generate the DDL for re-creating the projection:

     

    vsql => select export_objects(' ','<schema>.<tablename>');

     

    2. Refresh it:
    vsql => SELECT REFRESH('OSSA_FAULT.SUMM_HOURLY_NETWORK_v1_b0_new');

     

    3. Once refreshed, drop the original (mismatched) projection, for example:


    vsql => DROP PROJECTION OSSA_FAULT.SUMM_HOURLY_NETWORK_v1_b0;

     

    Once that is done, try restarting the database again.

     

    Snippet of the vertica.log
    =============================

    2015-09-14 10:50:13.185 CatchUp:0x7fdf6401a030 <ERROR> @v_ossa_node0002: {runRecover} VX001/3212: Error during recovery running Recovery: split query: (Table: OSSA_FAULT.SUMM_HOURLY_NETWORK) (Projection: OSSA_FAULT.SUMM_HOURLY_NETWORK_v1_b0) (Epoch: 0x316c15-0x8000000000000000): Index vs. Block size mismatch failed in FileColumnReader: /data/OSSA/v_ossa_node0002_data/775/49539596524455775/49539596524455775_0.fdb
    Expected block size 96, actual 5 (at /scratch_a/release/30493/vbuild/vertica/SAL/FileColumnReader.cpp:227)
    LOCATION: runQueries, /scratch_a/release/30493/vbuild/vertica/Recover/CatchUp.cpp:2356
    2015-09-14 10:50:13.186 CatchUp:0x7fdf6401a030 <LOG> @v_ossa_node0002: 00000/3298: Event Posted: Event Code:6 Event Id:3 Event Severity: Informational [6] PostedTimestamp: 2015-09-14 10:50:13.185955 ExpirationTimestamp: 2083-10-02 14:04:20.185955 EventCodeDescription: Node State Change ProblemDescription: Changing node v_ossa_node0002 startup state to RECOVER_ERROR DatabaseName: OSSA Hostname: fasvrta2

     

    Regards

    Rahul Choudhary

     

  • Thank you Rahul,

     

       The problem here is that all nodes are down, so it is impossible for me to run any SQL statement using vsql...

  • Hi Ali

     

    Ah, I see. I missed that part. In that case it will require catalog editing, which can only be done with the help of Vertica Support.

     

    So I would suggest raising a support case if you have a valid support license.

     

    Otherwise, you can try restarting the database with the force option and wait for it to offer an epoch value.

     

    $ admintools -t restart_node -F  -d <dbname> 

     

    Exit from that and note the closest epoch it offers, then compare it with the value shown by "Rollback Database To Last Good Epoch" in the Advanced Menu of the admintools utility.

     

    If the values look consistent and close enough, you can restart your database from that epoch.
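
When comparing epoch values, note that the recovery error in vertica.log prints its epoch range in hexadecimal (0x316c15-0x8000000000000000), while the startup command and the "recover to epoch" message use decimal (3238932). A minimal sketch of the conversion, with hypothetical helper names that are not part of any Vertica tool:

```python
def epoch_from_log(hex_epoch: str) -> int:
    """Convert an epoch printed in hex in vertica.log (e.g. '0x316c15') to decimal."""
    return int(hex_epoch, 16)


def epoch_gap(log_epoch_hex: str, restart_epoch: int) -> int:
    """Distance between the epoch in the log and the epoch used for the restart."""
    return epoch_from_log(log_epoch_hex) - restart_epoch


print(epoch_from_log("0x316c15"))      # 3238933
print(epoch_gap("0x316c15", 3238932))  # 1
```

With the values from this thread, the failing recovery range starts at epoch 3238933, one past the 3238932 given to the restart, so the two numbers at least appear consistent with each other.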

     

    Let me know how it goes.

     

    Regards

    Rahul Choudhary

  • Hi Rahul,

     

       I have opened a case with the support team. The investigation is ongoing.

       FYI, I have already tried the force start option, and adminTools does find a good epoch.

       But the restart still fails with that epoch...

       I will keep you posted.

      

  • Hi Ali

     

    If you can give me the number of the case you opened with Vertica Support, I can look into it as well and may be able to help.

     

    Regards

    Rahul 

  • It is Case 00048129: Support for Vertica Needed.

  • Hi Ali

     

    I can see that Stefan from Support is assisting you on this case, and I can vouch that he is well able to guide you to a resolution. I would suggest you continue working with him there, and keep me updated on the progress if needed.

     

    Regards

    Rahul

  • Hi Rahul,

     

      Sure. Let's keep in touch, and thank you.
