Vertica crashes and it is impossible to restart it
Problem description:
1- I have a 3-node Vertica cluster.
2- The cluster crashed and I have been unable to restart it.
Description of actions done:
1- Tried to restart using adminTools ==> FAILURE
2- Tried to recover to the last good epoch ==> FAILURE
3- Restarted each host from the command line, specifying the same epoch ==> FAILURE
- /opt/vertica/bin/vertica -D /data/<DB_NAME>/v_<DB_NAME>_node0001_catalog -C <DB_NAME> -n v_<DB_NAME>_node0001 -h xx.xx.xxx.xx -p 5433 -P 4803 -S 3238932
- /opt/vertica/bin/vertica -D /data/<DB_NAME>/v_<DB_NAME>_node0001_catalog -C <DB_NAME> -n v_<DB_NAME>_node0001 -h yy.yy.yyy.yy -p 5433 -P 4803 -S 3238932
- /opt/vertica/bin/vertica -D /data/<DB_NAME>/v_<DB_NAME>_node0001_catalog -C <DB_NAME> -n v_<DB_NAME>_node0001 -h zz.zz.zzz.zz -p 5433 -P 4803 -S 3238932
4- Investigation of vertica.log files of the 3 nodes. These are the most relevant errors I collected:
2015-09-13 11:01:22.339 Spread Client:0x6fef300 [Comms] <INFO> v_ossa_node0002 : RECOVER_ERROR |
2015-09-13 11:01:22.339 Spread Client:0x6fef300 [Recover] <INFO> State change for node v_ossa_node0002: RECOVER_ERROR; catalog 3363474, recover to epoch 3238932 |
2015-09-11 15:22:37.015 Spread Client:0x9a64aa0 [Comms] <INFO> v_ossa_node0001 : RECOVERED
2015-09-11 15:22:37.015 Spread Client:0x9a64aa0 [Comms] <INFO> v_ossa_node0002 : RECOVER_ERROR
2015-09-11 15:22:37.015 Spread Client:0x9a64aa0 [Comms] <INFO> v_ossa_node0003 : NEEDS_CATCHUP |
2015-09-14 10:50:22.955 AnalyzeRowCount:0x7f0a7c0141a0 <ERROR> @v_ossa_node0001: {threadShim} 55000/4860: System is not k-safe. DDL is disallowed LOCATION: upgradeToDDLTransaction, /scratch_a/release/30493/vbuild/vertica/Transaction/TransAPI.cpp:3229 |
2015-09-14 10:50:11.722 CatchUp:0x7fdf54011aa0 <ERROR> @v_ossa_node0002: {runRecover} VX001/3212: Error during recovery running Recovery: split query: (Table: OSSA_FAULT.SUMM_HOURLY_NETWORK) (Projection: OSSA_FAULT.SUMM_HOURLY_NETWORK_v1_b0) (Epoch: 0x316c15-0x8000000000000000): Index vs. Block size mismatch failed in FileColumnReader: /data/OSSA/v_ossa_node0002_data/775/49539596524455775/49539596524455775_0.fdb Expected block size 96, actual 5 (at /scratch_a/release/30493/vbuild/vertica/SAL/FileColumnReader.cpp:227) LOCATION: runQueries, /scratch_a/release/30493/vbuild/vertica/Recover/CatchUp.cpp:2356 |
2015-09-14 10:50:11.722 CatchUp:0x7fdf54011aa0 <ERROR> @v_ossa_node0002: {runRecover} VX001/3212: Error during recovery running Recovery: split query: (Table: OSSA_FAULT.SUMM_HOURLY_NETWORK) (Projection: OSSA_FAULT.SUMM_HOURLY_NETWORK_v1_b0) (Epoch: 0x316c15-0x8000000000000000): Index vs. Block size mismatch failed in FileColumnReader: /data/OSSA/v_ossa_node0002_data/775/49539596524455775/49539596524455775_0.fdb Expected block size 96, actual 5 (at /scratch_a/release/30493/vbuild/vertica/SAL/FileColumnReader.cpp:227) LOCATION: runQueries, /scratch_a/release/30493/vbuild/vertica/Recover/CatchUp.cpp:2356 2015-09-14 10:50:11.722 CatchUp:0x7fdf54011aa0 [Recover] <WARNING> Error during recovery for node v_ossa_node0002 2015-09-14 10:50:11.722 CatchUp:0x7fdf54011aa0 [Recover] <INFO> Checking Deps:Down bits: 100 Deps: 011 - cnt: 45 101 - cnt: 45 110 - cnt: 45 111 - cnt: 27 |
2015-09-14 10:50:11.722 CatchUp:0x7fdf54011aa0 <LOG> @v_ossa_node0002: 00000/3298: Event Posted: Event Code:6 Event Id:3 Event Severity: Informational [6] PostedTimestamp: 2015-09-14 10:50:11.722545 ExpirationTimestamp: 2083-10-02 14:04:18.722545 EventCodeDescription: Node State Change ProblemDescription: Changing node v_ossa_node0002 startup state to RECOVER_ERROR DatabaseName: OSSA Hostname: fasvrta2 2015-09-14 10:50:11.723 CatchUp:0x7fdf54011aa0 <LOG> @v_ossa_node0002: 00000/3293: Event Cleared: Event Code:6 Event Id:1 Event Severity: Informational [6] PostedTimestamp: 2015-09-14 10:50:11.723318 ExpirationTimestamp: 2015-09-14 10:50:11.723318 EventCodeDescription: Node State Change ProblemDescription: Changing node v_ossa_node0002 leaving startup state RECOVERING DatabaseName: OSSA Hostname: fasvrta2 |
2015-09-14 10:50:11.728 DistCall Dispatch:0x7fdf4c001e70 [Recover] <INFO> Changing node v_ossa_node0002 startup state from RECOVER_ERROR to RECOVERING |
2015-09-14 10:50:13.185 CatchUp:0x7fdf6401a030 <ERROR> @v_ossa_node0002: {runRecover} VX001/3212: Error during recovery running Recovery: split query: (Table: OSSA_FAULT.SUMM_HOURLY_NETWORK) (Projection: OSSA_FAULT.SUMM_HOURLY_NETWORK_v1_b0) (Epoch: 0x316c15-0x8000000000000000): Index vs. Block size mismatch failed in FileColumnReader: /data/OSSA/v_ossa_node0002_data/775/49539596524455775/49539596524455775_0.fdb Expected block size 96, actual 5 (at /scratch_a/release/30493/vbuild/vertica/SAL/FileColumnReader.cpp:227) LOCATION: runQueries, /scratch_a/release/30493/vbuild/vertica/Recover/CatchUp.cpp:2356 |
2015-09-14 10:50:13.186 CatchUp:0x7fdf6401a030 <LOG> @v_ossa_node0002: 00000/3298: Event Posted: Event Code:6 Event Id:3 Event Severity: Informational [6] PostedTimestamp: 2015-09-14 10:50:13.185955 ExpirationTimestamp: 2083-10-02 14:04:20.185955 EventCodeDescription: Node State Change ProblemDescription: Changing node v_ossa_node0002 startup state to RECOVER_ERROR DatabaseName: OSSA Hostname: fasvrta2 2015-09-14 10:50:13.186 CatchUp:0x7fdf6401a030 <LOG> @v_ossa_node0002: 00000/3293: Event Cleared: Event Code:6 Event Id:1 Event Severity: Informational [6] PostedTimestamp: 2015-09-14 10:50:13.186692 ExpirationTimestamp: 2015-09-14 10:50:13.186692 EventCodeDescription: Node State Change ProblemDescription: Changing node v_ossa_node0002 leaving startup state RECOVERING DatabaseName: OSSA Hostname: fasvrta2 |
2015-09-14 10:50:40.489 Main:0x68fc930 <LOG> @v_ossa_node0002: 00000/3293: Event Cleared: Event Code:6 Event Id:3 Event Severity: Informational [6] PostedTimestamp: 2015-09-14 10:50:40.489366 ExpirationTimestamp: 2015-09-14 10:50:40.489366 EventCodeDescription: Node State Change ProblemDescription: Changing node v_ossa_node0002 leaving startup state RECOVER_ERROR DatabaseName: OSSA Hostname: fasvrta2 |
2015-09-14 10:50:40.490 Main:0x68fc930 [Recover] <INFO> Changing node v_ossa_node0002 startup state from RECOVER_ERROR to UNSAFE |
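For anyone triaging a similar log, a minimal shell sketch like the one below can collapse the repeated <ERROR> entries and list the node state transitions in order. The function names are hypothetical (they are not Vertica tools), and the sketch assumes the log format shown above and a grep that supports -o (GNU and BSD grep do).

```shell
#!/bin/sh
# Hypothetical helpers for skimming a vertica.log; not part of Vertica itself.

# Read a log on stdin and print each distinct <ERROR> message with a count.
# Everything up to and including "<ERROR> " (timestamp, thread name) is
# stripped so the same error logged at different times collapses to one line.
summarize_errors() {
    grep '<ERROR>' | sed 's/^.*<ERROR> //' | sort | uniq -c | sort -rn
}

# Read a log on stdin and print node startup state transitions in log order.
state_changes() {
    grep -o 'Changing node [^ ]* startup state from [A-Z_]* to [A-Z_]*'
}
```

Typical use would be `summarize_errors < /path/to/vertica.log` and `state_changes < /path/to/vertica.log` on each node, where the path is a placeholder for that node's log file.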
Comments
Hi
At first glance, this looks like an issue with the projection OSSA_FAULT.SUMM_HOURLY_NETWORK_v1_b0 (which might contain corrupted data) causing recovery to fail on node 02.
I would suggest re-creating this projection under a different name, refreshing it, and then dropping the corrupted one:
1) Use export_objects to get the projection's definition, then re-create it under a new name:
vsql => select export_objects('','<schema>.<tablename>');
2) Refresh it (REFRESH takes the table name, not the projection name):
vsql => SELECT REFRESH('OSSA_FAULT.SUMM_HOURLY_NETWORK');
3) Once refreshed, drop the original/mismatched projection (example statement):
vsql => DROP PROJECTION OSSA_FAULT.SUMM_HOURLY_NETWORK_v1_b0;
Once that is done, try restarting the database again.
Snippet of the vertica.log
=============================
2015-09-14 10:50:13.185 CatchUp:0x7fdf6401a030 <ERROR> @v_ossa_node0002: {runRecover} VX001/3212: Error during recovery running Recovery: split query: (Table: OSSA_FAULT.SUMM_HOURLY_NETWORK) (Projection: OSSA_FAULT.SUMM_HOURLY_NETWORK_v1_b0) (Epoch: 0x316c15-0x8000000000000000): Index vs. Block size mismatch failed in FileColumnReader: /data/OSSA/v_ossa_node0002_data/775/49539596524455775/49539596524455775_0.fdb
Expected block size 96, actual 5 (at /scratch_a/release/30493/vbuild/vertica/SAL/FileColumnReader.cpp:227)
LOCATION: runQueries, /scratch_a/release/30493/vbuild/vertica/Recover/CatchUp.cpp:2356
2015-09-14 10:50:13.186 CatchUp:0x7fdf6401a030 <LOG> @v_ossa_node0002: 00000/3298: Event Posted: Event Code:6 Event Id:3 Event Severity: Informational [6] PostedTimestamp: 2015-09-14 10:50:13.185955 ExpirationTimestamp: 2083-10-02 14:04:20.185955 EventCodeDescription: Node State Change ProblemDescription: Changing node v_ossa_node0002 startup state to RECOVER_ERROR DatabaseName: OSSA Hostname: fasvrta2
Regards
Rahul Choudhary
Thank you Rahul,
The problem is that all nodes are down, so I cannot run any SQL statement through vsql...
Hi Ali
Ah, I see. I missed that part. In that case it will require catalog editing, which can only be done with the help of Vertica Support.
So I would suggest raising a support case if you have a valid support license.
Otherwise, you can try restarting the database with the force option and wait for it to offer an epoch value.
$ admintools -t restart_node -F -d <dbname>
Exit from that, note whether it offers a closest epoch, and compare it with the value shown under "Rollback Database To Last Good Epoch" in the Advanced Menu of the admintools utility.
If the values look consistent and close enough, you can restart your database from that epoch.
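When comparing epochs, note that the recovery error in the log prints its epoch range in hexadecimal (0x316c15), while the -S option and the "recover to epoch" message use decimal (3238932). A quick conversion, sketched here with a hypothetical helper name, puts both on the same scale:

```shell
#!/bin/sh
# Hypothetical helper: convert a hexadecimal epoch, as printed in the
# vertica.log recovery error, to decimal for comparison with admintools.
hex_epoch_to_dec() {
    printf '%d\n' "$1"
}

hex_epoch_to_dec 0x316c15   # prints 3238933
```

So the failing recovery range starts at epoch 3238933, which is one epoch after the 3238932 used for the restart attempts.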
Let me know how it goes.
Regards
Rahul Choudhary
Hi Rahul,
I have opened a case with support team. The investigation is ongoing.
FYI, I have already tried the force start option, and adminTools does find a good epoch.
But the restart still fails with that epoch...
I will keep you posted.
Hi Ali
If you can give me the number of the case you opened with Vertica Support, I can probably look into it as well and provide some help.
Regards
Rahul
It is Case 00048129: Support for Vertica Needed.
Hi Ali
I can see that Stefan from Support is assisting you on this case, and I can vouch that he is capable of guiding you to a resolution. I would suggest you continue working with him there and, if needed, keep me updated on the progress.
Regards
Rahul
Hi Rahul,
Sure. I will keep in touch, and thank you.