Cluster crashing during projection refresh

We recently added a node to our cluster. The DBD didn't rebalance the projections we had created manually onto the new node (presumably because they were manual rather than auto-generated table projections?), so we are creating new versions of them ourselves and then dropping the old ones.
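
Roughly what we are doing for each of the manual projections is below. The "_new" name, column list and segmentation clause are simplified for illustration here, not the exact DDL:

-- create a replacement projection segmented across all nodes, including the new one
-- (the event_time / event_id columns are placeholders)
CREATE PROJECTION public.event_order_project_time_new
AS SELECT * FROM public.event
ORDER BY event_time
SEGMENTED BY HASH(event_id) ALL NODES KSAFE 1;

-- then, once the new projection has refreshed and is up to date:
DROP PROJECTION public.event_order_project_time_b0;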

 

Starting a refresh of the new projections works fine, but after about 18 hours some of the nodes crash, and it is consistently the same nodes (all but one). There is nothing useful in vertica.log: the nodes behave normally, answering queries etc., then the log entries simply stop and the process is gone.
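
For reference, the refresh itself is started with the standard built-in calls, nothing exotic, along the lines of:

-- queue a background refresh of any out-of-date projections
SELECT START_REFRESH();

-- or a foreground refresh of just the anchor table
SELECT REFRESH('public.event');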

 

For example, here is the tail of vertica.log from one of the crashed nodes:

 

2016-04-22 03:04:22.032 AnalyzeRowCount:0x7f0db4012830-c000000001ea8d [Txn] <INFO> Begin Txn: c000000001ea8d 'getRowCountsForProj'
2016-04-22 03:04:22.046 AnalyzeRowCount:0x7f0db4012830-c000000001ea8d [Txn] <INFO> Rollback Txn: c000000001ea8d 'getRowCountsForProj'
2016-04-22 03:04:22.382 AnalyzeRowCount:0x7f0db4012830-c000000001ea8e [Txn] <INFO> Begin Txn: c000000001ea8e 'do_tm_task row count analyze'
2016-04-22 03:04:22.383 AnalyzeRowCount:0x7f0db4012830-c000000001ea8e [Txn] <INFO> Rollback Txn: c000000001ea8e 'do_tm_task row count analyze'
2016-04-22 03:04:23.159 AnalyzeRowCount:0x7f0db4012830-c000000001ea8f [Txn] <INFO> Begin Txn: c000000001ea8f 'static void CAT::CatalogQueries::reserveJVMForProjectionIfReqd(OidSet)'
2016-04-22 03:04:23.159 AnalyzeRowCount:0x7f0db4012830-c000000001ea8f [Txn] <INFO> Rollback Txn: c000000001ea8f 'static void CAT::CatalogQueries::reserveJVMForProjectionIfReqd(OidSet)'
2016-04-22 03:04:23.159 AnalyzeRowCount:0x7f0db4012830 [Command] <INFO> TMTask: row count analyze - start
2016-04-22 03:04:23.159 AnalyzeRowCount:0x7f0db4012830 [Command] <INFO> TMTask: row count analyze - Done
2016-04-22 03:04:23.160 AnalyzeRowCount:0x7f0db4012830 [Util] <INFO> Task 'AnalyzeRowCount' enabled
2016-04-22 03:04:25.000 DiskSpaceRefresher:0x7f0db4012520 [Util] <INFO> Task 'DiskSpaceRefresher' enabled
2016-04-22 03:04:35.000 DiskSpaceRefresher:0x7f0db4012280 [Util] <INFO> Task 'DiskSpaceRefresher' enabled
2016-04-22 03:04:36.000 TM Moveout:0x7f0db40149e0-c000000001ea90 [Txn] <INFO> Begin Txn: c000000001ea90 'Moveout: Tuple Mover'
2016-04-22 03:04:36.000 TM Moveout:0x7f0db40149e0-c000000001ea90 [Txn] <INFO> Rollback Txn: c000000001ea90 'Moveout: Tuple Mover'
2016-04-22 03:04:36.000 TM Moveout:0x7f0db40149e0 [Util] <INFO> Task 'TM Moveout' enabled
2016-04-22 03:04:45.000 DiskSpaceRefresher:0x7f0db4012830 [Util] <INFO> Task 'DiskSpaceRefresher' enabled
2016-04-22 03:04:55.000 DiskSpaceRefresher:0x7f0db4012280 [Util] <INFO> Task 'DiskSpaceRefresher' enabled

... Then silence.

 

The other nodes that crashed show the same pattern in their logs.

 

From the projection_refreshes table I get:

 

v_analytics_node0001 | public | 49539603694838412 | event_order_project_time_b0 | event | failed: Unavailable: [Txn 0xa0000004faffdd] X lock table - timeout error Timed out X locking Projection:public.event_order_proje | | | 5 | v_analytics_node000-501741:0x3b6d | 2000-01-01 00:00:00+00 | 514506376 | f | |
v_analytics_node0001 | public | 49539603694838670 | event_order_project_time_b1 | event | failed: Unavailable: [Txn 0xa0000004faffdd] X lock table - timeout error Timed out X locking Projection:public.event_order_proje | | | 5 | v_analytics_node000-501741:0x3b6d | 2000-01-01 00:00:00+00 | 514506376 | f | |

Notice how the timestamps are set to the year 2000?
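
(Those rows come straight from the projection_refreshes system table, queried with something like the statement below; the long error column is presumably just truncated by the column width.)

-- refresh status for the projections in question (the filter is illustrative)
SELECT *
FROM v_monitor.projection_refreshes
WHERE projection_name LIKE 'event_order_project_time%';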

 

Not really sure where to go from here. It's slightly frustrating, as each attempt takes nearly 24 hours and the cluster then has to be recovered every time.
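
The lock timeout in the status above suggests something else was holding a conflicting lock on the projection when the refresh tried to X-lock it; presumably that could be watched during the next attempt with something like:

-- show current lock holders and waiters across the cluster
SELECT * FROM v_monitor.locks;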
