
Node going down when loading data

I am trying to load data into a 3-node Vertica cluster. The load works initially, but after an hour or so one node goes down. There is nothing in vertica.log that shows the reason.

How can I debug this issue?

Logs of the down node:
=======
$ tail -50 /data/spdrdb/v_spdrdb_node0001_catalog/vertica.log
  FROM v_catalog.columns AS c
  LEFT OUTER JOIN v_catalog.primary_keys p ON c.table_schema=p.table_schema
              AND c.table_name=p.table_name AND c.column_name=p.column_name
  LEFT OUTER JOIN (SELECT table_schema, table_name, column_name,
                          quote_ident(MAX(reference_table_schema)) as rts,
                          quote_ident(MAX(reference_table_name)) as rtn,
                          quote_ident(MAX(reference_column_name)) as rcn,
                          count(*) as c
                    FROM v_catalog.foreign_keys
                WHERE table_schema ILIKE E'%' ESCAPE E'\\' AND table_name ILIKE E'SUBSCRIPTION\\_DIMENSION' ESCAPE E'\\'
                GROUP BY table_schema, table_name, column_name
                ) AS f
               ON c.table_schema=f.table_schema AND c.table_name=f.table_name
               AND c.column_name=f.column_name
WHERE c.table_schema ILIKE E'%' ESCAPE E'\\' AND c.table_name ILIKE E'SUBSCRIPTION\\_DIMENSION' ESCAPE E'\\'
ORDER BY c.table_schema, c.table_name, c.ordinal_position;'
2014-06-12 19:37:36.001 LowDiskSpaceCheck:0x7f3568014120-a000000038ec63 [Txn] <INFO> Begin Txn: a000000038ec63 'LowDiskSpaceCheck'
2014-06-12 19:37:36.001 LowDiskSpaceCheck:0x7f3568014120-a000000038ec63 [Txn] <INFO> Rollback Txn: a000000038ec63 'LowDiskSpaceCheck'
2014-06-12 19:37:36.002 LowDiskSpaceCheck:0x7f3568014120 [Util] <INFO> Task 'LowDiskSpaceCheck' enabled
2014-06-12 19:37:46.005 TM Moveout:0x7f3568014f60-a000000038ec64 [Txn] <INFO> Begin Txn: a000000038ec64 'Moveout: Tuple Mover'
2014-06-12 19:37:46.005 LGELaggingCheck:0x7f3568012b30-a000000038ec65 [Txn] <INFO> Begin Txn: a000000038ec65 'Recovery: Get last good epoch'
2014-06-12 19:37:46.006 TM Moveout:0x7f3568014f60-a000000038ec64 [Txn] <INFO> Rollback Txn: a000000038ec64 'Moveout: Tuple Mover'
2014-06-12 19:37:46.010 TM Moveout:0x7f3568014f60 [Util] <INFO> Task 'TM Moveout' enabled
2014-06-12 19:37:46.010 LowDiskSpaceCheck:0x7f35680131e0-a000000038ec68 [Txn] <INFO> Begin Txn: a000000038ec68 'LowDiskSpaceCheck'
2014-06-12 19:37:46.011 LowDiskSpaceCheck:0x7f35680131e0-a000000038ec68 [Txn] <INFO> Rollback Txn: a000000038ec68 'LowDiskSpaceCheck'
2014-06-12 19:37:46.011 LGELaggingCheck:0x7f3568012b30-a000000038ec65 [Txn] <INFO> Starting Commit: Txn: a000000038ec65 'Recovery: Get last good epoch'
2014-06-12 19:37:46.011 LGELaggingCheck:0x7f3568012b30 [Txn] <INFO> Commit Complete: Txn: a000000038ec65 at epoch 0x1a294
2014-06-12 19:37:46.011 AnalyzeRowCount:0x7f3568014cd0-a000000038ec69 [Txn] <INFO> Begin Txn: a000000038ec69 'getRowCountsForProj'
2014-06-12 19:37:46.011 LowDiskSpaceCheck:0x7f35680131e0 [Util] <INFO> Task 'LowDiskSpaceCheck' enabled
2014-06-12 19:37:46.012 LGELaggingCheck:0x7f3568012b30-a000000038ec6a [Txn] <INFO> Begin Txn: a000000038ec6a 'Check LGE'
2014-06-12 19:37:46.012 LGELaggingCheck:0x7f3568012b30-a000000038ec6a [Txn] <INFO> Rollback Txn: a000000038ec6a 'Check LGE'
2014-06-12 19:37:46.012 LGELaggingCheck:0x7f3568012b30 [Util] <INFO> Task 'LGELaggingCheck' enabled
2014-06-12 19:37:46.017 ManageEpochs:0x7f3568014940-a000000038ec67 [Txn] <INFO> Begin Txn: a000000038ec67 'Manage Epochs: Advance AHM'
2014-06-12 19:37:46.017 RebalanceCluster:0x7f35680147c0-a000000038ec66 [Txn] <INFO> Begin Txn: a000000038ec66 'rebalance_cluster(background)'
2014-06-12 19:37:46.018 RebalanceCluster:0x7f35680147c0-a000000038ec66 [Txn] <INFO> Rollback Txn: a000000038ec66 'rebalance_cluster(background)'
2014-06-12 19:37:46.021 RebalanceCluster:0x7f35680147c0 [Util] <INFO> Task 'RebalanceCluster' enabled
2014-06-12 19:37:46.022 ManageEpochs:0x7f3568014940-a000000038ec67 [Catalog] <INFO> Checking for missed alter partition events
2014-06-12 19:37:46.022 ManageEpochs:0x7f3568014940-a000000038ec67 [Catalog] <INFO> Found no missed alter partition events
2014-06-12 19:37:46.022 ManageEpochs:0x7f3568014940-a000000038ec67 [Catalog] <INFO> Checking for missed restore table events
2014-06-12 19:37:46.022 ManageEpochs:0x7f3568014940-a000000038ec67 [Catalog] <INFO> Found no missed restore table events
2014-06-12 19:37:46.025 ManageEpochs:0x7f3568014940-a000000038ec67 [Recover] <INFO> Get Local Node LGE: node current cpe is 0x1a293
2014-06-12 19:37:46.025 ManageEpochs:0x7f3568014940-a000000038ec67 [Recover] <INFO> Get Local Node LGE: projection spdr.SUBSCRIBER_DEVICE_MAPPING_DIMENSION_b1, CPE is 0x1a293
2014-06-12 19:37:46.025 ManageEpochs:0x7f3568014940-a000000038ec67 [Recover] <INFO> Get Local Node LGE: projection spdr.SUBSCRIBER_DEVICE_MAPPING_DIMENSION_b0, CPE is 0x1a293
2014-06-12 19:37:46.025 ManageEpochs:0x7f3568014940-a000000038ec67 [Recover] <INFO> Get Local Node LGE: projection spdr.DEVICE_DIMENSION_b1, CPE is 0x1a293
2014-06-12 19:37:46.025 ManageEpochs:0x7f3568014940-a000000038ec67 [Recover] <INFO> Get Local Node LGE: projection spdr.DEVICE_DIMENSION_b0, CPE is 0x1a293
2014-06-12 19:37:46.025 ManageEpochs:0x7f3568014940-a000000038ec67 [Recover] <INFO> Get Local Node LGE: projection spdr.SUBSCRIBER_DIMENSION_b1, CPE is 0x1a293
2014-06-12 19:37:46.025 ManageEpochs:0x7f3568014940-a000000038ec67 [Recover] <INFO> Get Local Node LGE: projection spdr.SUBSCRIBER_DIMENSION_b0, CPE is 0x1a293
2014-06-12 19:37:46.025 ManageEpochs:0x7f3568014940-a000000038ec67 [Recover] <INFO> Get Local Node LGE: projection spdr.ACCESS_DIMENSION_b1, CPE is 0x1a293
2014-06-12 19:37:46.025 ManageEpochs:0x7f3568014940-a000000038ec67 [Recover] <INFO> Get Local Node LGE: projection spdr.ACCESS_DIMENSION_b0, CPE is 0x1a293
2014-06-12 19:37:46.025 ManageEpochs:0x7f35
[dbadmin@spdrdal1 Dimension_Data]$ tail -100  /data/spdrdb/v_spdrdb_node0001_catalog/vertica.log
2014-06-12 19:42:46.017 RebalanceCluster:0x7f35680127c0 [Util] <INFO> Task 'RebalanceCluster' enabled
2014-06-12 19:42:46.018 LGELaggingCheck:0x7f35680147c0-a000000038ec96 [Txn] <INFO> Starting Commit: Txn: a000000038ec96 'Recovery: Get last good epoch'
2014-06-12 19:42:46.019 LGELaggingCheck:0x7f35680147c0 [Txn] <INFO> Commit Complete: Txn: a000000038ec96 at epoch 0x1a294
2014-06-12 19:42:46.019 LGELaggingCheck:0x7f35680147c0-a000000038ec99 [Txn] <INFO> Begin Txn: a000000038ec99 'Check LGE'
2014-06-12 19:42:46.020 LGELaggingCheck:0x7f35680147c0-a000000038ec99 [Txn] <INFO> Rollback Txn: a000000038ec99 'Check LGE'
2014-06-12 19:42:46.021 LGELaggingCheck:0x7f35680147c0 [Util] <INFO> Task 'LGELaggingCheck' enabled
2014-06-12 19:42:46.148 TM Mergeout(01):0x7f3568014940-a000000038ec93 [TM] <INFO> Tuple Mover: nothing to merge out
2014-06-12 19:42:46.157 TM Mergeout(01):0x7f3568014940-a000000038ec93 [TM] <INFO> Tuple Mover: no DV to merge out
2014-06-12 19:42:46.157 TM Mergeout(01):0x7f3568014940-a000000038ec93 [Txn] <INFO> Rollback Txn: a000000038ec93 'Mergeout: Tuple Mover'
2014-06-12 19:42:46.159 TM Mergeout(01):0x7f3568014940 [Util] <INFO> Task 'TM Mergeout(01)' enabled
2014-06-12 19:42:46.173 AnalyzeRowCount:0x7f3568012f60-a000000038ec98 [Txn] <INFO> Rollback Txn: a000000038ec98 'getRowCountsForProj'
2014-06-12 19:42:46.175 TM Mergeout(00):0x7f35680150a0-a000000038ec95 [TM] <INFO> Tuple Mover: nothing to merge out
2014-06-12 19:42:46.178 TM Mergeout(00):0x7f35680150a0-a000000038ec95 [TM] <INFO> Tuple Mover: no DV to merge out
2014-06-12 19:42:46.178 TM Mergeout(00):0x7f35680150a0-a000000038ec95 [Txn] <INFO> Rollback Txn: a000000038ec95 'Mergeout: Tuple Mover'
2014-06-12 19:42:46.180 TM Mergeout(00):0x7f35680150a0-a000000038ec9a [Txn] <INFO> Begin Txn: a000000038ec9a 'collectMoveStorageJobs'
2014-06-12 19:42:46.186 TM Mergeout(00):0x7f35680150a0-a000000038ec9a [Txn] <INFO> Rollback Txn: a000000038ec9a 'collectMoveStorageJobs'
2014-06-12 19:42:46.187 TM Mergeout(00):0x7f35680150a0 [Util] <INFO> Task 'TM Mergeout(00)' enabled
2014-06-12 19:42:46.196 AnalyzeRowCount:0x7f3568012f60-a000000038ec9b [Txn] <INFO> Begin Txn: a000000038ec9b 'do_analyze_row_count'
2014-06-12 19:42:46.201 AnalyzeRowCount:0x7f3568012f60-a000000038ec9b [Command] <INFO> TMTask: row count analyze - start
2014-06-12 19:42:46.204 AnalyzeRowCount:0x7f3568012f60-a000000038ec9b [Command] <INFO> TMTask: row count analyze - Done
2014-06-12 19:42:46.205 AnalyzeRowCount:0x7f3568012f60-a000000038ec9b [Txn] <INFO> Rollback Txn: a000000038ec9b 'do_analyze_row_count'
2014-06-12 19:42:46.207 AnalyzeRowCount:0x7f3568012f60 [Util] <INFO> Task 'AnalyzeRowCount' enabled
2014-06-12 19:42:56.001 LowDiskSpaceCheck:0x7f35680147c0-a000000038ec9c [Txn] <INFO> Begin Txn: a000000038ec9c 'LowDiskSpaceCheck'
2014-06-12 19:42:56.001 LowDiskSpaceCheck:0x7f35680147c0-a000000038ec9c [Txn] <INFO> Rollback Txn: a000000038ec9c 'LowDiskSpaceCheck'
2014-06-12 19:42:56.002 LowDiskSpaceCheck:0x7f35680147c0 [Util] <INFO> Task 'LowDiskSpaceCheck' enabled
2014-06-12 19:43:06.001 LowDiskSpaceCheck:0x7f3568014cd0-a000000038ec9d [Txn] <INFO> Begin Txn: a000000038ec9d 'LowDiskSpaceCheck'
2014-06-12 19:43:06.001 LowDiskSpaceCheck:0x7f3568014cd0-a000000038ec9d [Txn] <INFO> Rollback Txn: a000000038ec9d 'LowDiskSpaceCheck'
2014-06-12 19:43:06.001 LowDiskSpaceCheck:0x7f3568014cd0 [Util] <INFO> Task 'LowDiskSpaceCheck' enabled
2014-06-12 19:43:16.002 LowDiskSpaceCheck:0x7f3568014cd0-a000000038ec9e [Txn] <INFO> Begin Txn: a000000038ec9e 'LowDiskSpaceCheck'
2014-06-12 19:43:16.002 LowDiskSpaceCheck:0x7f3568014cd0-a000000038ec9e [Txn] <INFO> Rollback Txn: a000000038ec9e 'LowDiskSpaceCheck'
2014-06-12 19:43:16.002 LowDiskSpaceCheck:0x7f3568014cd0 [Util] <INFO> Task 'LowDiskSpaceCheck' enabled
2014-06-12 19:43:26.001 LowDiskSpaceCheck:0x7f35680147c0-a000000038ec9f [Txn] <INFO> Begin Txn: a000000038ec9f 'LowDiskSpaceCheck'
2014-06-12 19:43:26.001 LowDiskSpaceCheck:0x7f35680147c0-a000000038ec9f [Txn] <INFO> Rollback Txn: a000000038ec9f 'LowDiskSpaceCheck'
2014-06-12 19:43:26.002 LowDiskSpaceCheck:0x7f35680147c0 [Util] <INFO> Task 'LowDiskSpaceCheck' enabled
2014-06-12 19:43:36.001 LowDiskSpaceCheck:0x7f3568014cd0-a000000038eca0 [Txn] <INFO> Begin Txn: a000000038eca0 'LowDiskSpaceCheck'
2014-06-12 19:43:36.001 LowDiskSpaceCheck:0x7f3568014cd0-a000000038eca0 [Txn] <INFO> Rollback Txn: a000000038eca0 'LowDiskSpaceCheck'
2014-06-12 19:43:36.001 LowDiskSpaceCheck:0x7f3568014cd0 [Util] <INFO> Task 'LowDiskSpaceCheck' enabled
2014-06-12 19:43:46.004 LowDiskSpaceCheck:0x7f35680127c0-a000000038eca2 [Txn] <INFO> Begin Txn: a000000038eca2 'LowDiskSpaceCheck'
2014-06-12 19:43:46.005 LowDiskSpaceCheck:0x7f35680127c0-a000000038eca2 [Txn] <INFO> Rollback Txn: a000000038eca2 'LowDiskSpaceCheck'
2014-06-12 19:43:46.005 LowDiskSpaceCheck:0x7f35680127c0 [Util] <INFO> Task 'LowDiskSpaceCheck' enabled
2014-06-12 19:43:46.006 AnalyzeRowCount:0x7f35680147c0-a000000038eca3 [Txn] <INFO> Begin Txn: a000000038eca3 'getRowCountsForProj'
2014-06-12 19:43:46.007 ManageEpochs:0x7f3568014120-a000000038eca1 [Txn] <INFO> Begin Txn: a000000038eca1 'Manage Epochs: Advance AHM'
2014-06-12 19:43:46.012 ManageEpochs:0x7f3568014120-a000000038eca1 [Catalog] <INFO> Checking for missed alter partition events
2014-06-12 19:43:46.012 ManageEpochs:0x7f3568014120-a000000038eca1 [Catalog] <INFO> Found no missed alter partition events
2014-06-12 19:43:46.012 ManageEpochs:0x7f3568014120-a000000038eca1 [Catalog] <INFO> Checking for missed restore table events
2014-06-12 19:43:46.012 ManageEpochs:0x7f3568014120-a000000038eca1 [Catalog] <INFO> Found no missed restore table events
2014-06-12 19:43:46.017 ManageEpochs:0x7f3568014120-a000000038eca1 [Recover] <INFO> Get Local Node LGE: node current cpe is 0x1a293
2014-06-12 19:43:46.017 ManageEpochs:0x7f3568014120-a000000038eca1 [Recover] <INFO> Get Local Node LGE: projection spdr.SUBSCRIBER_DEVICE_MAPPING_DIMENSION_b1, CPE is 0x1a293
2014-06-12 19:43:46.017 ManageEpochs:0x7f3568014120-a000000038eca1 [Recover] <INFO> Get Local Node LGE: projection spdr.SUBSCRIBER_DEVICE_MAPPING_DIMENSION_b0, CPE is 0x1a293
2014-06-12 19:43:46.017 ManageEpochs:0x7f3568014120-a000000038eca1 [Recover] <INFO> Get Local Node LGE: projection spdr.DEVICE_DIMENSION_b1, CPE is 0x1a293
2014-06-12 19:43:46.017 ManageEpochs:0x7f3568014120-a000000038eca1 [Recover] <INFO> Get Local Node LGE: projection spdr.DEVICE_DIMENSION_b0, CPE is 0x1a293
2014-06-12 19:43:46.017 ManageEpochs:0x7f3568014120-a000000038eca1 [Recover] <INFO> Get Local Node LGE: projection spdr.SUBSCRIBER_DIMENSION_b1, CPE is 0x1a293
2014-06-12 19:43:46.017 ManageEpochs:0x7f3568014120-a000000038eca1 [Recover] <INFO> Get Local Node LGE: projection spdr.SUBSCRIBER_DIMENSION_b0, CPE is 0x1a293
2014-06-12 19:43:46.017 ManageEpochs:0x7f3568014120-a000000038eca1 [Recover] <INFO> Get Local Node LGE: projection spdr.ACCESS_DIMENSION_b1, CPE is 0x1a293
2014-06-12 19:43:46.017 ManageEpochs:0x7f3568014120-a000000038eca1 [Recover] <INFO> Get Local Node LGE: projection spdr.ACCESS_DIMENSION_b0, CPE is 0x1a293
2014-06-12 19:43:46.017 ManageEpochs:0x7f3568014120-a000000038eca1 [Recover] <INFO> Get Local Node LGE: projection spdr.DATE_DIMENSION_b1, CPE is 0x1a293
2014-06-12 19:43:46.017 ManageEpochs:0x7f3568014120-a000000038eca1 [Recover] <INFO> Get Local Node LGE: projection spdr.DATE_DIMENSION_b0, CPE is 0x1a293
2014-06-12 19:43:46.017 ManageEpochs:0x7f3568014120-a000000038eca1 [Recover] <INFO> Get Local Node LGE: projection spdr.TIME_DIMENSION_b1, CPE is 0x1a293
2014-06-12 19:43:46.017 ManageEpochs:0x7f3568014120-a000000038eca1 [Recover] <INFO> Get Local Node LGE: projection spdr.TIME_DIMENSION_b0, CPE is 0x1a293
2014-06-12 19:43:46.017 ManageEpochs:0x7f3568014120-a000000038eca1 [Recover] <INFO> Get Local Node LGE: projection spdr.SPEED_CLASSIFICATION_DIMENSION_b1, CPE is 0x1a293
2014-06-12 19:43:46.017 ManageEpochs:0x7f3568014120-a000000038eca1 [Recover] <INFO> Get Local Node LGE: projection spdr.SPEED_CLASSIFICATION_DIMENSION_b0, CPE is 0x1a293
2014-06-12 19:43:46.017 ManageEpochs:0x7f3568014120-a000000038eca1 [Recover] <INFO> Get Local Node LGE: projection spdr.PROGRAM_RATING_DIMENSION_b1, CPE is 0x1a293
2014-06-12 19:43:46.017 ManageEpochs:0x7f3568014120-a000000038eca1 [Recover] <INFO> Get Local Node LGE: projection spdr.PROGRAM_RATING_DIMENSION_b0, CPE is 0x1a293
2014-06-12 19:43:46.017 ManageEpochs:0x7f3568014120-a000000038eca1 [VMPI] <INFO> GetClusterLGE: My local node LGE = 0x1a293
2014-06-12 19:43:46.023 ManageEpochs:0x7f3568014120-a000000038eca1 [Txn] <INFO> Rollback Txn: a000000038eca1 'Manage Epochs: Advance AHM'
2014-06-12 19:43:46.026 ManageEpochs:0x7f3568014120 [Util] <INFO> Task 'ManageEpochs' enabled
2014-06-12 19:43:46.057 AnalyzeRowCount:0x7f35680147c0-a000000038eca3 [Txn] <INFO> Rollback Txn: a000000038eca3 'getRowCountsForProj'
2014-06-12 19:43:46.083 AnalyzeRowCount:0x7f35680147c0-a000000038eca4 [Txn] <INFO> Begin Txn: a000000038eca4 'do_analyze_row_count'
2014-06-12 19:43:46.087 AnalyzeRowCount:0x7f35680147c0-a000000038eca4 [Command] <INFO> TMTask: row count analyze - start
2014-06-12 19:43:46.091 AnalyzeRowCount:0x7f35680147c0-a000000038eca4 [Command] <INFO> TMTask: row count analyze - Done
2014-06-12 19:43:46.091 AnalyzeRowCount:0x7f35680147c0-a000000038eca4 [Txn] <INFO> Rollback Txn: a000000038eca4 'do_analyze_row_count'
2014-06-12 19:43:46.093 AnalyzeRowCount:0x7f35680147c0 [Util] <INFO> Task 'AnalyzeRowCount' enabled
2014-06-12 19:43:56.001 LowDiskSpaceCheck:0x7f3568014120-a000000038eca5 [Txn] <INFO> Begin Txn: a000000038eca5 'LowDiskSpaceCheck'
2014-06-12 19:43:56.001 LowDiskSpaceCheck:0x7f3568014120-a000000038eca5 [Txn] <INFO> Rollback Txn: a000000038eca5 'LowDiskSpaceCheck'
2014-06-12 19:43:56.001 LowDiskSpaceCheck:0x7f3568014120 [Util] <INFO> Task 'LowDiskSpaceCheck' enabled
2014-06-12 19:44:06.001 LowDiskSpaceCheck:0x7f35680150a0-a000000038eca6 [Txn] <INFO> Begin Txn: a000000038eca6 'LowDiskSpaceCheck'
2014-06-12 19:44:06.001 LowDiskSpaceCheck:0x7f35680150a0-a000000038eca6 [Txn] <INFO> Rollback Txn: a000000038eca6 'LowDiskSpaceCheck'
2014-06-12 19:44:06.002 LowDiskSpaceCheck:0x7f35680150a0 [Util] <INFO> Task 'LowDiskSpaceCheck' enabled
2014-06-12 19:44:16.002 LowDiskSpaceCheck:0x7f3568014120-a000000038eca7 [Txn] <INFO> Begin Txn: a000000038eca7 'LowDiskSpaceCheck'
2014-06-12 19:44:16.002 LowDiskSpaceCheck:0x7f3568014120-a000000038eca7 [Txn] <INFO> Rollback Txn: a000000038eca7 'LowDiskSpaceCheck'
2014-06-12 19:44:16.003 LowDiskSpaceCheck:0x7f3568014120 [Util] <INFO> Task 'LowDiskSpaceCheck' enabled
2014-06-12 19:44:26.001 LowDiskSpaceCheck:0x7f3568012b30-a000000038eca8 [Txn] <INFO> Begin Txn: a000000038eca8 'LowDiskSpaceCheck'
2014-06-12 19:44:26.001 LowDiskSpaceCheck:0x7f3568012b30-a000000038eca8 [Txn] <INFO> Rollback Txn: a000000038eca8 'LowDiskSpaceCheck'
2014-06-12 19:44:26.002 LowDiskSpaceCheck:0x7f3568012b30 [Util] <INFO> Task 'LowDiskSpaceCheck' enabled
2014-06-12 19:44:36.001 LowDiskSpaceCheck:0x7f35680150a0-a000000038eca9 [Txn] <INFO> Begin Txn: a000000038eca9 'LowDiskSpaceCheck'
2014-06-12 19:44:36.001 LowDiskSpaceCheck:0x7f35680150a0-a000000038eca9 [Txn] <INFO> Rollback Txn: a000000038eca9 'LowDiskSpaceCheck'
2014-06-12 19:44:36.001 LowDiskSpaceCheck:0x7f35680150a0 [Util] <INFO> Task 'LowDiskSpaceCheck' enabled
2014-06-12 19:44:46.002 LowDiskSpaceCheck:0x7f35680127c0-a000000038ecaa [Txn] <INFO> Begin Txn: a000000038ecaa 'LowDiskSpaceCheck'
2014-06-12 19:44:46.003 LowDiskSpaceCheck:0x7f35680127c0-a000000038ecaa [Txn] <INFO> Rollback Txn: a000000038ecaa 'LowDiskSpaceCheck'
2014-06-12 19:44:46.003 LowDiskSpaceCheck:0x7f35680127c0 [Util] <INFO> Task 'LowDiskSpaceCheck' enabled
2014-06-12 19:44:46.005 AnalyzeRowCount:0x7f3568014940-a000000038ecab [Txn] <INFO> Begin Txn: a000000038ecab 'getRowCountsForProj'
2014-06-12 19:44:46.047 AnalyzeRowCount:0x7f3568014940-a000000038ecab [Txn] <INFO> Rollback Txn: a000000038ecab 'getRowCountsForProj'
2014-06-12 19:44:46.072 AnalyzeRowCount:0x7f3568014940-a000000038ecac [Txn] <INFO> Begin Txn: a000000038ecac 'do_analyze_row_count'
2014-06-12 19:44:46.076 AnalyzeRowCount:0x7f3568014940-a000000038ecac [Command] <INFO> TMTask: row count analyze - start
2014-06-12 19:44:46.080 AnalyzeRowCount:0x7f3568014940-a000000038ecac [Command] <INFO> TMTask: row count analyze - Done
2014-06-12 19:44:46.080 AnalyzeRowCount:0x7f3568014940-a000000038ecac [Txn] <INFO> Rollback Txn: a000000038ecac 'do_analyze_row_count'
2014-06-12 19:44:46.082 AnalyzeRowCount:0x7f3568014940 [Util] <INFO> Task 'AnalyzeRowCount' enabled
2014-06-12 19:44:56.001 LowDiskSpaceCheck:0x7f35680147c0-a000000038ecad [Txn] <INFO> Begin Txn: a000000038ecad 'LowDiskSpaceCheck'
2014-06-12 19:44:56.001 LowDiskSpaceCheck:0x7f35680147c0-a000000038ecad [Txn] <INFO> Rollback Txn: a000000038ecad 'LowDiskSpaceCheck'
2014-06-12 19:44:56.002 LowDiskSpaceCheck:0x7f35680147c0 [Util] <INFO> Task 'LowDiskSpaceCheck' enabled

Comments

  • Hi Siddarth,
    Does the log you provided correspond to the time when the node goes down? I don't see any shutdown signal or fatal errors. Instead of the last 50 lines, can you provide the log from when the database goes down?

    You can look for potential operating system/hardware issues in /var/log/messages (or similar, depending on your OS). Does the same node go down every time? Also take a look at the space available on the data/catalog drive.

    /Sajan
  • Hi Sajan,

    Yes, these are the logs we got after our node went down. I have checked all the Vertica logs on every node and also the system logs under /var/log, but there is nothing in them. I also checked the kernel logs and dmesg, but to no avail.

    The node on which we run our load script is the one that goes down. I also think it is more of a hardware/OS issue than a Vertica one, but I am not sure how to confirm it.

    Best Regards,
    Siddarth
  • One way to isolate the problem would be to run the load script on a non-failing node and see if that node goes down too. If the load script brings down only one node of the cluster but works well on the others, then you can focus on that node's hardware and OS.

    Have you confirmed you have enough disk space in the data drives?


  • Actually the load script is failing on every node. We have 3.8 TB of space, so that's not an issue.
  • A couple of follow-up questions to isolate the problem:
    Can you tail -f /path/to/vertica.log while the script is running and post the log entries from the time when the node fails?
    When you say the node goes down, is the cluster still up and running (assuming K=1) with the node status shown as DOWN in admintools and the nodes system table? Or is the whole cluster unresponsive?


  • The logs that I have pasted above are the last entries before the node went down, and they contain no useful information whatsoever.
    The cluster remains up and running, and only the affected node shows as DOWN in admintools and in select * from nodes;. This is a bit strange for me as well; I haven't seen such a problem in the last 3 years.
  • Hi Siddarth,

    Are you running the ETL script directly on the node in question, or are you running it remotely?

    If you are running it directly on the node, how much memory does the ETL script itself consume? Vertica assumes (unless configured differently via resource pools) that it has access to all resources on the system; it doesn't try to account for external scripts or programs that may grow over time. If Vertica is using a lot of memory, and the script is using a lot of memory too, then the machine may run out of RAM. When that happens, because Vertica is typically the biggest process on the machine, Linux will often kill it to free up memory. We have no control over this, and it is an immediate hard kill; we are unable even to write a log message indicating that it is happening. The fix is to configure Vertica to leave sufficient memory for the ETL script, and/or to modify the ETL script to use less memory.

    If you're running the ETL script remotely, what version of Vertica are you running? There was a bug some time ago in our remote-data-load logic that could cause long-running loads to error out. It won't typically cause nodes to crash, but it's something to consider, at least. The issue was, of course, promptly hotfixed; if you're running the latest hotfix for your version of Vertica, you should be fine here.

    Adam
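Since the OOM killer logs its kills to the kernel ring buffer even though the victim process cannot write anything, the scenario Adam describes can usually be confirmed from dmesg or /var/log/messages after the fact. A minimal Python sketch of that check (the sample text and the default process name `vertica` are illustrative assumptions, not output from this cluster):

```python
import re

def find_oom_kills(kernel_log: str, process: str = "vertica"):
    """Return (pid, name) pairs for OOM-killer victims matching `process`.

    Matches the classic kernel message formats, e.g.
    "Out of memory: Kill process 1234 (vertica) ..." and
    "Killed process 1234 (vertica) ...".
    """
    pattern = re.compile(r"Kill(?:ed)? process (\d+) \(([^)]+)\)")
    hits = []
    for line in kernel_log.splitlines():
        m = pattern.search(line)
        if m and process in m.group(2):
            hits.append((int(m.group(1)), m.group(2)))
    return hits

# Example: feed it `dmesg` output captured after the node went down.
sample = (
    "[1234.5] Out of memory: Kill process 4321 (vertica) score 907\n"
    "[1234.6] Killed process 4321 (vertica) total-vm:130000000kB\n"
)
print(find_oom_kills(sample))  # [(4321, 'vertica'), (4321, 'vertica')]
```

An equivalent quick check is simply grepping the kernel log for "Out of memory" around the time the node went down.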
  • Hi Adam,

    We are using the latest version of Vertica with the latest hotfix. We are running our script directly on the node. Every node has around 128 GB of memory with 32 GB of swap space.

    When we run our script it does not take more than 12 GB of memory; however, the CPU utilization is close to 200%.

    I am also thinking that Linux is killing it, since there is nothing in the logs when the node goes down.

    We are planning to run a few tests tomorrow so that we can isolate the scenario that is causing this issue.

    Can you please give us some pointers on what to monitor when we run our job (memory, CPU, disk space)?

    Regards,
    Siddarth
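For the memory side of that monitoring, one lightweight approach is to poll /proc/meminfo during the load and flag when headroom drops below a floor. A minimal Python sketch (the 8 GiB floor is an arbitrary assumption for illustration, not a Vertica recommendation):

```python
def meminfo_gb(text: str) -> dict:
    """Parse /proc/meminfo-style text into {field: GiB}, assuming kB units."""
    out = {}
    for line in text.splitlines():
        if ":" in line:
            key, _, rest = line.partition(":")
            parts = rest.split()
            if parts and parts[0].isdigit():
                out[key.strip()] = int(parts[0]) / (1024 * 1024)  # kB -> GiB
    return out

def low_memory(text: str, floor_gb: float = 8.0) -> bool:
    """True if MemFree + Cached drops below floor_gb (a rough headroom check)."""
    info = meminfo_gb(text)
    free = info.get("MemFree", 0.0) + info.get("Cached", 0.0)
    return free < floor_gb

# Poll this (e.g. every few seconds) while the load script runs:
# with open("/proc/meminfo") as f:
#     print(low_memory(f.read()))
```

CPU and disk can be watched alongside with the usual tools (top/vmstat/iostat, df on the data and catalog drives); a sudden collapse in free memory just before the node drops would point at the OOM killer.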
  • Hi,

    I see two options for debugging this problem:
    1) Use Vertica's internal debug facilities, for example:

            SELECT SET_DEBUG_LOG('EE', 'ALL');  <- Enable  debug mode for the execution engine .
            SELECT CLEAR_DEBUG_LOG('EE', 'ALL''RESOURCEMANAGER', 'ALL'); <- Disable debug mode for the execution engine  .

    2) Use the Linux “strace” command to trace the running Vertica process, for example: strace -f -p <vertica process id>.


    Please note that option #1 is undocumented and is mainly used by Vertica support.

    Thanks.

  • Typo in the clear command; it should be SELECT CLEAR_DEBUG_LOG('EE', 'ALL');
  • In the directory above the catalog directory, you will find a log file called 'dbLog'. If vertica.log just stops, this log sometimes has interesting things.
  • Thanks all, the Vertica process was indeed being killed by the Linux kernel.
    However, I just wanted to know how to avoid this scenario. I tried configuring a resource pool for the user loading the data, without any success.

    File size: 144 GB
    Split file size: 7.7 GB
    RAM: 132 GB

    When I tried loading these 7.7 GB files in a loop, the node still went down after loading 1, 2, or sometimes 3 files.
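One defensive variant of such a per-file loading loop is to check memory headroom before issuing each COPY and wait if it is tight. A hedged Python sketch, in which the table name, delimiter, threshold, and vsql invocation are all hypothetical placeholders:

```python
import subprocess
import time

def has_headroom(meminfo_text: str, floor_kb: int) -> bool:
    """True if the MemFree line in /proc/meminfo-style text is >= floor_kb."""
    for line in meminfo_text.splitlines():
        if line.startswith("MemFree:"):
            return int(line.split()[1]) >= floor_kb
    return False

def load_files(files, floor_kb=16 * 1024 * 1024, poll_s=30):
    """Load each split file via vsql COPY, waiting for memory headroom first.

    Hypothetical sketch: table, delimiter, and the 16 GiB floor are guesses,
    not values taken from this thread.
    """
    for path in files:
        while True:
            with open("/proc/meminfo") as f:
                if has_headroom(f.read(), floor_kb):
                    break
            time.sleep(poll_s)  # let moveout/mergeout and caches settle
        subprocess.run(
            ["vsql", "-c",
             f"COPY spdr.SUBSCRIPTION_DIMENSION FROM LOCAL '{path}' "
             "DELIMITER '|' DIRECT;"],
            check=True,
        )
```

This only delays loads when the OS is already short on memory; the underlying cap still has to come from resource pool settings or from whatever is consuming the RAM.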
  • Hm... That is odd.

    Does your machine (BIOS and/or kernel) have NUMA support enabled? NUMA is a great idea in theory, but in practice we have found that the implementation in some kernels has issues where the machine behaves as if it has less available RAM than it actually does.

    Also, does your machine have any ulimits, cgroup limits, or other kernel constraints on the resources allocated to the Vertica process? Vertica enforces its own limits via resource pools. The only way the kernel knows how to enforce its limits is (in many scenarios) by killing the offending process; so if limits are set in the kernel rather than in Vertica, this can be a symptom.

    Relatedly, of course, please do make sure that you don't have some other large ETL process running on the system and consuming system resources.  It sounds like that's not the cause here, but it still might be if there's some buggy process that occasionally and very suddenly consumes a large amount of memory or other shared, limited resources.

    Other than that, you could try asking a Linux user group or mailing list? It sounds like Linux behavior is the cause. If you figure it out, it would be great if you could post the resolution here.

    Thanks,
    Adam
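The ulimit side of that checklist can be inspected programmatically from the dbadmin account. A small Python sketch using the standard resource module; the cgroup path in the comment assumes a cgroup v1 layout and is only an example:

```python
import resource

def rlimit_report():
    """Report soft/hard limits most likely to starve or kill a large process."""
    names = {
        "RLIMIT_AS": "virtual memory",
        "RLIMIT_DATA": "data segment",
        "RLIMIT_NOFILE": "open files",
        "RLIMIT_NPROC": "processes/threads",
    }
    report = {}
    for attr, desc in names.items():
        limit = getattr(resource, attr, None)  # not all limits exist on all OSes
        if limit is not None:
            soft, hard = resource.getrlimit(limit)
            report[desc] = (
                "unlimited" if soft == resource.RLIM_INFINITY else soft,
                "unlimited" if hard == resource.RLIM_INFINITY else hard,
            )
    return report

# Run as the user that starts Vertica; also check any cgroup caps, e.g.
# /sys/fs/cgroup/memory/<group>/memory.limit_in_bytes on cgroup v1 hosts.
for desc, (soft, hard) in rlimit_report().items():
    print(f"{desc}: soft={soft} hard={hard}")
```

Any hard cap here well below the machine's 128 GB would be a candidate for the kernel-enforced kills described above.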
  • I am having the same issue. Siddarth, were you ever able to identify the root cause?
