Vertica crashes on AWS when partitioning a table
I need help debugging an issue with Vertica on AWS. We have set up a 3-node cluster on AWS, using t2.large instances for all nodes. When we try to partition a table, the Vertica database shuts down abruptly. No significant messages are recorded in /var/log/messages.
Contents of /var/log/messages around the time of the shutdown, per node:
Node-1
Nov 9 17:21:04 ip-10.0.0.1 su: (to dbadmin) ec2-user on pts/1
Nov 9 17:21:45 ip-10.0.0.1 dhclient[3217]: XMT: Solicit on eth0, interval 108330ms.
Nov 9 17:23:34 ip-10.0.0.1 dhclient[3217]: XMT: Solicit on eth0, interval 120900ms.
Nov 9 17:25:35 ip-10.0.0.1 dhclient[3217]: XMT: Solicit on eth0, interval 113820ms.
Nov 9 17:27:29 ip-10.0.0.1 dhclient[3217]: XMT: Solicit on eth0, interval 117800ms.
Nov 9 17:29:27 ip-10.0.0.1 dhclient[3217]: XMT: Solicit on eth0, interval 119340ms.
Nov 9 17:30:01 ip- 10.0.0.1 systemd: Created slice User Slice of root.
Node-2
Nov 9 17:20:01 ip-10.0.0.2 systemd: Created slice User Slice of root.
Nov 9 17:20:01 ip-10.0.0.2 systemd: Starting User Slice of root.
Nov 9 17:20:01 ip-10.0.0.2 systemd: Started Session 51 of user root.
Nov 9 17:20:01 ip-10.0.0.2 systemd: Starting Session 51 of user root.
Nov 9 17:20:01 ip-10.0.0.2 systemd: Removed slice User Slice of root.
Nov 9 17:20:01 ip-10.0.0.2 systemd: Stopping User Slice of root.
Nov 9 17:20:20 ip-10.0.0.2 dhclient[3189]: XMT: Solicit on eth0, interval 110860ms.
Nov 9 17:22:03 ip-10.0.0.2 systemd-logind: Removed session 40.
Node-3
Nov 9 17:20:01 ip-10.0.0.3 systemd: Stopping User Slice of root.
Nov 9 17:21:14 ip-10.0.0.3 systemd-logind: Removed session 40.
Nov 9 17:21:14 ip-10.0.0.3 systemd: Removed slice User Slice of dbadmin.
Nov 9 17:21:14 ip-10.0.0.3 systemd: Stopping User Slice of dbadmin.
Node                Instance Type   Disk Space   Swap Size   ulimit -n
Node-1 (Initiator)  t2.large        130 GB       10 GB       65536
Node-2              t2.large        30 GB        10 GB       65536
Node-3              t2.large        30 GB        10 GB       65536
The failure is the same with 1 GB and with 10 GB of data, and it occurs when partitioning any table.
Records from the DC_ERRORS table for that session:
SELECT event_timestamp, node_name, user_name, session_id, error_level, error_code, message, hint
FROM error_messages
WHERE session_id = 'v_tpcds_db2_node0001-31783:0x40b';
event_timestamp | node_name | user_name | session_id | error_level | error_code | message | hint
09-11-2020 17:21 | v_tpcds_db2_node0001 | dbadmin | v_tpcds_db2_node0001-31783:0x40b | NOTICE | 0 | The new partitioning scheme will produce partitions in 72 physical storage containers per projection |
09-11-2020 17:21 | v_tpcds_db2_node0001 | dbadmin | v_tpcds_db2_node0001-31783:0x40b | WARNING | 64 | Queries using table "web_returns" may not perform optimally since the data may not be repartitioned in accordance with the new partition expression | Use "ALTER TABLE tpc1gb.web_returns REORGANIZE;" to repartition the data.
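For readers unfamiliar with the GROUP BY clause of the ALTER TABLE ... PARTITION BY statement recorded in vertica.log: it maps each daily partition key to the first day of its month, so all days of one month fall into the same partition group. The vertica.log excerpt also shows an internal COUNT(DISTINCT date(date_trunc('MONTH', ...))) query running just before the NOTICE. The Python sketch below mimics that mapping and count. The sample dates are hypothetical, and whether the reported figure of 72 containers equals this distinct-month count is an assumption, not something the logs confirm.

```python
from datetime import date

def month_group(d):
    """Mimic DATE(DATE_TRUNC('MONTH', d)): return the first day of d's month."""
    return d.replace(day=1)

def count_partition_groups(dates):
    """Count distinct month groups, skipping NULLs (None) as COUNT(DISTINCT ...) does."""
    return len({month_group(d) for d in dates if d is not None})

# Hypothetical wr_d_date values, for illustration only.
sample_dates = [
    date(2020, 1, 5),
    date(2020, 1, 28),   # same month as the row above, so same group
    date(2020, 2, 1),
    None,                # NULL dates are not counted
    date(2021, 2, 1),    # same month number, different year, so a new group
]

group_count = count_partition_groups(sample_dates)
print(group_count)
```

With real data, the same count can be obtained in SQL by running the internal query shown in the log against the table.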
I will add the messages from the vertica.log file on Node-1, captured during the table partitioning, in a comment below.
Answers
vertica.log on Node-1 (Initiator):
2020-11-09 17:21:13.791 Init Session:0x7fc7d99ff700 [Session] [Query] TX:0(v_tpcds_db2_node0001-31783:0x40b) ALTER TABLE tpc1gb.web_returns
PARTITION BY ((wr_d_date)::date)
GROUP BY DATE(DATE_TRUNC('MONTH',wr_d_date));
2020-11-09 17:21:13.791 Init Session:0x7fc7d99ff700-a00000000005fa [Txn] Begin Txn: a00000000005fa 'ALTER TABLE tpc1gb.web_returns
PARTITION BY ((wr_d_date)::date)
GROUP BY DATE(DATE_TRUNC('MONTH',wr_d_date));'
2020-11-09 17:21:13.795 Init Session:0x7fc7d99ff700-a00000000005fa [Txn] Starting Commit: Txn: a00000000005fa 'ALTER TABLE tpc1gb.web_returns
PARTITION BY ((wr_d_date)::date)
GROUP BY DATE(DATE_TRUNC('MONTH',wr_d_date));' 876
2020-11-09 17:21:13.796 Init Session:0x7fc7d99ff700 [Txn] Commit Complete: Txn: a00000000005fa at epoch 0x4f and global catalog version 876
2020-11-09 17:21:13.798 Init Session:0x7fc7d99ff700-a00000000005fb [Txn] Begin Txn: a00000000005fb 'ALTER TABLE tpc1gb.web_returns
PARTITION BY ((wr_d_date)::date)
GROUP BY DATE(DATE_TRUNC('MONTH',wr_d_date));'
2020-11-09 17:21:13.804 InternalStmt:0x7fc6e712f700-a00000000005fc [Txn] Begin Txn: a00000000005fc 'parseOptimizerDirectives'
2020-11-09 17:21:13.804 InternalStmt:0x7fc6e712f700-a00000000005fc [Txn] Rollback Txn: a00000000005fc 'parseOptimizerDirectives'
2020-11-09 17:21:13.805 InternalStmt:0x7fc6e712f700 [Session] InternalStatement subsession v_tpcds_db2_node0001-31783:0x40d inherited parent session v_tpcds_db2_node0001-31783:0x40b
2020-11-09 17:21:13.805 InternalStmt:0x7fc6e712f700-a00000000005fd [Txn] Begin Txn: a00000000005fd 'SELECT COUNT (CASE WHEN wr_d_date IS NULL THEN 1 ELSE NULL END) AS wr_d_date, COUNT (DISTINCT date(date_trunc('MONTH', web_returns.wr_d_date))) FROM tpc1gb.web_returns;'
2020-11-09 17:21:13.868 InternalStmt:0x7fc6e712f700-a00000000005fd [Txn] Starting Commit: Txn: a00000000005fd 'SELECT COUNT (CASE WHEN wr_d_date IS NULL THEN 1 ELSE NULL END) AS wr_d_date, COUNT (DISTINCT date(date_trunc('MONTH', web_returns.wr_d_date))) FROM tpc1gb.web_returns;' 876
2020-11-09 17:21:13.868 InternalStmt:0x7fc6e712f700 [Txn] Commit Complete: Txn: a00000000005fd at epoch 0x4f and global catalog version 876
2020-11-09 17:21:13.869 Init Session:0x7fc7d99ff700-a00000000005fb @v_tpcds_db2_node0001: 00000/8364: The new partitioning scheme will produce partitions in 72 physical storage containers per projection
2020-11-09 17:21:13.876 Init Session:0x7fc7d99ff700-a00000000005fb @v_tpcds_db2_node0001: 01000/4493: Queries using table "web_returns" may not perform optimally since the data may not be repartitioned in accordance with the new partition expression
HINT: Use "ALTER TABLE tpc1gb.web_returns REORGANIZE;" to repartition the data
2020-11-09 17:21:13.877 Init Session:0x7fc7d99ff700-a00000000005fb [Txn] Starting Commit: Txn: a00000000005fb 'ALTER TABLE tpc1gb.web_returns
PARTITION BY ((wr_d_date)::date)
GROUP BY DATE(DATE_TRUNC('MONTH',wr_d_date));' 876
2020-11-09 17:21:13.878 Init Session:0x7fc7d99ff700-a00000000005fb [Util] Task 'TM Mergeout(00)' enabled
2020-11-09 17:21:13.878 Init Session:0x7fc7d99ff700-a00000000005fb [Util] Task 'TM Mergeout(01)' enabled
2020-11-09 17:21:13.878 Init Session:0x7fc7d99ff700-a00000000005fb [Util] Task 'TM Mergeout(02)' enabled
2020-11-09 17:21:13.878 Init Session:0x7fc7d99ff700-a00000000005fb [Util] Task 'TM Mergeout(03)' enabled
2020-11-09 17:21:13.878 Init Session:0x7fc7d99ff700-a00000000005fb [Util] Task 'TM Mergeout(04)' enabled
2020-11-09 17:21:13.878 Init Session:0x7fc7d99ff700-a00000000005fb [Util] Task 'TM Mergeout(05)' enabled
2020-11-09 17:21:13.878 Init Session:0x7fc7d99ff700-a00000000005fb [Util] Task 'TM Mergeout(06)' enabled
2020-11-09 17:21:13.879 Init Session:0x7fc7d99ff700 [Txn] Commit Complete: Txn: a00000000005fb at epoch 0x4f and new global catalog version 877
2020-11-09 17:21:13.879 TM Mergeout(02):0x7fc7caec9700-a0000000000600 [Txn] Begin Txn: a0000000000600 'Mergeout: Tuple Mover'
2020-11-09 17:21:13.879 TM Mergeout(02):0x7fc7caec9700-a0000000000600 [TM] TMService : dequeued a [MERGEOUT] request for the projection 45035996273712834
2020-11-09 17:21:13.879 TM Mergeout(04):0x7fc58cf37700-a00000000005ff [Txn] Begin Txn: a00000000005ff 'Mergeout: Tuple Mover'
2020-11-09 17:21:13.879 TM Mergeout(04):0x7fc58cf37700-a00000000005ff [TM] TMService : dequeued a [MERGEOUT] request for the projection 45035996273712776
2020-11-09 17:21:13.880 TM Mergeout(03):0x7fc7cbecb700-a0000000000601 [Txn] Begin Txn: a0000000000601 'Mergeout: Tuple Mover'
2020-11-09 17:21:13.880 TM Mergeout(03):0x7fc7cbecb700-a0000000000601 [Txn] Rollback Txn: a0000000000601 'Mergeout: Tuple Mover'
2020-11-09 17:21:13.880 TM Mergeout(04):0x7fc58cf37700-a00000000005ff [Main] Handling signal: 11
2020-11-09 17:21:13.880 TM Mergeout(02):0x7fc7caec9700-a0000000000600 [TM] Has more than one job ? No, has eligible threads No, this threadId: 2, minimum stratum # of skipped jobs 65535
2020-11-09 17:21:13.880 TM Mergeout(02):0x7fc7caec9700-a0000000000600 [Txn] Rollback Txn: a0000000000600 'Mergeout: Tuple Mover'
2020-11-09 17:21:13.881 TM Mergeout(01):0x7fc6e712f700-a0000000000602 [Txn] Begin Txn: a0000000000602 'Mergeout: Tuple Mover'
2020-11-09 17:21:13.881 TM Mergeout(03):0x7fc7cbecb700 [Util] Task 'TM Mergeout(03)' enabled
2020-11-09 17:21:13.881 TM Mergeout(02):0x7fc7caec9700-a0000000000603 [Txn] Begin Txn: a0000000000603 'Mergeout: Tuple Mover'
2020-11-09 17:21:13.881 TM Mergeout(01):0x7fc6e712f700-a0000000000602 [Txn] Rollback Txn: a0000000000602 'Mergeout: Tuple Mover'
2020-11-09 17:21:13.881 TM Mergeout(01):0x7fc6e712f700 [Util] Task 'TM Mergeout(01)' enabled
2020-11-09 17:21:13.882 TM Mergeout(02):0x7fc7caec9700-a0000000000603 [Txn] Rollback Txn: a0000000000603 'Mergeout: Tuple Mover'
2020-11-09 17:21:13.882 TM Mergeout(02):0x7fc7caec9700 [Util] Task 'TM Mergeout(02)' enabled
2020-11-09 17:21:13.882 TM Mergeout(00):0x7fc7cb6ca700-a0000000000604 [Txn] Begin Txn: a0000000000604 'Mergeout: Tuple Mover'
2020-11-09 17:21:13.882 TM Mergeout(00):0x7fc7cb6ca700-a0000000000604 [Txn] Rollback Txn: a0000000000604 'Mergeout: Tuple Mover'
2020-11-09 17:21:13.882 TM Mergeout(00):0x7fc7cb6ca700 [Util] Task 'TM Mergeout(00)' enabled
2020-11-09 17:21:13.883 TM Mergeout(05):0x7fc7cc6cc700-a0000000000605 [Txn] Begin Txn: a0000000000605 'Mergeout: Tuple Mover'
2020-11-09 17:21:13.883 TM Mergeout(05):0x7fc7cc6cc700-a0000000000605 [Txn] Rollback Txn: a0000000000605 'Mergeout: Tuple Mover'
2020-11-09 17:21:13.883 TM Mergeout(05):0x7fc7cc6cc700 [Util] Task 'TM Mergeout(05)' enabled
2020-11-09 17:21:13.884 TM Mergeout(06):0x7fc6efd34700-a0000000000606 [Txn] Begin Txn: a0000000000606 'Mergeout: Tuple Mover'
2020-11-09 17:21:13.884 TM Mergeout(06):0x7fc6efd34700-a0000000000606 [Txn] Rollback Txn: a0000000000606 'Mergeout: Tuple Mover'
2020-11-09 17:21:13.884 TM Mergeout(06):0x7fc6efd34700 [Util] Task 'TM Mergeout(06)' enabled
2020-11-09 17:21:14.003 MetadataPoolMonitor:0x7fc6e712f700 [ResourceManager] Update metadata resource pool memory with delta: Memory(KB): 1
2020-11-09 17:21:14.003 MetadataPoolMonitor:0x7fc6e712f700 @v_tpcds_db2_node0001: 00000/7794: Updated metadata pool: Memory(KB): 39801
2020-11-09 17:21:14.080 TM Mergeout(04):0x7fc58cf37700-a00000000005ff [Main] Received fatal signal SIGSEGV.
2020-11-09 17:21:14.080 TM Mergeout(04):0x7fc58cf37700-a00000000005ff [Main] Info: si_code: 128, si_pid: 0, si_uid: 0, si_addr: (nil)
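The relevant lines in the excerpt above are the two from the TM Mergeout(04) thread: "Handling signal: 11" (signal 11 on Linux is SIGSEGV) and "Received fatal signal SIGSEGV". For anyone triaging a longer vertica.log, here is a minimal Python sketch that pulls such lines out together with the thread name and transaction id. The regular expression is an assumption based only on the line layout visible in this thread, not on a documented log format, and find_signal_events is a hypothetical helper name.

```python
import re

# Assumed layout, inferred from the lines in this thread:
# "<date> <time> <thread name>:0x<thread id>[-<txn id>] [<component>] <message>"
LINE_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{3}) "
    r"(?P<thread>.+?):0x[0-9a-f]+(?:-(?P<txn>[0-9a-f]+))? "
    r"\[(?P<component>\w+)\] (?P<message>.*)$"
)

# Message fragments that mark a signal being handled or a fatal signal.
SIGNAL_RE = re.compile(r"Handling signal: \d+|Received fatal signal \w+")

def find_signal_events(lines):
    """Return (timestamp, thread, txn, message) for each signal-related line."""
    events = []
    for line in lines:
        m = LINE_RE.match(line.rstrip("\n"))
        if m and SIGNAL_RE.search(m.group("message")):
            events.append(
                (m.group("ts"), m.group("thread"), m.group("txn"), m.group("message"))
            )
    return events

# Three lines copied from the log excerpt in this thread.
sample_lines = [
    "2020-11-09 17:21:13.879 TM Mergeout(04):0x7fc58cf37700-a00000000005ff [TM] TMService : dequeued a [MERGEOUT] request for the projection 45035996273712776",
    "2020-11-09 17:21:13.880 TM Mergeout(04):0x7fc58cf37700-a00000000005ff [Main] Handling signal: 11",
    "2020-11-09 17:21:14.080 TM Mergeout(04):0x7fc58cf37700-a00000000005ff [Main] Received fatal signal SIGSEGV.",
]

events = find_signal_events(sample_lines)
for ts, thread, txn, message in events:
    print(ts, thread, txn, message)
```

To scan a real file, pass an open file object instead of sample_lines, for example find_signal_events(open(path)) where path points at the node's vertica.log. Searching the same log for the transaction id returned here (a00000000005ff in this case) shows what that thread was doing just before the crash: it had dequeued a MERGEOUT request for projection 45035996273712776.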