Tuple Mover: moving out projections
I found 30 GB of the log entries below being repeated. Any idea what's going on and how it can be avoided?
2015-10-09 09:02:44.632 TM Moveout:0x7fcf30112620 [Txn] <INFO> Commit Complete: Txn: a00000005d876a at epoch 0x5fb22
2015-10-09 09:02:44.632 TM Moveout:0x7fcf30112620 [TM] <INFO> Tuple Mover: moved out projection qr_concept_hashtags_tweets_b0
2015-10-09 09:02:44.632 TM Moveout:0x7fcf30112620-a00000005d876b [Txn] <INFO> Begin Txn: a00000005d876b 'Moveout: Tuple Mover'
2015-10-09 09:02:44.632 TM Moveout:0x7fcf30112620-a00000005d876b [TM] <INFO> Tuple Mover: moving out projection qr_concept_hashtags_tweets_b0
2015-10-09 09:02:44.633 TM Moveout:0x7fcf30112620-a00000005d876b [EE] <INFO> (a00000005d876b) Moveout projection ecr.qr_concept_hashtags_tweets_b0
2015-10-09 09:02:44.633 TM Moveout:0x7fcf30112620-a00000005d876b [EE] <INFO> (a00000005d876b) TM Moveout: Moving out DV data in WOS up to epoch 391970, based on WOS data up to epoch 391969
2015-10-09 09:02:44.633 TM Moveout:0x7fcf30112620-a00000005d876b [EE] <INFO> (a00000005d876b) TM moveout: Wos row count = 0, Wos delete row count = 0
2015-10-09 09:02:44.633 TM Moveout:0x7fcf30112620-a00000005d876b [EE] <INFO> (a00000005d876b) Skipping replay delete due to lack of deletes.
2015-10-09 09:02:44.633 TM Moveout:0x7fcf30112620-a00000005d876b [EE] <INFO> (a00000005d876b) Executing DVWos moveout plans
2015-10-09 09:02:44.633 TM Moveout:0x7fcf30112620-a00000005d876b [EE] <INFO> (a00000005d876b) getMiniROSsForDVWosMoveout: Creating DVWos marker(s)
2015-10-09 09:02:44.633 TM Moveout:0x7fcf30112620-a00000005d876b [EE] <INFO> (a00000005d876b) getMiniROSsForDVWosMoveout: No unmarked DVWOSs to moveout.
2015-10-09 09:02:44.633 TM Moveout:0x7fcf30112620-a00000005d876b [EE] <INFO> (a00000005d876b) Deleting Wos and DV contents
2015-10-09 09:02:44.633 TM Moveout:0x7fcf30112620-a00000005d876b [EE] <INFO> (a00000005d876b) Dropping source WOSs - empty
2015-10-09 09:02:44.633 TM Moveout:0x7fcf30112620-a00000005d876b [EE] <INFO> (a00000005d876b) Moved out 0 bytes
2015-10-09 09:02:44.634 TM Moveout:0x7fcf30112620-a00000005d876b [EE] <INFO> Moveout projection ecr.qr_concept_hashtags_tweets_b0 - done
2015-10-09 09:02:44.634 TM Moveout:0x7fcf30112620-a00000005d876b [Txn] <INFO> Starting Commit: Txn: a00000005d876b 'Moveout: (Table: ecr.qr_concept_hashtags_tweets) (Projection: ecr.qr_concept_hashtags_tweets_b0)'
Comments
Why would you want to avoid this? That is the Tuple Mover doing its job, moving data from WOS into ROS. (Don't worry, there is nothing wrong going on.)
See this link for more details.
Or are you referring to the INFO messages in your vertica.log? (You don't want them anymore?)
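If you want to see what the Tuple Mover is doing without grepping the log, one option is the tuple_mover_operations system table. A minimal sketch (column names from memory, so verify them against your version):
=> SELECT node_name, operation_name, operation_status, table_schema, projection_name
-> FROM tuple_mover_operations WHERE is_executing;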
Is it normal that it would perform this operation for many days, and that it would produce 30 GB of log output?
It looks like Vertica shut down because of it: "Total memory accounted for by FileCache Allocator"
2015-10-10 02:37:06.029 unknown:0x7fcfd4d45700 <LOG> @v_vjanys_node0001: 00000/5071: Total memory accounted for by FileCache Allocator: 0
2015-10-10 02:37:06.029 unknown:0x7fcfd4d45700 <LOG> @v_vjanys_node0001: 00000/3916: Memory usage in FileCache:
DETAIL: Size 536870920: 0 on free list; 0 still in use (0 bytes)
2015-10-10 02:37:06.029 unknown:0x7fcfd4d45700 <LOG> @v_vjanys_node0001: 00000/5071: Total memory accounted for by FileCache Allocator: 0
2015-10-10 02:37:06.029 unknown:0x7fcfd4d45700 <LOG> @v_vjanys_node0001: 00000/3916: Memory usage in FileCache:
DETAIL: Size 536870920: 0 on free list; 0 still in use (0 bytes)
2015-10-10 02:37:06.029 unknown:0x7fcfd4d45700 <LOG> @v_vjanys_node0001: 00000/5071: Total memory accounted for by FileCache Allocator: 0
2015-10-10 02:37:06.029 unknown:0x7fcfd4d45700 <LOG> @v_vjanys_node0001: 00000/3916: Memory usage in FileCache:
DETAIL: Size 536870920: 0 on free list; 0 still in use (0 bytes)
2015-10-10 02:37:06.029 unknown:0x7fcfd4d45700 <LOG> @v_vjanys_node0001: 00000/5071: Total memory accounted for by FileCache Allocator: 0
2015-10-10 02:37:06.029 unknown:0x7fcfd4d45700 <LOG> @v_vjanys_node0001: 00000/3916: Memory usage in FileCache:
DETAIL: Size 536870920: 0 on free list; 0 still in use (0 bytes)
2015-10-10 02:37:06.029 unknown:0x7fcfd4d45700 <LOG> @v_vjanys_node0001: 00000/5071: Total memory accounted for by FileCache Allocator: 0
2015-10-10 02:37:06.029 unknown:0x7fcfd4d45700 <LOG> @v_vjanys_node0001: 00000/3916: Memory usage in FileCache:
DETAIL: Size 536870920: 0 on free list; 0 still in use (0 bytes)
2015-10-10 02:37:06.029 unknown:0x7fcfd4d45700 <LOG> @v_vjanys_node0001: 00000/5071: Total memory accounted for by FileCache Allocator: 0
2015-10-10 02:37:06.029 unknown:0x7fcfd4d45700 <LOG> @v_vjanys_node0001: 00000/3916: Memory usage in FileCache:
DETAIL: Size 1073741832: 0 on free list; 0 still in use (0 bytes)
2015-10-10 02:37:06.029 unknown:0x7fcfd4d45700 <LOG> @v_vjanys_node0001: 00000/5071: Total memory accounted for by FileCache Allocator: 0
2015-10-10 02:37:06.029 unknown:0x7fcfd4d45700 <LOG> @v_vjanys_node0001: 00000/3916: Memory usage in FileCache:
DETAIL: Size 1073741832: 0 on free list; 0 still in use (0 bytes)
2015-10-10 02:37:06.029 unknown:0x7fcfd4d45700 <LOG> @v_vjanys_node0001: 00000/5071: Total memory accounted for by FileCache Allocator: 0
2015-10-10 02:37:06.029 unknown:0x7fcfd4d45700 <LOG> @v_vjanys_node0001: 00000/3916: Memory usage in FileCache:
DETAIL: Size 1073741832: 0 on free list; 0 still in use (0 bytes)
2015-10-10 02:37:06.029 unknown:0x7fcfd4d45700 <LOG> @v_vjanys_node0001: 00000/5071: Total memory accounted for by FileCache Allocator: 0
2015-10-10 02:37:06.029 unknown:0x7fcfd4d45700 <LOG> @v_vjanys_node0001: 00000/3916: Memory usage in FileCache:
DETAIL: Size 1073741832: 0 on free list; 0 still in use (0 bytes)
2015-10-10 02:37:06.029 unknown:0x7fcfd4d45700 <LOG> @v_vjanys_node0001: 00000/5071: Total memory accounted for by FileCache Allocator: 0
2015-10-10 02:37:06.029 unknown:0x7fcfd4d45700 <LOG> @v_vjanys_node0001: 00000/3916: Memory usage in FileCache:
DETAIL: Size 1073741832: 0 on free list; 0 still in use (0 bytes)
2015-10-10 02:37:06.029 unknown:0x7fcfd4d45700 <LOG> @v_vjanys_node0001: 00000/5071: Total memory accounted for by FileCache Allocator: 0
2015-10-10 02:37:06.029 unknown:0x7fcfd4d45700 <LOG> @v_vjanys_node0001: 00000/3916: Memory usage in FileCache:
DETAIL: Size 1073741832: 0 on free list; 0 still in use (0 bytes)
2015-10-10 02:37:06.029 unknown:0x7fcfd4d45700 <LOG> @v_vjanys_node0001: 00000/5071: Total memory accounted for by FileCache Allocator: 0
2015-10-10 02:37:06.029 unknown:0x7fcfd4d45700 <LOG> @v_vjanys_node0001: 00000/3916: Memory usage in FileCache:
DETAIL: Size 1073741832: 0 on free list; 0 still in use (0 bytes)
2015-10-10 02:37:06.029 unknown:0x7fcfd4d45700 <LOG> @v_vjanys_node0001: 00000/5071: Total memory accounted for by FileCache Allocator: 0
2015-10-10 02:37:06.030 unknown:0x7fcfd4d45700 <LOG> @v_vjanys_node0001: 00000/3916: Memory usage in FileCache:
DETAIL: Size 1073741832: 0 on free list; 0 still in use (0 bytes)
2015-10-10 02:37:06.030 unknown:0x7fcfd4d45700 <LOG> @v_vjanys_node0001: 00000/5071: Total memory accounted for by FileCache Allocator: 0
2015-10-10 02:37:06.030 unknown:0x7fcfd4d45700 [SAL] <INFO> Typical LRU usage: 0 free 0 in use
2015-10-10 02:37:06.030 unknown:0x7fcfd4d45700 [SAL] <INFO> Large LRU usage: 0 free 0 in use
2015-10-10 02:37:06.030 unknown:0x7fcfd4d45700 [SAL] <INFO> Typical LRU usage: 0 free 0 in use
2015-10-10 02:37:06.030 unknown:0x7fcfd4d45700 [SAL] <INFO> Large LRU usage: 0 free 0 in use
2015-10-10 02:37:06.030 unknown:0x7fcfd4d45700 [SAL] <INFO> Typical LRU usage: 0 free 0 in use
2015-10-10 02:37:06.030 unknown:0x7fcfd4d45700 [SAL] <INFO> Large LRU usage: 0 free 0 in use
2015-10-10 02:37:06.030 unknown:0x7fcfd4d45700 [SAL] <INFO> Typical LRU usage: 0 free 0 in use
2015-10-10 02:37:06.030 unknown:0x7fcfd4d45700 [SAL] <INFO> Large LRU usage: 0 free 0 in use
2015-10-10 02:37:06.030 unknown:0x7fcfd4d45700 [SAL] <INFO> Typical LRU usage: 0 free 0 in use
2015-10-10 02:37:06.030 unknown:0x7fcfd4d45700 [SAL] <INFO> Large LRU usage: 0 free 0 in use
2015-10-10 02:37:06.030 unknown:0x7fcfd4d45700 [SAL] <INFO> Typical LRU usage: 0 free 0 in use
2015-10-10 02:37:06.030 unknown:0x7fcfd4d45700 [SAL] <INFO> Large LRU usage: 0 free 0 in use
2015-10-10 02:37:06.030 unknown:0x7fcfd4d45700 [SAL] <INFO> Typical LRU usage: 0 free 0 in use
2015-10-10 02:37:06.030 unknown:0x7fcfd4d45700 [SAL] <INFO> Large LRU usage: 0 free 0 in use
2015-10-10 02:37:06.030 unknown:0x7fcfd4d45700 [SAL] <INFO> Typical LRU usage: 0 free 0 in use
2015-10-10 02:37:06.030 unknown:0x7fcfd4d45700 [SAL] <INFO> Large LRU usage: 0 free 0 in use
2015-10-10 02:37:06.033 unknown:0x7fcfd4d45700 [Init] <INFO> Global pool memory usage: NewPool(0x4735320) 'GlobalPool': totalDtors 0 totalSize 132120576 (17364488 unused) totalChunks 6
2015-10-10 02:37:06.033 unknown:0x7fcfd4d45700 [Init] <INFO> SAL global pool memory usage: NewPool(0x4725380) 'SALGlobalPool': totalDtors 0 totalSize 3395289088 (2065753376 unused) totalChunks 47
2015-10-10 02:37:06.033 unknown:0x7fcfd4d45700 [Init] <INFO> SS::stopPoller()
2015-10-10 02:37:06.034 unknown:0x7fcfd4d45700 [Init] <INFO> DC::shutDown()
2015-10-10 02:37:06.034 unknown:0x7fcfd4d45700 [Init] <INFO> Shutdown complete. Exiting.
When you talk about 30 GB of output, are you referring to the size of your vertica.log file?
Yes, usually the vertica.log file (until it is rotated) is around 1M
I don't know if the error in the log is related to the fact that your log file is huge.
But what you can do is check your logrotate configuration and make sure it is set up for your use case.
Example of how I set up mine (a sketch follows this list):
-d - the database name
-r - the rotation interval
-k - how many rotated logs to keep
This should keep the wasted space under control.
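For example, something along these lines with the admintools logrotate tool (the values here are only placeholders; adjust them to your environment):
/opt/vertica/bin/admintools -t logrotate -d <db_name> -r daily -k 5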
It does seem that Vertica crashed because of a Tuple Mover operation. See the post above for the error: "Total memory accounted for by FileCache Allocator"
You are mixing problems! You were facing a huge log file issue; now you are looking at a Vertica crash error!
One at a time. Maybe one is the cause of the other, or they might have nothing to do with one another!
Restart your DB, set up your logrotate properly, and monitor your database.
Yes, I've had three crashes in the past week and they all had large logs as a warning sign (i.e. I can tell when a crash is going to occur when the vertica.log starts expanding beyond what it should be). If it sounds like there are multiple issues in this thread, it's because I'm still investigating the cause of these crashes. Initially, I found that the crash was preceded by 30 GB of Tuple Mover operations (which is unusual). Ultimately, it seems like Vertica is hitting a memory issue:
Total memory accounted for by FileCache Allocator
It's a three-node cluster and one node is recovering. How can I (without reinstalling) erase all the data and all the history (all the epochs) and get back to square one WITH my schemas? I believe this would solve my problems, since Vertica is trying to get back to an epoch during recovery while data is constantly being added and deleted.
So you have had one node in a recovery state all this time?
Can you show me the output of this query:
Also:
Maybe the node is not coming back because there are projections stuck in recovery.
What is the output of these queries?
=> select get_ahm_epoch(),get_last_good_epoch(),get_current_epoch();
get_ahm_epoch | get_last_good_epoch | get_current_epoch
---------------+---------------------+-------------------
368004 | 398028 | 398029
(1 row)
=> select * from recovery_status where is_running;
node_name | recover_epoch | recovery_phase | splits_completed | splits_total | historical_completed | historical_total | current_completed | current_total | is_running
-------------------+---------------+-------------------+------------------+--------------+----------------------+------------------+-------------------+---------------+------------
v_vjanys_node0003 | 395993 | historical pass 1 | 0 | 0 | 0 | 30 | 0 | 0 | t
(1 row)
The query "select * from vs_projection_recoveries WHERE status not in ('finished', 'ignored') AND start_time > '2015-10-14';" results in 30 rows
=> select status, count(*) from vs_projection_recoveries WHERE status not in ('finished', 'ignored') AND start_time > '2015-10-14' group by status;
status | count
-------------+-------
error-fatal | 28
running | 2
(2 rows)
What is the output of:
And also:
=> SELECT node_name, projection_name, method, status, progress, detail, start_time
-> FROM projection_recoveries where status = 'error-fatal';
node_name | projection_name | method | status | progress | detail | start_time
-------------------+-----------------------------------+---------------------------+-------------+----------+--------+-------------------------------
v_vg_node0002 | ecr.qr_b0 | incremental-replay-delete | error-fatal | | | 2015-06-27 03:31:03.966205+00
v_vg_node0002 | ecr.qr_b1 | incremental-replay-delete | error-fatal | | | 2015-06-27 03:32:09.275683+00
v_vg_node0002 | ecr.qr_cp_ht_b0 | | error-fatal | | | 2015-06-30 05:46:43.562144+00
v_vg_node0002 | ecr.qr_cp_ht_b1 | | error-fatal | | | 2015-06-30 05:46:43.639671+00
v_vg_node0002 | ecr.qr_cp_tk_b0 | | error-fatal | | | 2015-06-30 05:46:43.64275+00
v_vg_node0002 | ecr.qr_cp_tk_b1 | | error-fatal | | | 2015-06-30 05:46:43.64573+00
v_vg_node0002 | ecr.qr_cps_b0 | | error-fatal | | | 2015-06-30 05:46:43.648709+00
v_vg_node0002 | ecr.qr_cps_b1 | | error-fatal | | | 2015-06-30 05:46:43.651725+00
v_vg_node0002 | ecr.qr_u_cm_b0 | | error-fatal | | | 2015-06-30 05:46:43.654681+00
v_vg_node0002 | ecr.qr_u_cm_b1 | | error-fatal | | | 2015-06-30 05:46:43.657712+00
v_vg_node0002 | ecr.qr_u_in_b0 | | error-fatal | | | 2015-06-30 05:46:43.660662+00
v_vg_node0002 | ecr.qr_u_in_b1 | | error-fatal | | | 2015-06-30 05:46:43.663598+00
v_vg_node0002 | ecr.ct_b0 | | error-fatal | | | 2015-06-30 05:46:43.66651+00
v_vg_node0002 | ecr.ct_b1 | | error-fatal | | | 2015-06-30 05:46:43.669569+00
v_vg_node0002 | ecr.ct_tk_b0 | | error-fatal | | | 2015-06-30 05:46:43.67267+00
v_vg_node0002 | ecr.ct_tk_b1 | | error-fatal | | | 2015-06-30 05:46:43.675722+00
v_vg_node0002 | ecr.ct_ht_b0 | | error-fatal | | | 2015-06-30 05:46:43.678729+00
v_vg_node0002 | ecr.ct_ht_b1 | | error-fatal | | | 2015-06-30 05:46:43.681869+00
v_vg_node0002 | ecr.ct_basic_b0 | | error-fatal | | | 2015-06-30 05:46:43.684977+00
v_vg_node0002 | ecr.ct_basic_b1 | | error-fatal | | | 2015-06-30 05:46:43.68796+00
v_vg_node0002 | ecr.ct_basic_tk_b0 | | error-fatal | | | 2015-06-30 05:46:43.690908+00
v_vg_node0002 | ecr.ct_basic_tk_b1 | | error-fatal | | | 2015-06-30 05:46:43.69391+00
v_vg_node0002 | ecr.ct_basic_ht_b0 | | error-fatal | | | 2015-06-30 05:46:43.696908+00
v_vg_node0002 | ecr.ct_basic_ht_b1 | | error-fatal | | | 2015-06-30 05:46:43.700011+00
v_vg_node0002 | ecr.qr_pt_b1 | | error-fatal | | | 2015-06-30 05:46:43.707813+00
v_vg_node0002 | ecr.qr_pt_u_cm_b0 | | error-fatal | | | 2015-06-30 05:46:43.7126+00
v_vg_node0002 | ecr.qr_pt_u_cm_b1 | | error-fatal | | | 2015-06-30 05:46:43.716197+00
v_vg_node0002 | ecr.qr_pt_u_in_b0 | | error-fatal | | | 2015-06-30 05:46:43.719802+00
v_vg_node0002 | ecr.qr_pt_u_in_b1 | | error-fatal | | | 2015-06-30 05:46:43.723406+00
v_vg_node0002 | ecr.qr_pt_cp_ht_b0 | | error-fatal | | | 2015-06-30 05:46:43.727033+00
v_vg_node0002 | ecr.qr_pt_cp_ht_b1 | | error-fatal | | | 2015-06-30 05:46:43.730709+00
v_vg_node0003 | ecr.qr_b1 | incremental-replay-delete | error-fatal | | | 2015-10-12 06:21:50.847503+00
v_vg_node0003 | ecr.qr_b0 | incremental-replay-delete | error-fatal | | | 2015-10-12 06:21:50.849369+00
v_vg_node0003 | ecr.qr_cp_ht_b0 | | error-fatal | | | 2015-10-14 02:25:29.372474+00
v_vg_node0003 | ecr.qr_cp_ht_b1 | | error-fatal | | | 2015-10-14 02:25:29.377422+00
v_vg_node0003 | ecr.qr_cp_tk_b0 | | error-fatal | | | 2015-10-14 02:25:29.462157+00
v_vg_node0003 | ecr.qr_cp_tk_b1 | | error-fatal | | | 2015-10-14 02:25:29.483863+00
v_vg_node0003 | ecr.qr_cps_b0 | | error-fatal | | | 2015-10-14 02:25:29.49422+00
v_vg_node0003 | ecr.qr_cps_b1 | | error-fatal | | | 2015-10-14 02:25:29.496237+00
v_vg_node0003 | ecr.qr_u_cm_b0 | | error-fatal | | | 2015-10-14 02:25:29.498551+00
v_vg_node0003 | ecr.qr_u_cm_b1 | | error-fatal | | | 2015-10-14 02:25:29.500917+00
v_vg_node0003 | ecr.qr_u_in_b0 | | error-fatal | | | 2015-10-14 02:25:29.501604+00
v_vg_node0003 | ecr.qr_u_in_b1 | | error-fatal | | | 2015-10-14 02:25:29.503817+00
v_vg_node0003 | ecr.ct_b0 | | error-fatal | | | 2015-10-14 02:25:29.505858+00
v_vg_node0003 | ecr.ct_b1 | | error-fatal | | | 2015-10-14 02:25:29.602674+00
v_vg_node0003 | ecr.ct_ht_b0 | | error-fatal | | | 2015-10-14 02:25:29.605933+00
v_vg_node0003 | ecr.ct_ht_b1 | | error-fatal | | | 2015-10-14 02:25:29.605477+00
v_vg_node0003 | ecr.ct_basic_b0 | | error-fatal | | | 2015-10-14 02:25:29.607644+00
v_vg_node0003 | ecr.ct_basic_b1 | | error-fatal | | | 2015-10-14 02:25:29.609904+00
v_vg_node0003 | ecr.ct_basic_ht_b0 | | error-fatal | | | 2015-10-14 02:25:29.611451+00
v_vg_node0003 | ecr.ct_basic_ht_b1 | | error-fatal | | | 2015-10-14 02:25:29.61236+00
v_vg_node0003 | ecr.qr_pt_b0 | | error-fatal | | | 2015-10-14 02:25:29.61452+00
v_vg_node0003 | ecr.qr_pt_b1 | | error-fatal | | | 2015-10-14 02:25:29.615586+00
v_vg_node0003 | ecr.qr_pt_u_cm_b0 | | error-fatal | | | 2015-10-14 02:25:29.622902+00
v_vg_node0003 | ecr.qr_pt_u_cm_b1 | | error-fatal | | | 2015-10-14 02:25:29.639683+00
v_vg_node0003 | ecr.qr_pt_u_in_b0 | | error-fatal | | | 2015-10-14 02:25:29.640807+00
v_vg_node0003 | ecr.qr_pt_u_in_b1 | | error-fatal | | | 2015-10-14 02:25:29.641984+00
v_vg_node0003 | ecr.qr_pt_cp_ht_b0 | | error-fatal | | | 2015-10-14 02:25:29.644083+00
v_vg_node0003 | ecr.qr_pt_cp_ht_b1 | | error-fatal | | | 2015-10-14 02:25:29.646396+00
v_vg_node0003 | ecr.qr_cp_ht_t_b0 | | error-fatal | | | 2015-10-14 02:25:29.647649+00
v_vg_node0003 | ecr.qr_cp_ht_t_b1 | | error-fatal | | | 2015-10-14 02:25:29.649551+00
(62 rows)
=> SELECT node_name, projection_name, method, status, progress, detail, start_time
dbadmin-> FROM projection_recoveries where status = 'running';
node_name | projection_name | method | status | progress | detail | start_time
-------------------+-----------------+---------------------------+---------+----------+---------------------------+-------------------------------
v_vg_node0003 | ecr.qr_b0 | incremental-replay-delete | running | 37 | Delete:26574848/517714984 | 2015-10-14 03:10:18.976988+00
v_vg_node0003 | ecr.qr_b1 | incremental-replay-delete | running | 37 | Delete:27181056/488836330 | 2015-10-14 03:10:18.974554+00
(2 rows)
Try to run make_ahm_now(true):
- to skip the replay part and do recovery. This will force the AHM to advance even with one of your nodes not in the UP state.
- This will allow the node to be rebuilt from scratch.
See more details about this function here.
Let me know the output.
=> select make_ahm_now(true);
make_ahm_now
---------------------------------
AHM set (New AHM Epoch: 400056)
(1 row)
Is there a way I can start back at epoch 0? Without reinstalling, is there a way I can delete all the data and all the history and get back to how the system was when I first installed it while keeping my current schemas and projections?
What is the recovery status now?
Try to restart the bad node!
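If it does not come back up on its own, you can restart it with admintools; a sketch (the database name and host are placeholders for your bad node):
/opt/vertica/bin/admintools -t restart_node -d <db_name> -s <host_of_bad_node>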