Rebalancing stuck at 0%
Hi all
I added 10 nodes to my 10-node cluster (7.1.1-10).
After 26 hours (50TB) the rebalance has only 1 more table to go (out of 156 tables), a 22TB table.
select * from REBALANCE_TABLE_STATUS
But I see that the rebalance isn't running: separated_percent is 0%, and in the locks table I'm getting:
And in the sessions table I see that statement_id is null where current_statement is "select rebalance_cluster();"
Does anyone know why the rebalancing doesn't run?
Comments
Hi,
This SQL will give you a deeper understanding of what is going on with your rebalance task:
I hope you will find it useful
Thanks
Hi,
The only thing I'm getting from the query:
SELECT node_name, session_id, session_start_timestamp, description
FROM system_sessions
WHERE session_type = 'REBALANCE_CLUSTER'
and description is not null
is results on only 1 node (node010), with the description "Txn: 130000012212eb0 'rebalance_cluster(background)'"
And "select * from rebalance_projection_status" is still at 0%...
Maybe I can stop the rebalancing and rerun it?
I'm afraid that this will do more damage…
I can run CANCEL_REBALANCE_CLUSTER(), but as the Vertica documentation says:
A rebalance operation can take some time, depending on the number of projections and the amount of data they contain. HP recommends that you allow the process to complete uninterrupted. If you must cancel the operation, call the CANCEL_REBALANCE_CLUSTER function.
Hi
Check your disk space availability; rebalance needs extra disk space.
I think I have a problem here... If I select from REBALANCE_TABLE_STATUS, I get the last table that still needs to be rebalanced:
to_separate_bytes - 666,591,574,912 (620.8GB)
to_transfer_bytes - 12,187,573,777,220 (11TB)
The free disk space that I have:
node_name | disk_space_free_gb | disk_space_used_gb | disk_space_total_gb
---------------------------------------------------------------------------------------------------
v_node0010 | 1700.03 | 3829.36 | 5529.40
v_node0009 | 1772.43 | 3756.96 | 5529.40
v_node0008 | 1767.48 | 3761.91 | 5529.40
v_node0007 | 1680.23 | 3849.17 | 5529.40
v_node0006 | 1537.57 | 3991.82 | 5529.40
v_node0005 | 1567.73 | 3961.66 | 5529.40
v_node0004 | 1645.27 | 3884.12 | 5529.40
v_node0003 | 1592.93 | 3936.46 | 5529.40
v_node0002 | 1733.18 | 3796.21 | 5529.40
v_node0001 | 1646.83 | 3882.56 | 5529.40
v_node0011 | 3809.11 | 1720.29 | 5529.40
v_node0012 | 3967.53 | 1561.86 | 5529.40
v_node0013 | 3889.65 | 1639.74 | 5529.40
v_node0014 | 3920.00 | 1609.39 | 5529.40
v_node0015 | 3974.78 | 1554.61 | 5529.40
v_node0016 | 3933.21 | 1596.18 | 5529.40
v_node0017 | 3932.19 | 1597.20 | 5529.40
v_node0018 | 3974.50 | 1554.89 | 5529.40
v_node0019 | 3966.00 | 1563.39 | 5529.40
v_node0020 | 3952.15 | 1577.25 | 5529.40
The table is spread across nodes 1-10 and I need it to rebalance onto the new nodes too (11-20).
Is that a problem?
If so... what can I do?
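As a rough sanity check on the numbers above, the remaining to_transfer_bytes can be compared with the free space on the new nodes. This is only a sketch using the figures posted in this thread. It assumes "GB" means 2**30 bytes (which matches 666,591,574,912 bytes being shown as 620.8GB), and it treats all of the transferred data as landing on nodes 11-20, which is an upper bound and not necessarily how the rebalance distributes it. It also ignores any temporary space the rebalance needs while it works.

```python
# Figures copied from the REBALANCE_TABLE_STATUS row quoted in this post.
TO_SEPARATE_BYTES = 666_591_574_912
TO_TRANSFER_BYTES = 12_187_573_777_220

# disk_space_free_gb for the new nodes v_node0011 .. v_node0020, from the table above.
NEW_NODE_FREE_GB = [3809.11, 3967.53, 3889.65, 3920.00, 3974.78,
                    3933.21, 3932.19, 3974.50, 3966.00, 3952.15]

# Assumption: "GB" in the post means 2**30 bytes.
BYTES_PER_GB = 1024 ** 3


def bytes_to_gb(n_bytes):
    """Convert a byte count to GB (2**30 bytes)."""
    return n_bytes / BYTES_PER_GB


to_separate_gb = bytes_to_gb(TO_SEPARATE_BYTES)
to_transfer_gb = bytes_to_gb(TO_TRANSFER_BYTES)
new_nodes_free_gb = sum(NEW_NODE_FREE_GB)

# Upper bound: pretend every transferred byte lands on a new node,
# split evenly across them.
per_new_node_gb = to_transfer_gb / len(NEW_NODE_FREE_GB)
fits = per_new_node_gb <= min(NEW_NODE_FREE_GB)

print(f"to separate: {to_separate_gb:,.1f} GB")
print(f"to transfer: {to_transfer_gb:,.1f} GB")
print(f"free on new nodes: {new_nodes_free_gb:,.1f} GB")
print(f"per new node (even split): {per_new_node_gb:,.1f} GB, fits: {fits}")
```

With the posted figures this comes to roughly 11,350 GB still to transfer against roughly 39,319 GB free on nodes 11-20, so by this crude measure the new nodes have room for the data.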
Hi
Best practice is 40% free disk space; otherwise the rebalance will be very slow and will process the task in many small phases until it completes.
Something to consider:
Rebalance uses extra I/O and network resources, so you can easily monitor your rebalance task by monitoring your network and I/O subsystems using the vioperf and netperf utilities. This will give you an indication of whether the process is hung or still executing.
I hope you will find it useful
Thanks
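To put the 40% guideline next to the free-space table posted earlier in the thread, here is a small sketch that computes the free fraction per node. The node names and numbers are copied from that table. Applying the 40% threshold to each node individually is an assumption, since the guideline does not say whether it is per node or cluster-wide.

```python
# disk_space_free_gb per node, copied from the table posted earlier in the thread.
FREE_GB = {
    "v_node0001": 1646.83, "v_node0002": 1733.18, "v_node0003": 1592.93,
    "v_node0004": 1645.27, "v_node0005": 1567.73, "v_node0006": 1537.57,
    "v_node0007": 1680.23, "v_node0008": 1767.48, "v_node0009": 1772.43,
    "v_node0010": 1700.03, "v_node0011": 3809.11, "v_node0012": 3967.53,
    "v_node0013": 3889.65, "v_node0014": 3920.00, "v_node0015": 3974.78,
    "v_node0016": 3933.21, "v_node0017": 3932.19, "v_node0018": 3974.50,
    "v_node0019": 3966.00, "v_node0020": 3952.15,
}

# disk_space_total_gb is the same on every node in the posted table.
TOTAL_GB = 5529.40

# The 40% guideline; checking it per node is an assumption.
THRESHOLD = 0.40


def free_fraction(free_gb, total_gb=TOTAL_GB):
    """Fraction of the disk that is free, between 0 and 1."""
    return free_gb / total_gb


fractions = {node: free_fraction(free) for node, free in FREE_GB.items()}
below = sorted(node for node, frac in fractions.items() if frac < THRESHOLD)

for node in sorted(fractions):
    print(f"{node}: {fractions[node]:.1%} free")
print("below 40% free:", below)
```

With the posted figures, nodes 1-10 come out at roughly 28-32% free and nodes 11-20 at roughly 69-72% free.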
Hi Eli, thank you for your answers.
Couple of things:
Thank you very much for your help
Chen
Ok... after 3 days (74 hours) all tables are rebalanced... done
Chen,
Looks like in terms of disk space you are OK.
Unix utilities will give you an indication of whether Vertica is running the rebalance or is just hanging (assuming no other activity is taking place in your cluster during the rebalance), e.g. if you see massive I/O activity on the /data FS (df -h is also an option for you).
More options to monitor progress:
1) Rebalance refreshes projections, so you can look at dc_projection_checkpoint_epochs to see whether new epochs are being created for your projections.
2) Take the transaction_id and statement_id values assigned to your task from dc_rebalanced_projections and query the execution_engine_profiles table; this will show real-time active stats.
Thanks.