Rebalancing stuck at 0%

ChenSha · March 2016

Hi all

I added 10 nodes to my 10-node cluster (7.1.1-10).
After 26 hours (50TB), the rebalance has only 1 more table to go (out of 156 tables), and that table is 22TB.

select * from REBALANCE_TABLE_STATUS

But the rebalance doesn't seem to be running: the rebalance separated_percent is 0%, and in the locks table I'm getting:

And in the sessions table I see that the statement_id is null where the current_statement is "select rebalance_cluster();"

Does anyone know why the rebalancing doesn't run?

eli_revach · March 2016

Hi.

This SQL will give you a deeper understanding of what is going on with your rebalance task:

SELECT node_name, session_id, session_start_timestamp, description
 FROM system_sessions
 WHERE session_type = 'REBALANCE_CLUSTER'

I hope you will find it useful

Thanks

ChenSha · March 2016

Hi,

The only thing I'm getting from the query:

SELECT node_name, session_id, session_start_timestamp, description
FROM system_sessions
WHERE session_type = 'REBALANCE_CLUSTER'
and description is not null

is results on only 1 node (node010) with the description "Txn: 130000012212eb0 'rebalance_cluster(background)'"

And "select * from rebalance_projection_status" still shows 0% ...

ChenSha · March 2016

Maybe I can stop the rebalancing and rerun it?

I'm afraid that this will do more damage…

I can run CANCEL_REBALANCE_CLUSTER(), but as the Vertica documentation says:

A rebalance operation can take some time, depending on the number of projections and the amount of data they contain. HP recommends that you allow the process to complete uninterrupted. If you must cancel the operation, call the CANCEL_REBALANCE_CLUSTER function.

eli_revach · March 2016

Hi
Check your disk space availability; rebalance needs extra disk space.

ChenSha · March 2016

I think I have a problem here... If I select from REBALANCE_TABLE_STATUS, I get the last table that still needs to be rebalanced:

to_separate_bytes - 666,591,574,912 (620.8GB)

to_transfer_bytes - 12,187,573,777,220 (11TB)

The free disk space that I have:

node_name  | disk_space_free_gb | disk_space_used_gb | disk_space_total_gb
-----------+--------------------+--------------------+---------------------
v_node0010 | 1700.03            | 3829.36            | 5529.40
v_node0009 | 1772.43            | 3756.96            | 5529.40
v_node0008 | 1767.48            | 3761.91            | 5529.40
v_node0007 | 1680.23            | 3849.17            | 5529.40
v_node0006 | 1537.57            | 3991.82            | 5529.40
v_node0005 | 1567.73            | 3961.66            | 5529.40
v_node0004 | 1645.27            | 3884.12            | 5529.40
v_node0003 | 1592.93            | 3936.46            | 5529.40
v_node0002 | 1733.18            | 3796.21            | 5529.40
v_node0001 | 1646.83            | 3882.56            | 5529.40

v_node0011 | 3809.11            | 1720.29            | 5529.40
v_node0012 | 3967.53            | 1561.86            | 5529.40
v_node0013 | 3889.65            | 1639.74            | 5529.40
v_node0014 | 3920.00            | 1609.39            | 5529.40
v_node0015 | 3974.78            | 1554.61            | 5529.40
v_node0016 | 3933.21            | 1596.18            | 5529.40
v_node0017 | 3932.19            | 1597.20            | 5529.40
v_node0018 | 3974.50            | 1554.89            | 5529.40
v_node0019 | 3966.00            | 1563.39            | 5529.40
v_node0020 | 3952.15            | 1577.25            | 5529.40

The table is spread across nodes 1-10, and I need it rebalanced onto the new nodes too (11-20).

Is that a problem?

If so... what can I do?
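
For a quick sanity check of the figures above, here is a small Python sketch that converts the two byte counts to binary units and works out the free-space fraction for the old and new node groups. It is plain arithmetic on the values posted above, not a Vertica tool; the variable and function names are my own, and I am assuming all nodes have the same total of 5529.40 GB as the table shows.

```python
# Illustrative arithmetic only; all values are copied from the post above.
GIB = 1024 ** 3
TIB = 1024 ** 4

to_separate_bytes = 666_591_574_912
to_transfer_bytes = 12_187_573_777_220

# disk_space_free_gb per node, in the order listed in the table
old_nodes_free_gb = [1700.03, 1772.43, 1767.48, 1680.23, 1537.57,
                     1567.73, 1645.27, 1592.93, 1733.18, 1646.83]
new_nodes_free_gb = [3809.11, 3967.53, 3889.65, 3920.00, 3974.78,
                     3933.21, 3932.19, 3974.50, 3966.00, 3952.15]
total_gb_per_node = 5529.40


def free_fraction(free_gb_list, total_gb):
    """Return (lowest, average) free-space fraction for a group of nodes."""
    fractions = [free / total_gb for free in free_gb_list]
    return min(fractions), sum(fractions) / len(fractions)


print(f"to_separate: {to_separate_bytes / GIB:.1f} GiB")
print(f"to_transfer: {to_transfer_bytes / TIB:.2f} TiB")

for label, group in (("nodes 1-10", old_nodes_free_gb),
                     ("nodes 11-20", new_nodes_free_gb)):
    lowest, average = free_fraction(group, total_gb_per_node)
    print(f"{label}: lowest free {lowest:.1%}, average free {average:.1%}")
```

The lowest value per group is printed alongside the average because the fullest node is usually the one that matters for a free-space guideline.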

eli_revach · March 2016

Hi

Best practice is 40% free disk space; otherwise the rebalance can be very slow and will process the task in many small phases until it completes.

Something to consider:

Rebalance uses extra I/O and network resources. You can monitor your rebalance task by monitoring your network and I/O subsystems with the vioperf and netperf utilities; this will give you an indication of whether the process is hung or running.

I hope you will find it useful

Thanks

ChenSha · March 2016

Hi Eli, thank you for your answers.

A couple of things:

40% free disk space where? On the first 10 nodes (where the table is)? On the 10 new nodes (the ones I added)? Because I have 30% free disk space on each of the first 10 nodes and 50% free disk space on each of the 10 new nodes.
By "very slow", do you mean it is possible that after 3 days the rebalance separated_percent is still 0%?
In my situation, is there a way I can complete the rebalance? Speed it up? How can I handle it?
How can I monitor my rebalance task using the vioperf and netperf utilities? I know those utilities, but I'll be more than happy to hear how I can follow the task using their results.

Thank you very much for your help.

Chen

ChenSha · March 2016

Ok... after 3 days (74 hours) all tables are rebalanced... done

eli_revach · March 2016

Chen,

It looks like you are OK in terms of disk space.

Unix utilities will give you an indication of whether Vertica is running the rebalance or is just hanging (assuming no other activity is taking place in your cluster during the rebalance), e.g. if you see massive I/O activity on the data FS (df -h is also an option for you).

More options to monitor progress:

1) Rebalance refreshes projections, so you can take a look at dc_projection_checkpoint_epochs to see whether new epochs are being created for your projections.

2) Take the transaction_id and statement_id values assigned to your task from dc_rebalanced_projections and query the execution_engine_profiles table; this will show real-time active stats.
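
The same "hung or running" question can also be asked of the rebalance status counters: sample them twice and see whether anything moved. A minimal Python sketch, assuming you have already fetched separated_bytes and transferred_bytes into plain dicts (I am assuming those column names from REBALANCE_TABLE_STATUS; check them on your version). The function name and the sample numbers are hypothetical, for illustration only.

```python
def rebalance_progress(earlier, later, interval_sec):
    """Compare two samples of the byte counters and return bytes/sec moved.

    A rate of 0.0 over a long interval suggests the task may be stalled,
    although a single long-running phase could also look like this.
    """
    if interval_sec <= 0:
        raise ValueError("interval_sec must be positive")
    counters = ("separated_bytes", "transferred_bytes")
    moved = sum(later[name] - earlier[name] for name in counters)
    return moved / interval_sec


# Hypothetical samples taken 600 seconds apart (made-up numbers).
sample_1 = {"separated_bytes": 0, "transferred_bytes": 0}
sample_2 = {"separated_bytes": 6_000_000_000, "transferred_bytes": 0}

rate = rebalance_progress(sample_1, sample_2, 600)
print(f"{rate / 1e6:.1f} MB/s moved between samples")
```

If the counters stay flat while the I/O and network tools also show nothing, that points more strongly to a stalled task than either signal alone.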

Thanks.
