AHM and LGE issue

ankir · June 2015

Our client has LGE = AHM, and both are more than 2 months back in time.

What can be the reason LGE is not advancing from AHM?

Rollback to LGE does not make any sense because too much data would be lost.

Would MAKE_AHM_NOW be the right solution, or should we investigate and find the reason why LGE is so old?

Is MAKE_AHM_NOW safe to issue, or should we check something else beforehand?

revolution=> select * from SYSTEM;

-[ RECORD 1 ]------------+--------------

current_epoch | 9530615

ahm_epoch | 5735993

last_good_epoch | 5735993

AHM = 01.03.2015
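For anyone who wants to read the same three values without scanning the whole SYSTEM row, a minimal sketch using Vertica's epoch functions is below. The column aliases are only illustrative, and the exact output format of GET_AHM_TIME() may differ between versions.

```sql
-- Compare the current epoch with LGE and AHM in one row.
-- A large gap between current_epoch and last_good_epoch is the symptom described above.
SELECT GET_CURRENT_EPOCH()   AS current_epoch,
       GET_LAST_GOOD_EPOCH() AS last_good_epoch,
       GET_AHM_EPOCH()       AS ahm_epoch;

-- Wall-clock time that the AHM corresponds to.
SELECT GET_AHM_TIME();
```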

ankir · June 2015

Just found some errors in their log.

Looks like moveout fails because of too many partitions, so LGE can't advance.

colin_loghin · June 2015

An old LGE is a sign that some data is still lingering in memory (WOS) on one node, and that keeps both LGE and AHM from advancing. You could run a moveout:

select do_tm_task('moveout');

and then check which tables still contain data in memory:

select * from projection_storage where wos_used_bytes > 0;

There might be something preventing the data from being flushed to disk, and this in turn can keep LGE from advancing.
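A slightly narrower version of that check, as a sketch: it forces the moveout and then lists only the columns that identify where WOS data is stuck, largest first. It assumes the standard PROJECTION_STORAGE columns; nothing here is specific to this cluster.

```sql
-- Ask the Tuple Mover to run a moveout right now.
SELECT DO_TM_TASK('moveout');

-- Anything still listed here did not make it from WOS to ROS.
SELECT node_name,
       anchor_table_schema,
       anchor_table_name,
       projection_name,
       wos_row_count,
       wos_used_bytes
FROM   projection_storage
WHERE  wos_used_bytes > 0
ORDER  BY wos_used_bytes DESC;
```

If the same table keeps showing up after repeated moveouts, that table is the likely reason LGE is held back.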

Under no circumstances recover from LGE, because you will lose all data loaded since that epoch.

This is just an investigation start, hopefully it will yield a helpful thread for you.

Regards,

colin_loghin · June 2015

OK, the partitioning is too granular, and that internally leads to repeated moveout failures. Once the partitioning is addressed by making the grain coarser, the downstream problems should disappear. This is unfortunately one use case where a single table can upset behavior for the entire cluster. Should the cluster fail now, you would need to restart from LGE.
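To see which tables are partitioned most finely, one option is to count distinct partition keys per projection in the PARTITIONS system table. This is only a sketch: PARTITIONS reports data that already sits in ROS containers, so rows still stuck in WOS are not counted, and the LIMIT of 10 is an arbitrary choice.

```sql
-- Projections with the most distinct partition keys on disk.
SELECT table_schema,
       projection_name,
       COUNT(DISTINCT partition_key) AS partition_count
FROM   partitions
GROUP  BY table_schema, projection_name
ORDER  BY partition_count DESC
LIMIT  10;
```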

ankir · June 2015

Thanks a lot!

this is what i found in their log:

EventCodeDescription: Timer Service Task Error
ProblemDescription: threadShim: Too many data partitions
DatabaseName: revolution
Hostname: ...
2015-06-06 12:10:21.034 TM Moveout:0x7fd1fc010c50 <ERROR> ... 54000/5060: Too many data partitions
HINT: Verify that the table partitioning expression is correct
LOCATION: handlePartitionKey, /scratch_a/release/vbuild/vertica/EE/Operators/DataTarget.cpp

Apart from data stuck in WOS, can this error also keep LGE from advancing?

Adrian_Oprea_1 · June 2015

If data cannot be moved from WOS to ROS then you have a problem.

Sometimes, in order to keep your cluster up and consistent, you will need to drop the object (table) responsible for the hold and rebuild it so that the LGE can advance.

Too many partitions end up as too many containers, which is not good: too much work for your database.

A resolution for you might be:

Create a new table with the right partitioning.
Reload the data, either by exporting it into a CSV file and then bulk importing it, or with INSERT ... SELECT operations.
Drop the original table.
Rename the new table to the old table.
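The four steps above could look roughly like this, using the INSERT ... SELECT route. All object names here (public.events, public.events_new, and the columns) are hypothetical placeholders, and the monthly partition expression is just one example of a coarser grain; adapt both to the real table.

```sql
-- 1. New table with a coarser (monthly) partition expression.
--    Table and column names are made up for illustration.
CREATE TABLE public.events_new (
    event_id   INT NOT NULL,
    event_time TIMESTAMP NOT NULL,
    payload    VARCHAR(1000)
)
PARTITION BY EXTRACT(year FROM event_time) * 100 + EXTRACT(month FROM event_time);

-- 2. Copy the data; the DIRECT hint writes straight to ROS and bypasses WOS.
INSERT /*+ DIRECT */ INTO public.events_new
SELECT event_id, event_time, payload
FROM   public.events;
COMMIT;

-- 3. Drop the original table (CASCADE also drops its projections).
DROP TABLE public.events CASCADE;

-- 4. Give the new table the old name.
ALTER TABLE public.events_new RENAME TO events;

-- Afterwards, check that LGE has caught up with the current epoch.
SELECT GET_CURRENT_EPOCH(), GET_LAST_GOOD_EPOCH();
```

Verify row counts between the old and new table before step 3, since the drop cannot be undone.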

After that, make sure your LGE is up to date.

ankir · June 2015

Adrian, thanks a lot.

we will try your recommendations.
