AHM and LGE issue
Our client has LGE=AHM, which is more than 2 months in the past.
What could be the reason LGE is not advancing past AHM?
Rolling back to LGE does not make any sense because too much data would be lost.
Would MAKE_AHM_NOW be the right solution, or should we investigate and find the reason why LGE is so old?
Is MAKE_AHM_NOW safe to issue, or should we check something else beforehand?
revolution=> select * from SYSTEM;
-[ RECORD 1 ]------------+--------------
current_epoch | 9530615
ahm_epoch | 5735993
last_good_epoch | 5735993
AHM = 01.03.2015
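For readers who want the same figures without scanning the whole SYSTEM row, a minimal sketch using Vertica's built-in epoch functions might look like this. It assumes a version that provides GET_CURRENT_EPOCH, GET_LAST_GOOD_EPOCH, GET_AHM_EPOCH and GET_AHM_TIME; the column aliases are only illustrative.

```sql
-- Report the three epochs side by side, plus the AHM as a timestamp.
-- The aliases are illustrative and not part of any Vertica API.
SELECT GET_CURRENT_EPOCH()   AS current_epoch,
       GET_LAST_GOOD_EPOCH() AS last_good_epoch,
       GET_AHM_EPOCH()       AS ahm_epoch,
       GET_AHM_TIME()        AS ahm_time;
```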
Comments
Just found some errors in their log.
It looks like moveout fails because of too many partitions, so LGE cannot advance.
An old LGE is a sign that some data is still lingering in memory (WOS) on at least one node, and that keeps both LGE and AHM from advancing. You could run a moveout and then check which tables still contain data in memory:
select do_tm_task('moveout');
select * from projection_storage where wos_used_bytes > 0;
If rows are still returned after the moveout, something is preventing the data from being moved to disk, and this in turn can keep LGE from advancing.
Under no circumstances recover from LGE, because you will lose all data loaded since that epoch.
This is just a starting point for the investigation; hopefully it will give you a helpful lead.
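A narrower version of the projection_storage check can make the offending table easier to spot. This is only a sketch: it assumes a Vertica version that still has a WOS and relies on the node_name, projection_schema, anchor_table_name, projection_name, wos_row_count and wos_used_bytes columns of PROJECTION_STORAGE.

```sql
-- List projections that still hold rows in WOS, largest first,
-- so the table blocking moveout stands out per node.
SELECT node_name,
       projection_schema,
       anchor_table_name,
       projection_name,
       wos_row_count,
       wos_used_bytes
FROM projection_storage
WHERE wos_used_bytes > 0
ORDER BY wos_used_bytes DESC;
```

If the same anchor table shows up on every run after a moveout, that table is the most likely candidate for the failure.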
Regards,
OK, the partitioning is too granular, and internally that leads to repeated moveout failures. Once the partitioning is addressed by making the grain coarser, the downstream problems should disappear. Unfortunately, this is a case where a single table can upset behavior for the entire cluster. Should the cluster fail now, you would need to restart from LGE.
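As a rough illustration of the partitioning point, the sketch below first counts distinct partition keys per projection and node, then shows what a coarser monthly partition expression could look like. The PARTITIONS system table only reflects data already in ROS, so rows stuck in WOS will not be counted. The table my_schema.events and the column event_date are hypothetical placeholders, not names from this thread, and whether the reorganization can run while moveout is failing is an assumption that should be tested outside production first.

```sql
-- Count distinct partition keys per projection and node (ROS data only).
SELECT table_schema,
       projection_name,
       node_name,
       COUNT(DISTINCT partition_key) AS partition_count
FROM partitions
GROUP BY table_schema, projection_name, node_name
ORDER BY partition_count DESC
LIMIT 20;

-- Hypothetical example: my_schema.events and event_date are placeholders.
-- Repartition by year and month instead of a finer grain such as day.
ALTER TABLE my_schema.events
  PARTITION BY EXTRACT(year FROM event_date) * 100 + EXTRACT(month FROM event_date)
  REORGANIZE;
```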
Thanks a lot!
This is what I found in their log:
EventCodeDescription: Timer Service Task Error
ProblemDescription: threadShim: Too many data partitions
DatabaseName: revolution
Hostname: ...
2015-06-06 12:10:21.034 TM Moveout:0x7fd1fc010c50 <ERROR> ... 54000/5060: Too many data partitions
HINT: Verify that the table partitioning expression is correct
LOCATION: handlePartitionKey, /scratch_a/release/vbuild/vertica/EE/Operators/DataTarget.cpp
Apart from data stuck in WOS, can this error cause the same problem with LGE not advancing?
If data cannot be moved from WOS to ROS, then you have a problem.
Sometimes, in order to keep your cluster up and consistent, you will need to drop the object (table) responsible for the hold-up and rebuild it so that LGE can advance.
Too many partitions end up as too many containers, which is not good: it means too much work for your database.
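To see how partitions translate into containers, one option is to rank projections by their ROS container count. This is a sketch that assumes the ros_count column of PROJECTION_STORAGE; the limit of 20 rows is arbitrary.

```sql
-- Rank projections by number of ROS containers per node.
-- Unusually high counts often point at an over-partitioned table.
SELECT node_name,
       projection_schema,
       projection_name,
       ros_count
FROM projection_storage
ORDER BY ros_count DESC
LIMIT 20;
```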
Resolution for you might be:
After that, make sure your LGE is up to date.
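One way to check this is to compare the epochs in the SYSTEM table, as in the sketch below. The epochs_behind alias is illustrative; a small and shrinking gap after a successful moveout would suggest LGE is advancing again.

```sql
-- Compare LGE with the current epoch; rerun after a moveout
-- and confirm that epochs_behind drops.
SELECT current_epoch,
       last_good_epoch,
       ahm_epoch,
       current_epoch - last_good_epoch AS epochs_behind
FROM system;
```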
Adrian, thanks a lot.
We will try your recommendations.