AHM and LGE issue

Our client has LGE = AHM, and both are more than 2 months in the past.

 

What can be the reason LGE is not advancing from AHM?

 

Rolling back to LGE does not make sense because too much data would be lost.

 

Would MAKE_AHM_NOW be the right solution, or should we investigate and find out why LGE is so old?

 

Is MAKE_AHM_NOW safe to issue, or should we check something else beforehand?

 

revolution=> select * from SYSTEM;
-[ RECORD 1 ]------------+--------------
current_epoch            | 9530615
ahm_epoch                | 5735993
last_good_epoch          | 5735993

 

AHM = 01.03.2015

 

 

 

Comments

  • I just found some errors in their log.

    It looks like moveout fails because of too many partitions, so LGE cannot advance.

  • An old LGE is a sign that some data is still lingering in memory (WOS) on at least one node, and that keeps both LGE and AHM from advancing. You could run a moveout and then check which tables still hold data in memory:

    select do_tm_task('moveout');

    select * from projection_storage where wos_used_bytes > 0;

    Something may be preventing the data from being moved to disk, and that in turn can keep LGE from advancing.
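
    To see which node and table are holding the data, the check above can be narrowed to a few columns. This is only a sketch: it assumes the usual PROJECTION_STORAGE columns (node_name, anchor_table_schema, anchor_table_name, wos_row_count) are present in your version.

    ```sql
    -- Run after the moveout attempt. Rows that remain point at the
    -- projections (and nodes) whose WOS data is holding LGE back.
    SELECT node_name,
           anchor_table_schema,
           anchor_table_name,
           projection_name,
           wos_row_count,
           wos_used_bytes
    FROM projection_storage
    WHERE wos_used_bytes > 0
    ORDER BY wos_used_bytes DESC;
    ```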

     

    Under no circumstances recover from LGE, because you will lose all data loaded since that epoch.

     

    This is just a starting point for the investigation; hopefully it gives you a useful lead.

     

    Regards,

     

  • OK, the partitioning is too granular, and that leads to repeated moveout failures. Once the partitioning is made coarser, the downstream problems should disappear. Unfortunately, this is one case where a single table can upset the behavior of the entire cluster. Should the cluster fail now, you would need to restart from LGE.
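
    A rough way to find the table with overly fine partitioning is to count distinct partition keys per projection. This is a sketch that assumes the PARTITIONS system table with its usual columns (table_schema, projection_name, partition_key); it only sees data that is already in ROS.

    ```sql
    -- Projections with the most distinct partition keys come first.
    SELECT table_schema,
           projection_name,
           COUNT(DISTINCT partition_key) AS partition_count
    FROM partitions
    GROUP BY table_schema, projection_name
    ORDER BY partition_count DESC
    LIMIT 10;
    ```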

  • Thanks a lot!

     

    This is what I found in their log:

     

    EventCodeDescription: Timer Service Task Error
    ProblemDescription: threadShim: Too many data partitions
    DatabaseName: revolution
    Hostname: ...
    2015-06-06 12:10:21.034 TM Moveout:0x7fd1fc010c50 <ERROR> ... 54000/5060: Too many data partitions
    HINT: Verify that the table partitioning expression is correct
    LOCATION: handlePartitionKey, /scratch_a/release/vbuild/vertica/EE/Operators/DataTarget.cpp

     

     

    Apart from data stuck in WOS, can this error also prevent LGE from advancing?

  • If data cannot be moved from WOS to ROS, then you have a problem.

    Sometimes, to keep your cluster up and consistent, you need to drop the object (table) responsible for the hold and rebuild it so that LGE can advance.

    Too many partitions end up as too many containers, which means too much work for your database.

    A possible resolution for you:

    1. Create a new table with the right partitioning.
    2. Reload the data, either by exporting it to a CSV file and bulk-importing it, or with INSERT ... SELECT.
    3. Drop the original table.
    4. Rename the new table to the old table's name.

    After that, make sure your LGE is up to date.
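
    The steps above could look roughly like this. All object names (public.events, its columns, and the monthly partition expression) are hypothetical placeholders, and the DIRECT hint is an assumption about what suits your load size; adapt everything to the real table.

    ```sql
    -- 1. New table with coarser (monthly) partitioning. Hypothetical schema.
    CREATE TABLE public.events_new (
        event_id   INT NOT NULL,
        event_time TIMESTAMP NOT NULL,
        payload    VARCHAR(1000)
    )
    PARTITION BY EXTRACT(year FROM event_time) * 100 + EXTRACT(month FROM event_time);

    -- 2. Reload with INSERT ... SELECT; DIRECT writes straight to ROS,
    --    so the copy does not depend on moveout.
    INSERT /*+ DIRECT */ INTO public.events_new
    SELECT event_id, event_time, payload FROM public.events;
    COMMIT;

    -- 3. Drop the original table (compare row counts before doing this).
    DROP TABLE public.events CASCADE;

    -- 4. Rename the new table to the old name.
    ALTER TABLE public.events_new RENAME TO events;

    -- Afterwards, check that last_good_epoch has moved close to current_epoch.
    SELECT current_epoch, ahm_epoch, last_good_epoch FROM system;
    ```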

  • Adrian, thanks a lot.

    We will try your recommendations.
