Mergeout leaves behind small ROS containers

Hi. My table is partitioned by date, so each date is a new partition. During the load process, and for a few hours afterward, I found that mergeout does not merge all of my ROS containers into a single one; it stops merging a ROS container once its size reaches about 250 MB. However, when I run a manual mergeout using SELECT DO_TM_TASK('mergeout', 'BPM_TRANS_HR_S_90000_12'), the merge completes as expected. I would like to know whether this is normal behavior and, if not, what can be done. I am attaching the output of the query from the partition table. Thanks
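
A query along these lines (a sketch against the V_MONITOR.PARTITIONS system table; the exact column names may differ between Vertica versions, and the projection-name filter simply reuses the table name from this thread) lists how many ROS containers each partition of a projection currently has:

    -- Sketch: count ROS containers per partition for one table's projections.
    -- Assumes V_MONITOR.PARTITIONS exposes ros_row_count / ros_size_bytes.
    SELECT node_name,
           projection_name,
           partition_key,
           COUNT(*)            AS ros_container_count,
           SUM(ros_row_count)  AS total_rows,
           SUM(ros_size_bytes) AS total_bytes
    FROM   v_monitor.partitions
    WHERE  projection_name ILIKE 'BPM_TRANS_HR_S_90000_12%'
    GROUP  BY node_name, projection_name, partition_key
    ORDER  BY ros_container_count DESC;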

Comments

  • Any status?
  • Hm, I'm not sure whom Amelia is looking for, but in the meantime, if no one minds, I can give you a quick summary: this behavior is expected.

    Every time you mergeout data, Vertica has to read the entirety of all files being merged in order to combine the ROS files. If we always merged down to a single file, every mergeout would do a complete scan of your entire data set, even if you were only trying to merge in a single tiny ROS containing just one row. That's kinda silly, so we don't do it :-)

    Instead, we merge incrementally: every time mergeout runs, it takes some small files and merges them into larger files, or takes some larger files and merges them into even larger files, and so on. There are some parameters that tune when we merge stuff; you can read up on mergeout in the docs for more details. I think the exact algorithm is Vertica-proprietary, but if you look around, you'll find that a lot of the newer big-data systems do something roughly similar. (HBase has the notion of "compactions", for example, if you want an example that's rather different but is at least open source and well documented.)

    If you explicitly force a mergeout, then we will in fact merge all the way down to a single giant ROS (modulo partitioning, local segmentation, etc.). Very expensive, but sometimes it's what you want.
  • Thanks, this is good info. Based on it, I have a concern regarding the delete vectors that will be left behind until a full mergeout takes place -- I am talking about records that belong to the bigger ROS containers, which are not considered for merging by the light (incremental) mergeout. In addition, can you advise on the parameter that controls when a full mergeout takes place? I was not able to find it.
  • Hi Eli,

    Per my last post, Vertica will actually never automatically perform a full mergeout. Generally there's no need.

    Relatedly, it's not correct that delete vectors will be left behind until a full mergeout takes place. Delete vectors are also cleaned up during regular incremental mergeouts. They can also be cleaned up with an explicit PURGE command, which is somewhat (though in many cases not a lot) cheaper than a full mergeout.

    There's an additional constraint, though: delete vectors can only be cleaned up if they are older than the "Ancient History Mark", or "AHM". The AHM points to the oldest point in time for which Vertica must, for correctness, keep old data. (Node-down is a big and sometimes unexpected cause of this -- if a node is down, Vertica may decide to keep all DVs around until it comes back up, so that the corresponding DELETEs can be propagated to it.) A rough sketch of these checks and commands appears after the comments below.

    Here are some relevant doc links for you:
    https://my.vertica.com/docs/6.1.x/HTML/index.htm#14035.htm
    https://my.vertica.com/docs/6.1.x/HTML/index.htm#9931.htm
    https://my.vertica.com/docs/6.1.x/HTML/index.htm#14363.htm

    Adam
  • Thanks for the input, I think it was a great thread. As I see it, a few random deletes that touch big ROS containers can lead to a full incremental merge, and that can lead to performance degradation across the entire system. Is there any way to minimize it? Or should the ETL that performs the delete/update be scheduled for a low-traffic time? Thanks
  • Hi Eli,

    Glad you think this thread was helpful!

    For DELETEs -- I'm not sure what you mean? There's no such thing as a "full incremental merge". A few DELETEs on a big ROS container won't necessarily lead to an incremental merge of that container. Vertica prefers to merge away DELETEs during a mergeout that needs to happen anyway -- meaning not the case where you have one big ROS container, but the case where you have two big ROS containers that need merging regardless.

    All of these operations are performed by Vertica's Tuple Mover. If you're worried about the Tuple Mover slowing system performance, you can tune it; see the last link in my previous post (and the sketch after these comments).

    In general, though, DELETEs do slow Vertica down. If possible for your workload, we recommend that you not use DELETE at all during ETL. For many workloads, if you spend some time thinking about them, you can use TRUNCATE or DROP PARTITION to remove the data. Or, if you're removing a lot of arbitrary rows (a big fraction of your table), do an INSERT .. SELECT or CREATE TABLE AS SELECT that copies the data to a staging table and filters out the rows you wanted to remove. Then (possibly) use ALTER TABLE .. RENAME to swap the old table and the staging table, or MOVE_PARTITIONS_TO_TABLE to swap individual partitions. (A sketch of these patterns also follows the comments below.)

    Sometimes, though, you just plain need DELETE. In that case, you're probably not deleting very much data. Don't worry about it preemptively. Try it out and see how much it affects performance on your system. If it's a problem, then you can think about scheduling ETL or tuning the Tuple Mover, etc.

    Adam
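
For the delete-vector / AHM / PURGE discussion above, here is a rough sketch of the relevant checks and commands. It assumes the V_MONITOR.DELETE_VECTORS system table and the standard GET_AHM_EPOCH, GET_AHM_TIME, MAKE_AHM_NOW, and PURGE_TABLE functions; exact column names and safeguards can differ between Vertica versions, so treat it as an outline rather than a recipe:

    -- Sketch: see which projections are carrying delete vectors.
    -- (Column names in DELETE_VECTORS may vary by Vertica version.)
    SELECT node_name, schema_name, projection_name, deleted_row_count
    FROM   v_monitor.delete_vectors
    ORDER  BY deleted_row_count DESC;

    -- Where the Ancient History Mark currently sits:
    SELECT GET_AHM_EPOCH();
    SELECT GET_AHM_TIME();

    -- If all nodes are up and nothing still needs the old epochs, the AHM can be
    -- advanced so that older delete vectors become eligible for cleanup:
    SELECT MAKE_AHM_NOW();

    -- Explicitly purge deleted rows from one table (per Adam's note, usually
    -- cheaper than forcing a full mergeout):
    SELECT PURGE_TABLE('BPM_TRANS_HR_S_90000_12');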
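
Likewise, a sketch of the DELETE alternatives Adam describes. The table, partition, and column names below are made up for illustration, and the DROP_PARTITION / MOVE_PARTITIONS_TO_TABLE signatures should be checked against the docs for your version:

    -- Sketch: removing data without DELETE. All object names are illustrative.

    -- Drop an entire date partition (the table in this thread is partitioned by date):
    SELECT DROP_PARTITION('public.my_fact', '2013-05-01');

    -- Or copy only the rows you want to keep into a staging table, then swap names:
    CREATE TABLE my_fact_staging AS
        SELECT * FROM my_fact WHERE status <> 'obsolete';  -- hypothetical filter
    ALTER TABLE my_fact         RENAME TO my_fact_old;
    ALTER TABLE my_fact_staging RENAME TO my_fact;

    -- Or move individual partitions out of the table into a side table instead of
    -- deleting their rows:
    SELECT MOVE_PARTITIONS_TO_TABLE('public.my_fact', '2013-05-01', '2013-05-01',
                                    'public.my_fact_archive');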
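
Finally, if the concern is the Tuple Mover competing with query workloads, its built-in resource pool can be inspected and tuned. This is only a sketch, assuming the pool is named 'tm' and that PLANNEDCONCURRENCY / MAXCONCURRENCY are the knobs of interest; check the resource pool documentation for your version, since suitable values depend on your hardware:

    -- Sketch: look at the Tuple Mover's resource pool settings...
    SELECT name, memorysize, plannedconcurrency, maxconcurrency
    FROM   v_catalog.resource_pools
    WHERE  name = 'tm';

    -- ...and adjust its concurrency (values here are illustrative, not a recommendation):
    ALTER RESOURCE POOL tm PLANNEDCONCURRENCY 2 MAXCONCURRENCY 3;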
