when rows get removed physically from vertica
Please forgive the lenght of this post, but I wanted to include enough details.
I'm trying to get a better understanding of when rows get physcially removed
select schema_name, projection_name, sum(deleted_row_count)
from delete_vectors
group by schema_name, projection_name
order by 3 desc
limit 1;
schema_name | projection_name | sum
-------------+-------------------------------------------------+-----------
QIW_STAGE | TABLE1_b0 | 152534315
So, the table has undergone a lot of deletions, with the _b0 projection having 152 million deleted rows.vertica=> select count(*) from qiw_stage.table1
count
----------
19673884
And the table only has roughly 20 million rows !!
I understand that Vertica will never physically delete rows unless there were deleted BEFORE the current Ancient History Mark.
vertica=> select get_ahm_epoch();
get_ahm_epoch
---------------
2077987
vertica=> select get_current_epoch();
get_current_epoch
-------------------
2077988
vertica=> select get_ahm_time();
get_ahm_time
-------------------------------------------------
Current AHM Time: 2014-10-26 20:02:49.976049-04
So, the current epoch is 2077988, and the AHM is set to epoch 2077987, which corresponds to roughly 10 PM last night.
Also, I understand that when doing mergouts, Vertica will tend to choose the smallest containers to merge and very large containers may stick around for a long time before getting merged. Therefore, if you have a very large container that has a lot of deleted rows, it may never actually get the deleted rows purged. However, there is a parameter called PurgeMergeoutPercent. If I understand what this parameter does correctly, it causes the mergeout process to consider merging containers (even big ones) whenever the amount of deleted rows gets larger than the this percentage.
vertica=> select get_config_parameter('PurgeMergeoutPercent');
get_config_parameter----------------------
20
This is where I really get a little confused as to what is going on. I wanted to find out what percentage of deleted rows was in each storage container. The STORAGE_CONTAINERS table shows each ROS container. Included in this table is how many deleted rows are in each container.
vertica=> select node_name, projection_name, total_row_count, deleted_row_count, start_epoch, end_epoch
from storage_containers where projection_name = 'TABLE1;
node_name | total_row_count | deleted_row_count | start_epoch | end_epoch
------------------------+-----------------+-------------------+-------------+-----------
v_vertica_dev_node0001 | 165984 | 165984 | 1971366 | 1971366
v_vertica_dev_node0001 | 45536661 | 45536661 | 1968722 | 1971405
v_vertica_dev_node0001 | 166445 | 166445 | 1976713 | 1976713
v_vertica_dev_node0001 | 201824 | 201824 | 1976745 | 1976746
. .
v_vertica_dev_node0001 | 999299 | 0 | 2077509 | 2077514
v_vertica_dev_node0001 | 166445 | 0 | 2077515 | 2077515
v_vertica_dev_node0001 | 1333435 | 0 | 2077516 | 2077523
v_vertica_dev_node0001 | 1333430 | 0 | 2077524 | 2077531
v_vertica_dev_node0001 | 1332719 | 0 | 2077532 | 2077539
v_vertica_dev_node0001 | 1333065 | 0 | 2077540 | 2077547
v_vertica_dev_node0001 | 57827 | 0 | 2077548 | 2077548
...
If I’m reading this correctly, all containers have either deleted all rows or no rows. And for the first two containers listed (for example), all of the changes in this container happened LONG before the current epoch.
I know we can go through and manually purge these rows if we want to using the purge() functions, but if you use any of these on a partitioned projection, it removes the partitioning, so I’m hesitant to use this. I’m not even sure that having all these deleted rows around is a problem. I'm just getting the impression that these rows may never be removed unless I use one of the purge() functions. Any insight into this behavior would be appreciated.
Comments
Hm... Just to confirm, have you tried doing an explicit "select do_tm_task('mergeout');", to make sure that Vertica has attempted to do a mergeout?
My understanding is that purge() does not remove ordinary table partitioning. It might, however, undo custom partition-level manipulations such as merge_partitions(). (Several such partition-level functions are currently deprecated -- yes, they do have their uses; but the current APIs have so many gotcha's like the one above that they tend to cause more headaches than they are worth. See the Vertica documentation for details.)
Adam