Watch those delete vectors!

Bryan_HBryan_H Vertica Employee Administrator

Vertica is very good at ingesting data, compressing it, and querying at high speed. The tradeoff here is that the data are stored in large block files called ROS containers. These containers can grow to large sizes, sometimes over 10 GB, and this makes it impractical to decompress and edit the files during updates and deletes.

As a tradeoff to make changes work in a timely fashion, Vertica records the deletes in a view called a delete vector, and when you query a table with deleted or changed data, Vertica reads the ROS containers for the table, replays the delete vector on the result set, and returns the filtered set.

This gets rather inefficient as delete vectors grow. Because rewriting large ROS containers is an intensive operation, Vertica doesn't usually do it automatically; instead, you should monitor the V_MONITOR.DELETE_VECTORS table for delete activity, especially if delete and update are regular operations.

Some workarounds can help:

  • Lower the threshold where Vertica auto-purges tables. This is 20% by default. The down side is that purge will be triggered at arbitrary times and could affect system performance.
  • Schedule a manual purge or mergeout to merge to update the ROS containers during idle time.
  • Use partitioning to delete older data. Partitions map to ROS containers, so dropping a partition is much more efficient, as it simply deletes the ROS containers for the partition. This can be much more efficient for managing data by date, for example.
  • Use a staging table to preserve or merge records, then swap tables with ALTER TABLE oldTable, newTable RENAME TO newTable, oldTable; then drop the old table. (Be sure to preserve projections in the new staging table! E.g., CREATE TABLE newTable LIKE oldTable INCLUDING PROJECTIONS;)

Purging deleted data or dropping partitions will also reduce the size of the ROS containers on disk, so you could reclaim significant amounts of disk space too.

Here's an example showing how delete is handled:

dbadmin=> CREATE TABLE delete_test (c INT);
CREATE TABLE

dbadmin=> INSERT INTO delete_test SELECT 1;
OUTPUT
--------
      1
(1 row)

dbadmin=> INSERT INTO delete_test SELECT 2;
OUTPUT
--------
      1
(1 row)

dbadmin=> INSERT INTO delete_test SELECT 3;
OUTPUT
--------
      1
(1 row)

If I run "SELECT * FROM delete_test", I would get 3 rows. This is how many rows are stored:

dbadmin=> select anchor_table_name, ros_row_count, ros_used_bytes from projection_storage where anchor_table_name = 'delete_test';
 anchor_table_name | ros_row_count | ros_used_bytes
-------------------+---------------+----------------
 delete_test       |             3 |             53
(1 row)

dbadmin=> delete from delete_test where c = 2;
 OUTPUT
--------
      1
(1 row)

If I run "SELECT * FROM delete_test" now, I would get 2 rows. However, this is how many rows are stored:

dbadmin=> select anchor_table_name, ros_row_count, ros_used_bytes from projection_storage where anchor_table_name = 'delete_test';
 anchor_table_name | ros_row_count | ros_used_bytes
-------------------+---------------+----------------
 delete_test       |             3 |             53
(1 row)

Wait! All three rows are still there! This is because Vertica stored the delete as a delete vector:

dbadmin=> SELECT projection_name, deleted_row_count FROM delete_vectors where projection_name like 'delete_test%';
  projection_name  | deleted_row_count
-------------------+-------------------
 delete_test_super |                 1
(1 row)

This delete vector is used to mask the deleted record. To get rid of the deleted record from ROS permanently, we purge the table:

dbadmin=> select PURGE_TABLE('delete_test');
                                        PURGE_TABLE
-------------------------------------------------------------------------------------------
 Task: purge operation
(Table: public.delete_test) (Projection: public.delete_test_super)

(1 row)

Now the delete vector and row are gone:

dbadmin=> SELECT projection_name, deleted_row_count FROM delete_vectors where projection_name like 'delete_test%';
 projection_name | deleted_row_count
-----------------+-------------------
(0 rows)

dbadmin=> select anchor_table_name, ros_row_count, ros_used_bytes from projection_storage where anchor_table_name = 'delete_test';
 anchor_table_name | ros_row_count | ros_used_bytes
-------------------+---------------+----------------
 delete_test       |             2 |             52
(1 row)

References:
https://www.vertica.com/docs/9.2.x/HTML/Content/Authoring/SQLReferenceManual/SystemTables/MONITOR/DELETE_VECTORS.htm
https://www.vertica.com/docs/9.2.x/HTML/Content/Authoring/AdministratorsGuide/BulkDeletingPurging/PurgingDeletedData.htm

Have fun!

Comments

  • Bryan, what is the number (count) of delete vectors that Vertica considers alarming for performance? We have 3 node cluster and the count of delete vectors grows sometimes to 4000.

  • Bryan_HBryan_H Vertica Employee Administrator

    Hi, I would say it's not a count so much as a percentage or data size that I worry about. As noted, the default auto-purge of 20% is often too high for tables with >1B rows, since this means you could query 1B rows then have to filter 50-100M deleted. Probably the biggest signs would be visible change in performance leading to user complaints, or you can watch the vertica.log or system tables for lots of DVROS events indicating a lot of processing time is going to delete processing.
    There's never a reason to leave them around though, and Vertica will probably clean up ~4000 delete vectors fairly quickly, so if you know of a time to schedule a purge job on a nightly or weekly basis, a simple cron script would work.

  • emorenoemoreno Employee

    Hi, The % of delete data to be purged automatically with the tuple mover is the % of rows of the ROS container that the delete rows belong, so not necessarily the whole table. What it means is if you have a big table partition by something, if you do delete in the latest partition, there is a big chance that the latest partition data will have small ROS containers what will be easier to reach the 20%. If your delete data from all the partitions, it may have big ROS containers in the inactive ( old ) partitions and those probably will not easily reach the 20% of delete rows to be qualified to automatically be purged.

    Maybe I created more confusion, the doc about best practices to delete data may be useful :
    https://www.vertica.com/kb/Best-Practices-for-Deleting-Data/Content/BestPractices/Best-Practices-for-Deleting-Data.htm

    If you have further questions please let us know.
    Eugenia

Sign In or Register to comment.