Help - Optimizing DELETE performance - not understanding COMMENTS from projection_delete_concerns
Hi,
We are trying to optimize DELETE performance on a table with 300 columns that we batch load every 5 hours.
We ran EVALUATE_DELETE_PERFORMANCE and checked the comments in v_catalog.projection_delete_concerns, which read: "
Please see "Optimizing Deletes for Performance" in the Administrator's Guide. The squared number of rows matching each sort key is about 3326 on average."
We found no such topic in the guide, so we are back to square one.
Please help us understand this comment and what is needed to optimize DELETE performance.
Thanks,
Kiran
Comments
Hi Kiran,
The topic referenced in the message has been renamed to "Optimizing DELETEs and UPDATEs for Performance." That link goes to the topic in the HP Vertica 7.1 documentation. Searching for "Optimizing deletes" in other versions of the documentation should bring up the same topic.
The first note in the topic probably gives you the best advice. If possible, partition the table so that all of the rows you need to delete are kept in the same partition. Then it's just a matter of dropping table partitions to delete the rows. See Using Partitioned Tables for more information.
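For example, here is a minimal sketch of that approach; the table and column names are hypothetical, but the PARTITION BY expression is the common year*100+month pattern:

```sql
-- Hypothetical table partitioned by month, so a whole month's rows
-- can be removed by dropping one partition instead of running DELETE.
CREATE TABLE sales (
    sale_id   INT,
    sale_date DATE,
    amount    NUMERIC(12,2)
)
PARTITION BY (date_part('year', sale_date) * 100 + date_part('month', sale_date));
```

With this layout, every row loaded in a given month lands in the same partition key (e.g. 201201 for January 2012), which is what makes dropping that month's data cheap later on.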
Thanks Gray.
We have two dates in our table, CREATEDDATE (a one-time assignment) and LASTMODIFIEDDATE, and we use LASTMODIFIEDDATE to determine which record needs to be deleted before inserting the updated record.
We have defined the partitioning like this:
PARTITION BY (((date_part('year', LastModifiedDate) * 100) + date_part('month', LastModifiedDate)));
Can you suggest which partitioning is needed in this case?
My table has 300+ columns and its structure looks something like this:
EMPLOYEE (
EMPLOYEE_ID CHAR(18),
EMP_NUMBER NUMERIC(18,0),
EMP_CREATED_DATE TIMESTAMP,
EMP_LASTMODIFIED_DATE TIMESTAMP,
EMP_NAME VARCHAR(2000),
EMP_EMAIL VARCHAR(3000),
....
....
...
....
ETL_CREATED_DATE TIMESTAMP,
ETL_UPDATED_DATE TIMESTAMP
)
SEGMENTED BY HASH(EMPLOYEE_ID) ALL NODES KSAFE 2
PARTITION BY (((date_part('year', EMP_LASTMODIFIED_DATE ) * 100) + date_part('month', EMP_LASTMODIFIED_DATE )));
Hi Kiran,
Your partitioning scheme is fine, assuming you always want to delete data based on the last modified date. It groups all of the data modified in each month into its own storage containers, which is what makes deleting the data efficient.
Now all you need to do is use the DROP_PARTITION metafunction to remove the table partitions containing the old data. This is much more efficient than a DELETE query, because dropping a partition removes the data at the storage container level: HP Vertica simply deletes the storage containers holding that partition's data.
So, to remove all of the data with a last modified timestamp in January 2012, you would use this statement:
=> SELECT DROP_PARTITION('EMPLOYEES',201201);
You can see the partition keys defined in your EMPLOYEES table by querying the V_MONITOR.PARTITIONS system table:
=> SELECT partition_key, projection_name, node_name FROM v_monitor.partitions WHERE projection_name ILIKE 'EMPLOYEES%' ORDER BY partition_key;
partition_key | projection_name | node_name
---------------+-----------------+------------------
201202 | EMPLOYEES_b2 | v_vmart_node0001
201202 | EMPLOYEES_b1 | v_vmart_node0001
201202 | EMPLOYEES_b0 | v_vmart_node0001
201202 | EMPLOYEES_b2 | v_vmart_node0002
201202 | EMPLOYEES_b1 | v_vmart_node0002
201202 | EMPLOYEES_b0 | v_vmart_node0002
201202 | EMPLOYEES_b2 | v_vmart_node0003
201202 | EMPLOYEES_b1 | v_vmart_node0003
201202 | EMPLOYEES_b0 | v_vmart_node0003
201203 | EMPLOYEES_b0 | v_vmart_node0001
201203 | EMPLOYEES_b0 | v_vmart_node0002
201203 | EMPLOYEES_b0 | v_vmart_node0003
201203 | EMPLOYEES_b2 | v_vmart_node0001
201203 | EMPLOYEES_b1 | v_vmart_node0001
201203 | EMPLOYEES_b2 | v_vmart_node0002
. . .
Thanks Gray.
If I use DROP_PARTITION, I will lose all the records in that partition, but I only want to update one record. An UPDATE statement takes forever, so instead we DELETE the old record and INSERT the new one.
Unfortunately, DELETE performance is still really poor, even though we created the projections that Database Designer suggested.
Can you please help us?
Thanks,
Kiran
Hi Kiran,
If you cannot delete data by partition, then your best bet is to follow the recommendations in Optimizing DELETEs and UPDATEs for Performance. These include using a projection that contains just the columns you use in your UPDATE queries and, when possible, sorting the projections on the columns that appear in the WHERE clause of your DELETE or UPDATE statements.
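As a sketch of that recommendation (the projection name and the choice of sort columns are assumptions; adjust them to match the predicates your DELETE statements actually use):

```sql
-- Hypothetical DELETE-optimized projection: it carries only the columns
-- referenced in the DELETE/UPDATE predicates and sorts on them, so rows
-- matching a given EMP_LASTMODIFIED_DATE are contiguous and cheap to find.
CREATE PROJECTION employee_delete_opt AS
  SELECT EMPLOYEE_ID, EMP_LASTMODIFIED_DATE
  FROM EMPLOYEE
  ORDER BY EMP_LASTMODIFIED_DATE, EMPLOYEE_ID
  SEGMENTED BY HASH(EMPLOYEE_ID) ALL NODES KSAFE 2;
```

After creating a new projection, remember to refresh it (e.g. with START_REFRESH) and rerun EVALUATE_DELETE_PERFORMANCE to see whether the concern in projection_delete_concerns goes away.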