Optimize projection for delete

Hi, I have a (big) table (23B rows right now) which is meant to store only 3 months of data, without duplicates. We thus run 2 delete statements daily or more: - delete where date < 90 days - delete duplicates I am trying to optimise those steps which are taking way too long. Info: Vertica 6.1.0, 5 nodes, k-safe=2 Table (with only the relevant columns displayed to limit the size of the text):
  CREATE TABLE analytics.set_pageview  (      visitorId int,      sessionId int,      pageInSession int,      pageType varchar(32),      siteId int,      channelId int,      pageviewDate date,     [14 extra rows]  );      CREATE PROJECTION analytics.set_pageview_date /*+createtype(L)*/  (   visitorId,   sessionId,   pageInSession,   pageType,   siteId ENCODING RLE,   channelId ENCODING RLE   pageviewDate ENCODING RLE,  [14 extra columns]  )  AS   SELECT set_pageview.visitorId,          set_pageview.sessionId,          set_pageview.pageInSession,          set_pageview.pageType,          set_pageview.siteId,          set_pageview.channelId,          set_pageview.pageviewDate,  [14 extra columns]   FROM analytics.set_pageview   ORDER BY set_pageview.pageviewDate,            set_pageview.visitorId,            set_pageview.sessionId,            set_pageview.pageInSession,            set_pageview.siteId,            set_pageview.channelId,            set_pageview.pageType  SEGMENTED BY hash(set_pageview.pageviewDate, set_pageview.channelId, set_pageview.siteId, set_pageview.pageType, set_pageview.pageInSession, set_pageview.visitorId, set_pageview.sessionId) ALL NODES KSAFE 2;  
As you can see, I forced the RLE on a few columns which I know are low cardinality (less than 1000 rows, 90 only for page date). Then the 2 statements I am running are:
   explain delete  from analytics.set_pageview WHERE pageviewDate < CURRENT_DATE() - 90;     Access Path:   +-DML DELETE [Cost: 0, Rows: 0]   |  Target Projection: analytics.set_pageview_gui_date_b2   |  Target Projection: analytics.set_pageview_gui_date_b1   |  Target Projection: analytics.set_pageview_gui_date_b0   |  Target Prep:   |  Execute on: All Nodes   | +---> STORAGE ACCESS for <NO alias=""> [Cost: 8M, Rows: 921M (NO STATISTICS)] (PATH ID: 1)   | |      Projection: analytics.set_pageview_gui_date_b1   | |      Materialize: set_pageview_gui.pageviewDate, set_pageview_gui.visitorId, set_pageview_gui.sessionId, set_pageview_gui.pageInSession, set_pageview_gui.siteId, set_pageview_gui.channelId, set_pageview_gui.pageType, set_pageview_gui.globalUserId, set_pageview_gui.portalVersion, set_pageview_gui.pageId, set_pageview_gui.pageTypeDetail, set_pageview_gui.pageviewDts, set_pageview_gui.referrerTag, set_pageview_gui.campaignMedium, set_pageview_gui.campaignSource, set_pageview_gui.campaignName, set_pageview_gui.campaignElement, set_pageview_gui.adwordsAdgroupId, set_pageview_gui.adwordsKeyword, set_pageview_gui.isAuthorized, set_pageview_gui.searchTerm, set_pageview_gui.timeOnPage, set_pageview_gui.epoch   | |      Filter: (set_pageview_gui.pageviewDate < (('now()')::date - 90))   | |      Execute on: All Nodes  </NO>
Note that I just ran analyze_statistics, so I am not sure what the 'no statistics' means. Q: how can I have the filter applied before the materialisation? I though the projection was perfectly optimised for this case. This is the second delete statement:
  explain delete   from  analytics.set_pageview_gui   where (visitorId , sessionId, pageInSession, pageviewDts) in  (                                                                   select   visitorId, sessionId, pageInSession, min(pageviewDts)                                                                   from     analytics.set_pageview_gui                                                                   group by visitorId, sessionId, pageInSession                                                                   having   count(*) > 1                                                                  );     Access Path:   +-DML DELETE [Cost: 0, Rows: 0]   |  Target Projection: analytics.set_pageview_gui_date_b2   |  Target Projection: analytics.set_pageview_gui_date_b1   |  Target Projection: analytics.set_pageview_gui_date_b0   |  Target Prep:   |  Execute on: All Nodes   | +---> JOIN HASH [Semi] [Cost: 262M, Rows: 12B] (PATH ID: 1) Inner (BROADCAST)   | |      Join Cond: (set_pageview_gui.visitorId = VAL(2)) AND (set_pageview_gui.sessionId = VAL(2)) AND (set_pageview_gui.pageInSession = VAL(2)) AND (set_pageview_gui.pageviewDts = VAL(2))   | |      Materialize at Output: set_pageview_gui.globalUserId, set_pageview_gui.portalVersion, set_pageview_gui.pageId, set_pageview_gui.pageType, set_pageview_gui.pageTypeDetail, set_pageview_gui.referrerTag, set_pageview_gui.campaignMedium, set_pageview_gui.campaignSource, set_pageview_gui.campaignName, set_pageview_gui.campaignElement, set_pageview_gui.adwordsAdgroupId, set_pageview_gui.adwordsKeyword, set_pageview_gui.isAuthorized, set_pageview_gui.searchTerm, set_pageview_gui.siteId, set_pageview_gui.channelId, set_pageview_gui.timeOnPage, set_pageview_gui.pageviewDate, set_pageview_gui.epoch   | |      Execute on: All Nodes   | | +-- Outer -> STORAGE ACCESS for <NO alias=""> [Cost: 43M, Rows: 23B] (PATH ID: 2)   | | |      Projection: analytics.set_pageview_gui_date_b2   | | |      Materialize: set_pageview_gui.visitorId, set_pageview_gui.sessionId, set_pageview_gui.pageInSession, set_pageview_gui.pageviewDts   | | |      Execute on: All Nodes   | | |      Runtime Filters: (SIP1(HashJoin): set_pageview_gui.visitorId), (SIP2(HashJoin): set_pageview_gui.sessionId), (SIP3(HashJoin): set_pageview_gui.pageInSession), (SIP4(HashJoin): set_pageview_gui.pageviewDts), (SIP5(HashJoin): set_pageview_gui.visitorId, set_pageview_gui.sessionId, set_pageview_gui.pageInSession, set_pageview_gui.pageviewDts)   | | +-- Inner -> SELECT [Cost: 146M, Rows: 53M] (PATH ID: 3)   | | |      Execute on: All Nodes   | | | +---> GROUPBY HASH (GLOBAL RESEGMENT GROUPS) (LOCAL RESEGMENT GROUPS) [Cost: 146M, Rows: 53M] (PATH ID: 4)   | | | |      Aggregates: count(*), min(set_pageview_gui.pageviewDts)   | | | |      Group By: set_pageview_gui.visitorId, set_pageview_gui.sessionId, set_pageview_gui.pageInSession   | | | |      Filter: (<SVAR> > 1)   | | | |      Execute on: All Nodes   | | | | +---> STORAGE ACCESS for set_pageview_gui [Cost: 43M, Rows: 23B] (PATH ID: 5)   | | | | |      Projection: analytics.set_pageview_gui_date_b1   |  Target Projection: analytics.set_pageview_gui_date_b2   |  Target Projection: analytics.set_pageview_gui_date_b1   |  Target Projection: analytics.set_pageview_gui_date_b0   |  Target Prep:   |  Execute on: All Nodes   | +---> JOIN HASH [Semi] [Cost: 262M, Rows: 12B] (PATH ID: 1) Inner (BROADCAST)   | |      Join Cond: (set_pageview_gui.visitorId = VAL(2)) AND (set_pageview_gui.sessionId = VAL(2)) AND (set_pageview_gui.pageInSession = VAL(2)) AND (set_pageview_gui.pageviewDts = VAL(2))   | |      Materialize at Output: set_pageview_gui.globalUserId, set_pageview_gui.portalVersion, set_pageview_gui.pageId, set_pageview_gui.pageType, set_pageview_gui.pageTypeDetail, set_pageview_gui.referrerTag, set_pageview_gui.campaign  Medium, set_pageview_gui.campaignSource, set_pageview_gui.campaignName, set_pageview_gui.campaignElement, set_pageview_gui.adwordsAdgroupId, set_pageview_gui.adwordsKeyword, set_pageview_gui.isAuthorized, set_pageview_gui.searchTerm, se  t_pageview_gui.siteId, set_pageview_gui.channelId, set_pageview_gui.timeOnPage, set_pageview_gui.pageviewDate, set_pageview_gui.epoch   | |      Execute on: All Nodes   | | +-- Outer -> STORAGE ACCESS for <NO alias=""> [Cost: 43M, Rows: 23B] (PATH ID: 2)   | | |      Projection: analytics.set_pageview_gui_date_b2   | | |      Materialize: set_pageview_gui.visitorId, set_pageview_gui.sessionId, set_pageview_gui.pageInSession, set_pageview_gui.pageviewDts   | | |      Execute on: All Nodes   | | |      Runtime Filters: (SIP1(HashJoin): set_pageview_gui.visitorId), (SIP2(HashJoin): set_pageview_gui.sessionId), (SIP3(HashJoin): set_pageview_gui.pageInSession), (SIP4(HashJoin): set_pageview_gui.pageviewDts), (SIP5(HashJoin):  set_pageview_gui.visitorId, set_pageview_gui.sessionId, set_pageview_gui.pageInSession, set_pageview_gui.pageviewDts)   | | +-- Inner -> SELECT [Cost: 146M, Rows: 53M] (PATH ID: 3)   | | |      Execute on: All Nodes   | | | +---> GROUPBY HASH (GLOBAL RESEGMENT GROUPS) (LOCAL RESEGMENT GROUPS) [Cost: 146M, Rows: 53M] (PATH ID: 4)   | | | |      Aggregates: count(*), min(set_pageview_gui.pageviewDts)   | | | |      Group By: set_pageview_gui.visitorId, set_pageview_gui.sessionId, set_pageview_gui.pageInSession   | | | |      Filter: (<SVAR> > 1)   | | | |      Execute on: All Nodes   | | | | +---> STORAGE ACCESS for set_pageview_gui [Cost: 43M, Rows: 23B] (PATH ID: 5)   | | | | |      Projection: analytics.set_pageview_gui_date_b1   | | | | |      Materialize: set_pageview_gui.visitorId, set_pageview_gui.sessionId, set_pageview_gui.pageInSession, set_pageview_gui.pageviewDts   | | | | |      Execute on: All Nodes  </SVAR></NO></SVAR></NO>
I wonder how to optimise the deletes for those 2 statements together. Thanks for any help,

Comments

  • Hi To do delete where date < 90 days; and if you only keep 3 month of data, I will recommend you to partition the table by date and do just a drop partition. that should be almost instantly. To do delete duplicates You have to warranty that you will have the key that identify as duplicated in the on the order by of the projection. If you have this partition, and if you have too many duplicated, sometimes it can even be faster to create a temp table with the no duplicated records of the partition. Drop the partition and do a insert from that temp table. Does make sense? The other thing that I don't see the point is the k-safe =2. In version 4.1 and older it was important so you can lose 2 nodes in the cluster. in Version 5 or newer a new concept was introduced, data safety. This means that your cluster can continue working with (n/2)+1 nodes, so in 5 nodes you can continue working with 3 nodes even if you are k-safe =1. If you are k-safe=2 the only thing that changes is the nodes that you can lose, not the number, but you are paying the price of storage, load ( need to load one more projection) and deletes (need to delete one more projection too). So please review if you really need k-safe=2. Hope this is clear and helps, Thanks, Eugenia
  • Hi Eugenia, Thanks a lot! I completely overlooked the partitions and it should help enormously. Sorry for asking theoritical questions, but I am trying to gather some information during your office hours and before mine :) Delete < 90 days Partitioning by pageviewDate is an obvious choice. Would it be wise to use this field as-is and thus end up with about 90 partitions? Delete duplicates I can garanty that all columns from the key are in the ORDER BY clause from the projection. To properly optimise for both deletes, would it be wise to - partition by pageviewDate (for the < 90 days) - segment and order the projection by all the unique keys, without adding pageviewDate neither in the order by, neither in the segment by of the projection? Ksafety The ksafe 2 comes 'by default', probably because I set up SELECT MARK_DESIGN_KSAFE(2) a long time ago. Eg.:
      CREATE table sand_guillaume.ksafe (id INT, stuff VARCHAR(42));  INSERT INTO guillaume.ksafe VALUES (1, 'yes');  COMMIT;  select export_objects('', 'sand_guillaume.ksafe');  CREATE TABLE sand_guillaume."ksafe"  (      id int,      stuff varchar(42)  );      CREATE PROJECTION sand_guillaume."ksafe" /*+createtype(L)*/   (   id,   stuff  )  AS   SELECT "ksafe".id,          "ksafe".stuff   FROM sand_guillaume."ksafe"   ORDER BY "ksafe".id,            "ksafe".stuff  SEGMENTED BY hash("ksafe".id, "ksafe".stuff) ALL NODES KSAFE 2;      SELECT MARK_DESIGN_KSAFE(2);  
    This comes from https://my.vertica.com/docs/6.1.x/HTML/index.htm#4865.htm If I understand well what you say, I should still set the k-safety, but 1 is enough? If later we go to 10 nodes instead of the current 5, we will be able to lose 4 nodes even with k-safe == 1 and still be able to work? Thanks again for your help!
  • Hi, To answer the question 1- To chose a partition key, it is important that you chose something that a- will help you to purge the data fast (delete < 90 ) perfect example b- Something that you can use in your predicate such as date = '05/30/2012' so you can do some partition pruning and your queries will be faster/ 2- To segment the table, chose a filed of high cardinality so you warranty that the data is distributed even between the nodes and skewed. Not sure why you mention date here. It should not be date. At difference when you partition you should chose a field of low cardinality in this case you can have 90 different values so it is a good choice. About the order by, DBD should do a good job, but the rule of thumb is that in the order by first put fields that are in your predicate, then in your join and then group by. If you partition by date, you could put date first in the order by, have it encoded as RLE and it will be just one value stored in disk. It should not affect performance. However, I insist let the DBD chose the right projections for your queries. It does a great job and you can put your delete statement there so DBD design projections considering them. 3- About k-safe: yes, it creates two because you have the select mark_design_ksafe(2). You can verify your fault tolerance by doing : select designed_fault_tolerance from system;. It should say 1 or 2. To go to k-safe = 1 select mark_desing_ksafe(1) and remove the extra projection. If you have a cluster with 10 nodes you could lose 4 and continue working. In the moment that you lose one node you can do select * from critical_nodes and will tell you what are the nodes that if you lose the database will go down (note that if all the nodes are UP this table will be empty). For example in a 10 nodes cluster you could lose node 1, 3, 5, 7 and Vertica will stay UP. Hope this help, All the best. Eugenia
  • Thanks a lot, I will try all this out.

Leave a Comment

BoldItalicStrikethroughOrdered listUnordered list
Emoji
Image
Align leftAlign centerAlign rightToggle HTML viewToggle full pageToggle lights
Drop image/file