Same query executes 2 times in a row

Hi,
I'm experiencing a strange issue that I can't explain. Sometimes when an INSERT ... SELECT query gets executed from our ETL tool (Pentaho) I can see in v_monitor.query_requests that it for some reason tried to be executed 2 times in a row. First time memory_acquired_mb is usually very large (in this case success=False in v_monitor.query_requests) and the second time memory_acquired_mb is often significantly smaller (in this case success=True in v_monitor.query_requests).
My first guess was that engine tries to execute the query in memory, fails to do so and proceeds with the execution on disk. However, in this case I would expect to see events like GROUP_BY_SPILLED or JOIN_SPILLED, but I don't see these events.
Could there be other reasons for this behaviour?
I can't share an SQL query, because this issue appears for many queries we have in our ETL process - therefore I don't think it's query-specific. What I can say is that we're on Vertica 9.3.1 and we have a cluster of 3 nodes (if this information is useful for this question).
Best regards,
Vlad

Best Answer

  • moshegmosheg Vertica Employee Administrator
    edited December 2020 Answer ✓
    1. To detect a retry, PROFILE the query or check vertica.log
      If hash table cannot fit in memory, join must spill to disk
      Both the fact and dimension must spill to disk
      Query tries to fit in memory first; if it cannot fit, it retries automatically with join spill enabled
      If you know a query will spill, add hint to skip hashing to memory
      SELECT /+add_vertica_options(EE,ENABLE_JOIN_SPILL)/ …
      Loading into pre-join projections requires ENABLE_JOIN_SPILL set if join hash table exceeds memory.

    2. Try to see how much memory the query requires compared to the RP memory budget.
      E.g. SELECT pool_name, thread_count, open_file_handle_count, memory_inuse_kb, queue_entry_timestamp,
      acquisition_timestamp FROM V_MONITOR.RESOURCE_ACQUISITIONS WHERE node_name ILIKE '%node0001';

    3. Check how much memory the query really needs..
      create resource pool p_pool memorysize '1K' plannedconcurrency 4 maxconcurrency 4;
      set session resource_pool to p_pool;
      profile ... your query;
      If you don't have the memory usage printed (which may happen)
      select * from resource_acquisitions where transaction_id=... and statement_id=...;
      The trick: p_pool with very small memorysize.
      (if you run profile on general pool instead, it will use the query_budget instead on minimal memory needed)
      Of course p_pool is only for profiling.

    4. Run the following query to check your Vertica events status.
      Recommends actions one can take to resolve event issues are listed here.
      select event_type,event_description, count(*)
      from query_events
      where ( NOW() - EVENT_TIMESTAMP) < '24 hour'
      group by 1,2
      order by 3 desc;

Answers

  • Thank you, mosheg!
    This is a very detailed and useful answer. There's only one thing I don't understand in this context. In case of retry I expect to see some "SPILL" event (like GROUP_BY_SPILLED or JOIN_SPILLED) in v_monitor.query_events. However, I don't see any events for affected transactions.

  • Jim_KnicelyJim_Knicely Administrator

    You should be able to check the DC_REQUESTS_ISSUED and DC_REQUESTS_RETRIED Data Collector tables to find out which queries retried and why.

    Example:

    dbadmin=> CREATE RESOURCE POOL cascade_me CASCADE TO general RUNTIMECAP '1 Second';
    CREATE RESOURCE POOL
    
    dbadmin=> SET RESOURCE_POOL = cascade_me;
    SET
    
    dbadmin=> SELECT current_value, description FROM configuration_parameters WHERE parameter_name = 'CascadeResourcePoolAlwaysReplan';
     current_value |                                              description
    ---------------+--------------------------------------------------------------------------------------------------------
     1             | If true, cascade to a target resource pool will always lead to replanning the query on the target pool
    (1 row)
    
    dbadmin=> SELECT COUNT(*) FROM biggie;
      COUNT
    ---------
     8388608
    (1 row)
    
    dbadmin=> SELECT a.c FROM biggie a JOIN (SELECT c FROM biggie LIMIT 100) b USING (c) ORDER BY a.c DESC LIMIT 5;
     c
    ---
     A
     A
     A
     A
     A
    (5 rows)
    
    dbadmin=> SELECT transaction_id, statement_id, request FROM dc_requests_issued WHERE is_retry ORDER BY time DESC LIMIT 1;
      transaction_id   | statement_id |                                                request
    -------------------+--------------+-------------------------------------------------------------------------------------------------------
     45035996273908889 |            4 | SELECT a.c FROM biggie a JOIN (SELECT c FROM biggie LIMIT 100) b USING (c) ORDER BY a.c DESC LIMIT 5;
    (1 row)
    
    dbadmin=> SELECT transaction_id, statement_id, request, reason_for_retry FROM dc_requests_retried ORDER BY time DESC LIMIT 1;
      transaction_id   | statement_id |                                                request                                                |                               reason_for_retry
    -------------------+--------------+-------------------------------------------------------------------------------------------------------+------------------------------------------------------------------------------
     45035996273908889 |            4 | SELECT a.c FROM biggie a JOIN (SELECT c FROM biggie LIMIT 100) b USING (c) ORDER BY a.c DESC LIMIT 5; | Move Resource Pool failure: Execution time exceeded run time cap of 00:00:01
    (1 row)
    
  • Thanks @Jim_Knicely , I didn't know about dc_requests_issued and dc_requests_retried - that is a useful piece of information.

Leave a Comment

BoldItalicStrikethroughOrdered listUnordered list
Emoji
Image
Align leftAlign centerAlign rightToggle HTML viewToggle full pageToggle lights
Drop image/file