We're Moving!

The Vertica Forum is moving to a new OpenText Analytics Database (Vertica) Community.

Join us there to post discussion topics, learn about

product releases, share tips, access the blog, and much more.

Create My New Community Account Now


PROFILE vs EXPLAIN — Vertica Forum

PROFILE vs EXPLAIN

Hello,
I'm trying to assess the cost of a huge query, now when I run it with EXPLAIN, it shows
SELECT LIMIT 1K [Cost: 572M, Rows: 1K (NO STATISTICS)] (PATH ID: 30)
but when I run it with PROFILE, it shows
NOTICE 5077: Total memory required by query: [49320002 KB]
as you can see there's a huge difference in memory consumption, why do these commands generate two different outputs and what am I reading wrong here?

Comments

  • LenoyJLenoyJ - Select Field - Employee
    The 572M in your explain plan does NOT mean 572 MB of memory. It is a number that is generated based on many factors (row count, lowest I/O, CPU, memory, network etc.) Vertica uses a cost based optimizer to choose the best query plan with (generally) the lowest total cost.
    Read more here: https://www.vertica.com/docs/9.2.x/HTML/Content/Authoring/AdministratorsGuide/QueryManagement/QueryPlans/QueryPlanCostEstimation.htm

    PS: It is advisable to analyze statistics so that the optimizer can be more accurate in estimating costs and thereby generating better query plans. The plan states that the table has no statistics.
  • Cost is an interesting metric, but a large query might not be terribly costly, if you only run it one time. Often, the more "expensive" queries are the ones that are medium-sized which run all the time. Those are the ones which are often worth trying to tune, since the total cost savings might be much larger.

    I wrote a query to find them:

    most common queries in the system, but calculates a "cost" metric for the
    -- query, based on threads, memory, duration. Also includes the last transaction_id so you can see the full
    -- query if you want, and the range of the times it's ran from and to.
    -- Note: This query specifically targets just JOIN and GROUP BY SPILL events, because coincidentally those
    -- are the most expensive queries, but if you want to target a broader set of queries, you can remove that
    -- line (near the bottom)
    select REGEXP_REPLACE(REGEXP_REPLACE(substr(qr.request, 1, 500), '\d'), '\''[\S\s]?\''', '''?''' ) as query
    , max(qr.transaction_id) as last_trans_id
    , avg(qr.request_duration_ms)/1000/60 as avg_duration_mins
    , avg(ra.thread_count) as avg_threads
    , to_char(avg(memory_inuse_kb)/1024, '999,999') as avg_mem_mb
    , (sum(qr.request_duration_ms) * count(1))/1000 as total_duration_secs
    , max(eee.event_type) as event_type
    , trunc(min(qr.start_timestamp), 'MI') as first_run
    , trunc(max(qr.start_timestamp), 'MI') as last_run
    , (sum(qr.request_duration_ms) * count(1)) * sum(ra.thread_count) as total_cost
    , count(1) as num_runs
    from v_monitor.query_requests qr
    join (select transaction_id, statement_id, event_type
    from v_internal.dc_execution_engine_events
    where node_name = (select min(node_name)
    from v_catalog.nodes)
    ) eee using (transaction_id, statement_id)
    left join (select transaction_id, statement_id, thread_count, memory_inuse_kb, duration_ms
    from v_monitor.resource_acquisitions
    where node_name = (select min(node_name)
    from v_catalog.nodes)
    ) ra using (transaction_id, statement_id)
    where start_timestamp::date > current_date -1
    and not qr.is_executing
    and eee.event_type in ('GROUP_BY_SPILLED', 'JOIN_SPILLED')
    group by 1
    having count(
    ) > 3
    order by total_cost desc ;

Leave a Comment

BoldItalicStrikethroughOrdered listUnordered list
Emoji
Image
Align leftAlign centerAlign rightToggle HTML viewToggle full pageToggle lights
Drop image/file