Vertica new cluster running slow

We just got a new cluster, suppose will be much faster than old one. but it was much slower than old one.
I have checked a few system tables and did not find anything.
I noticed even using "Copy " command or "Insert /*+ direct */ into tab a (select * from tab b)" are slow as well.
anything hint for this? much appreciate!!

Tagged:

Best Answers

  • moshegmosheg Vertica Employee Administrator
    edited May 2020 Answer ✓

    The spill can happen on both environments, however the time it takes until it will finish the first plan of the query on a system with more memory might be longer.
    It is advised to add a SPILL hint to the specific relevant query or better tune the query or JOIN to not exceed the query budget.

Answers

  • sahil_kumarsahil_kumar Employee
    edited May 2020

    Any other software, Hardware upgrade, Any changes in the ETL or reporting queries etc

    • Check if any of the system resource (memory,cpu,network or disk IO) saturated.
      select * from memory_usage order by average_memory_usage_percent desc ;
      select * from cpu_usage order by average_cpu_usage_percent desc ;
      select * from network_usage order by (rx_kbytes_per_second+tx_kbytes_per_second) desc ;
      select * from io_usage order by (read_kbytes_per_second+written_kbytes_per_second) desc ;
  • thanks, Kumar
    memory and cpu are 20%, it should be fine.
    network_usage top node around 74M/sec (tx) and 84M/sec (rx)
    io_usage top node around: 30kb/sec for read and 169M/sec for written
    are these metrics look normal?

  • another fact is: new nodes are 9x and old nodes are 8x. are there any configurations will be different setting between two versions?
    I suspected maybe some configuration settings were not correct. I was an fan of Vertica, fast,powerful and easy to use.
    very appreciate your feed back.

  • moshegmosheg Vertica Employee Administrator

    If your new cluster nodes have more memory, in case where V pick very large fact on inner,
    and force spill did not happen, the first try can take much longer because there is more memory until it will spill on the second try.
    Try to force join spill to occur on the first plan of the query, thus avoiding a replan.

    -- Here’s a query that will determine all spills in the last 24hrs:
    SELECT 
        time, 
        transaction_id, 
        statement_id, 
        request_id, 
        event_type, 
        event_description 
    FROM dc_execution_engine_events 
    WHERE (NOW() - time) < '24 hours' 
        AND event_type ILIKE '%SPILL%' 
    ORDER BY time DESC;
    
                 time              |  transaction_id   | statement_id | request_id |     event_type      |                           event_description             
    -------------------------------+-------------------+--------------+------------+---------------------+-----------------------------------------------------------------------
     2020-05-19 11:28:27.776255-04 | 45035996273713491 |            1 |          0 | GROUP_BY_SPILLED    | GROUP BY key set did not fit in memory, using external sort grouping.
     2020-05-19 11:28:27.775592-04 | 45035996273713491 |            1 |          0 | GROUP_BY_SPILLED    | GROUP BY key set did not fit in memory, using external sort grouping.
     2020-05-19 11:28:27.77525-04  | 45035996273713491 |            1 |          0 | GROUP_BY_SPILLED    | GROUP BY key set did not fit in memory, using external sort grouping.
     2020-05-19 10:27:16.202986-04 | 45035996273713284 |            1 |          1 | SYSQUERY_POOL_SPILL | Sysquery Memory Full; spilled to temp ROS files
     2020-05-19 10:26:51.825093-04 | 45035996273713282 |            1 |          1 | SYSQUERY_POOL_SPILL | Sysquery Memory Full; spilled to temp ROS files
    (5 rows)
    
  • thanks, Kumar
    memory and cpu are 20%, it should be fine.
    network_usage top node around 74M/sec (tx) and 84M/sec (rx)
    io_usage top node around: 30kb/sec for read and 169M/sec for written
    are these metrics look normal? > @mosheg said:

    If your new cluster nodes have more memory, in case where V pick very large fact on inner,
    and force spill did not happen, the first try can take much longer because there is more memory until it will spill on the second try.
    Try to force join spill to occur on the first plan of the query, thus avoiding a replan.

    -- Here’s a query that will determine all spills in the last 24hrs:
    SELECT
    time,
    transaction_id,
    statement_id,
    request_id,
    event_type,
    event_description
    FROM dc_execution_engine_events
    WHERE (NOW() - time) < '24 hours'
    AND event_type ILIKE '%SPILL%'
    ORDER BY time DESC;

    time | transaction_id | statement_id | request_id | event_type | event_description
    -------------------------------+-------------------+--------------+------------+---------------------+-----------------------------------------------------------------------
    2020-05-19 11:28:27.776255-04 | 45035996273713491 | 1 | 0 | GROUP_BY_SPILLED | GROUP BY key set did not fit in memory, using external sort grouping.
    2020-05-19 11:28:27.775592-04 | 45035996273713491 | 1 | 0 | GROUP_BY_SPILLED | GROUP BY key set did not fit in memory, using external sort grouping.
    2020-05-19 11:28:27.77525-04 | 45035996273713491 | 1 | 0 | GROUP_BY_SPILLED | GROUP BY key set did not fit in memory, using external sort grouping.
    2020-05-19 10:27:16.202986-04 | 45035996273713284 | 1 | 1 | SYSQUERY_POOL_SPILL | Sysquery Memory Full; spilled to temp ROS files
    2020-05-19 10:26:51.825093-04 | 45035996273713282 | 1 | 1 | SYSQUERY_POOL_SPILL | Sysquery Memory Full; spilled to temp ROS files
    (5 rows)

    yes, new cluster has much more memory than old one.
    I saw a lots records for this query(16530 rows), how to resolve this kind issues?

    2020-05-19 09:23:56 49539595903924528 1 0 GROUP_BY_SPILLED GROUP BY key set did not fit in memory, using external sort grouping.
    2020-05-19 09:21:34 49539595903923862 1 0 GROUP_BY_SPILLED GROUP BY key set did not fit in memory, using external sort grouping.
    2020-05-19 09:13:45 49539595903920838 1 0 GROUP_BY_SPILLED GROUP BY key set did not fit in memory, using external sort grouping.
    2020-05-19 09:13:42 49539595903920838 1 0 GROUP_BY_SPILLED GROUP BY key set did not fit in memory, using external sort grouping.
    2020-05-19 06:34:27 126100789567738258 1 0 GROUP_BY_SPILLED GROUP BY key set did not fit in memory, using external sort grouping.
    2020-05-19 06:34:27 126100789567738258 1 0 GROUP_BY_SPILLED GROUP BY key set did not fit in memory, using external sort grouping.
    2020-05-19 06:34:27 126100789567738258 1 0 GROUP_BY_SPILLED GROUP BY key set did not fit in memory, using external sort grouping.
    2020-05-19 06:34:27 126100789567738258 1 0 GROUP_BY_SPILLED GROUP BY key set did not fit in memory, using external sort grouping.

  • I found out how to force join spill , do you think we need do the sql statement level or database level?

  • davdsdavds
    edited May 2020

    I will try this out

  • thanks, it makes senses now and I will try it.
    but slow performance was also on copy data to stage table sql like:
    select distinct column_a, column_b,column_c from tab;
    in this case, no join involved.
    any thoughts?

    thanks

  • [Deleted User][Deleted User] Administrator

    You could run linux tool "perf top" to identify functions were most of the time in spent and it may give you some hints. I would recommend opening support ticket and request support engineer get on a call with you to debug this.

  • has anyway to check configuration setting? I was thinking maybe some parameters were not setup properly.

  • select * from vs_configuration_parameters where current_value <> default_value;
    On config parameters changes check for the reason to modify any parameters default values. If there is no valid reason change it back to default.

  • thanks, Kumar. this is a very handy sql.
    all the parameters are defalut values. we found even simple sql like: select distinct a,b,c from table is slower than old one.
    new one has 32 nodes and old one only has 22 nodes. new clusters also has better hardware as well.
    Anything else we can check it quickly? CPU and memory only 20%.

  • Bryan_HBryan_H Vertica Employee Administrator

    Hi, one other check is resource pool status:
    select * from resource_pool_status;
    It's possible that a user-defined resource pool, or even TM, may have old settings for memory, number of threads etc. so check resource_pool_status and compare to resource pool settings. If the memory is always at max or not enough threads are in use (less than number of cores) in resource_pool_status, you might want to adjust memory and concurrency settings for the new nodes number of CPU and memory.
    Also, please check sysctl settings vm.dirty_ratio and vm.dirty_background_ratio since these may need tuning if there is more memory in the new boxes: https://www.vertica.com/kb/Tuning-Linux-Dirty-Data-Parameters-for-Vertica/Content/BestPractices/Tuning-Linux-Dirty-Data-Parameters-for-Vertica.htm

  • thanks, check resource_pool_status, we did not has user-defined resource pool. memory usage look ok.
    since this new cluster is a brand new cluster, we did not upgrade from old one. and there are more nodes and more powerful hardware.
    it should be faster...is it possible more nodes caused more data transmit?
    I did select * from table, it comes back quick. but select distinct a,b,c from table, it became slow. since sort involved?
    any thought?
    thanks

  • -Run “select 1” and other small queries to test system performance and where the txn spending more time.

    \timing
    profile select 1;
    select node_name,transaction_id,statement_id,execution_step,(completion_time - time) duration from scrutin. dc_query_executions where transaction_id=;

    • Run analyze_workload() and review the tunning_recommendations table for any recommendations.

    select analyze_workload(''); and check in tunning_recommendations table
    select time,node_name,observation_description,tuning_type,tuning_parameter,tuning_description from dc_tuning_recommendations order by time desc;

    • Check if any node is slower consistently compared to other nodes in the cluster by running a similar query like select 1;

    select 1;
    \timing
    select
    last_stmt.time as rollback_time,
    txn_end.node_name as node_name,
    txn_end.time - last_stmt.time as total_delay,
    txn_end.end_begin_time - last_stmt.time as invoke_delay,
    txn_end.time - txn_end.end_begin_time as response_delay
    from
    v_internal.dc_transaction_ends as txn_end,
    (select transaction_id, time from v_internal.dc_requests_issued
    where node_name = local_node_name() and session_id = current_session() and transaction_id != current_trans_id()
    order by time desc
    limit 1) as last_stmt
    where last_stmt.transaction_id = txn_end.transaction_id
    order by total_delay desc;

  • thanks, I have followed all the steps. I do see tuning_recommandendation ask to run designer. I have run incremental designer and apply new projections, but it seems like not help.
    I also used your sql to check any slow nodes, but no nodes has noticeable slow response time.
    since I mentioned before, when I do distinct, it become slow.
    select distinct a,b,c from table;
    is it possible master node doing sort,merge are slow? how to check that?

  • LenoyJLenoyJ - Select Field - Employee

    since this new cluster is a brand new cluster, we did not upgrade from old one. and there are more nodes and more powerful hardware.

    When you mean 'powerful hardware', has your total core count gone up?

  • yes...more nodes with better hard ware. ( 28 core CPU and 768G memory). old system is much small than new one.

  • new cluster has 32 nodes vs old cluster has 22 nodes. how to check network performance between nodes? is it sound valid reason?

  • DaveTDaveT Employee

    I recommend taking a step back from the database and running the v*perf utilities if you have not already done that. Make sure that your environment is performant before you investigate the database.

  • moshegmosheg Vertica Employee Administrator

    1) Many health check indications are covered in Vertica Advisor report. As @skamat suggested it is advised to open a support case, create and upload Scrutinize file to Vertica S3 and receive the Advisor report.
    2) Verify all projections are distributed evenly between all cluster nodes:
    select projection_name, sum(used_bytes)/1024^3 as GB from projection_storage group by projection_name order by 2;
    3) In addition, compare the output of the following query with Warning and Critical Event Types listed here:
    https://www.vertica.com/docs/10.0.x/HTML/Content/Authoring/SQLReferenceManual/SystemTables/MONITOR/QUERY_EVENTS.htm

    select event_type,event_description, count(*)
    from query_events
    where ( NOW() - EVENT_TIMESTAMP) < '48 hour'
    group by 1,2
    order by 3 desc;
    

Leave a Comment

BoldItalicStrikethroughOrdered listUnordered list
Emoji
Image
Align leftAlign centerAlign rightToggle HTML viewToggle full pageToggle lights
Drop image/file