Hunting for cause of high disk read rate

Sporadically, our 12-node Vertica 6.1.1 cluster hammers the disks, reading about 2 GB per second. Episodes last about two hours, and during them query rates are cut in half.

This chart shows recent occurrences.

[image]

Casual reading of vertica.log reveals no unusual queries, no recovery events or exceptional moveouts/mergeouts.

I am looking for troubleshooting techniques.

Comments

  • Hi!
    Casual reading of vertica.log reveals no unusual queries, no recovery events or exceptional moveouts/mergeouts.
    Still, it looks like some kind of oscillation; the first peak is close to 19:00, and the later peaks line up with it:
    27/12 - 19:00
    30/12 - 19:00
    07/01 - 19:00
    09/01 - 19:00

    If it's not some scheduled process, how do you explain the 19:00 timing? Can you apply an FFT to the data and tell us the period or frequency? I'm pretty sure it's some oscillation.
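    A rough sketch of that FFT idea in Python (the file name, column layout, and 60-second sampling interval below are assumptions -- export the charted metric however your monitoring allows and adjust accordingly):

    import csv
    import numpy as np

    SAMPLE_INTERVAL_SEC = 60.0              # assumed fixed sampling interval

    # hypothetical export of the charted metric: "timestamp,read_bytes_per_sec"
    with open("disk_reads.csv") as f:
        reads = np.array([float(row[1]) for row in csv.reader(f)])

    reads = reads - reads.mean()            # remove the DC component
    spectrum = np.abs(np.fft.rfft(reads))
    freqs = np.fft.rfftfreq(reads.size, d=SAMPLE_INTERVAL_SEC)

    peak_freq = freqs[np.argmax(spectrum[1:]) + 1]   # skip the zero-frequency bin
    print("dominant period: %.1f hours" % (1.0 / peak_freq / 3600.0))

    A sharp peak at a 24-hour period (or a multiple of it) would back up the scheduled-process theory.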
  • I might have found the problem. I noticed that the log file /opt/vertica/log/agent.log showed activity that coincided with the high disk IO charted above. In particular, POSTs were being logged in addition to the usual GETs.

    We don't use MC, so I stopped vertica_agent. There have been no high disk IO incidents in the three days since stopping it.
  • Still looks good.
    [image]
  • But is there a downside to disabling vertica_agent? Does vertica_agent perform valuable housekeeping that, if neglected, will degrade performance?
  • vertica_agent is only used by MC; disabling it should not cause any issues unless you use MC.

    You can start and stop it again to re-verify whether vertica_agent is causing these high disk IO incidents:
    /etc/init.d/vertica_agent start
    /etc/init.d/vertica_agent stop

    What activity was reported in the agent.log file?
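    If it helps, here is a minimal Python sketch that tallies GETs vs. POSTs per hour in agent.log, so the counts can be lined up against the IO chart (it assumes each request line carries the HTTP method and begins with a "YYYY-MM-DD HH..." timestamp -- adjust the slicing if the log format differs):

    from collections import Counter

    counts = Counter()
    with open("/opt/vertica/log/agent.log") as f:
        for line in f:
            for method in ("GET", "POST"):
                if method in line:
                    counts[(line[:13], method)] += 1   # bucket by "YYYY-MM-DD HH"

    for (hour, method), n in sorted(counts.items()):
        print("%s  %-4s %d" % (hour, method, n))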


  • Mostly GETs, but during incidents many POSTs appear.
  • The high disk IO reemerged. Resuming the hunt.

    [image]

  • To explore the oscillation theory, I shifted a cron job by 20 minutes; it had been running fifteen minutes before episodes occurred. The job runs a query that selects distinct rows from a large table and takes about 8 minutes. Six days later, no high disk IO events have occurred.

    I'm stumped as to why.
  • Hi!


    When does the automatic audit of database size occur? What time does your cron job run - 18:45? Is the blue line IO reads only?
    SELECT get_audit_time();
    An audit does massive IO reads on large databases.

    Some tips (in case it re-emerges):
    1. Increase the logging verbosity:
    select set_loglevel(6);    -- level 6 means trace
    2. blktrace - generates traces of the I/O traffic on block devices
    http://www.mimuw.edu.pl/~lichota/09-10/Optymalizacja-open-source/Materialy/10%20-%20Dysk/gelato_ICE0...
    http://prefetch.net/blog/index.php/2009/02/16/tracing-block-io-operations-on-linux-hosts-with-blktra...

    3. iotop - a simple top-like I/O monitor
    iotop should show you which process/thread is doing the massive IO reads.
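    A small Python wrapper sketch for the iotop idea, if you want a quick summary of the busiest readers (it needs root, and the column layout parsed below matches iotop's default batch output -- verify it on your nodes before trusting the numbers):

    import subprocess

    # 3 samples, 5 seconds apart, only processes actually doing IO, rates in kB/s
    out = subprocess.check_output(["iotop", "-b", "-o", "-k", "-n", "3", "-d", "5"])

    readers = {}
    for line in out.decode("utf-8", "replace").splitlines():
        fields = line.split()
        if not fields or not fields[0].isdigit():
            continue                            # skip headers and totals lines
        # expected layout: TID PRIO USER READ K/s WRITE K/s SWAPIN % IO % COMMAND...
        try:
            read_kbs = float(fields[3])
        except (ValueError, IndexError):
            continue
        command = " ".join(fields[11:])
        readers[command] = readers.get(command, 0.0) + read_kbs

    for command, kbs in sorted(readers.items(), key=lambda kv: -kv[1])[:10]:
        print("%10.1f kB/s  %s" % (kbs, command))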


    Regards.
  • Vertica itself is driving the high IO, as seen in this chart where blue is IO utilization and red is read IO/sec issued by Vertica. We are testing version 7 to see whether this behavior is eliminated.

    [image]
  • This query almost always appears before these episodes:
    at epoch latest select distinct d_id, i_id from some_table order by d_id, i_id;
    It usually finishes in under a minute, but during episodes it runs for over two hours, and when an episode occurs this query is already running.

    The table projections are ordered and segmented by d_id and i_id. The table has a little over 7T rows.
  • Hi!

    What about deletes on "some_table"? Delete vectors can slow a query down. The docs recommend no more than 10% deleted data; in my experience the practical threshold is even smaller.
    select * from delete_vectors ...
  • We don't delete due to the performance impact.
    vertica=> select * from delete_vectors;
     node_name | schema_name | projection_name | storage_type | dv_oid | storage_oid | deleted_row_count | used_bytes | start_epoch | end_epoch | is_sorted
    -----------+-------------+-----------------+--------------+--------+-------------+-------------------+------------+-------------+-----------+-----------
    (0 rows)
  • Are you on RHEL/CentOS 6.x? If so, you could use the 'perf' tool to review stack traces and see where time is being spent.
