
Hunting for the cause of a high disk read rate

Sporadically, our 12-node 6.1.1 cluster hammers the disks, reading about 2 GB per second. Episodes last about two hours, and during them query rates are cut in half.

This chart shows recent occurrences.

[image: chart of recent high disk read episodes]

A casual read of vertica.log reveals no unusual queries, no recovery events, and no exceptional moveouts or mergeouts.

I am looking for troubleshooting techniques.

Comments

  • Hi!
    You write that a casual read of vertica.log reveals no unusual queries, no recovery events, and no exceptional moveouts or mergeouts.
    Still, it looks like some oscillation: the first peak is close to 19:00, and each later peak lands near the same hour:
    27/12 - 19:00
    30/12 - 19:00
    07/01 - 19:00
    09/01 - 19:00

    If it's not a scheduled process, how do you explain the 19:00 hour? Can you apply an FFT to the data and tell us the period or frequency? I'm pretty sure it's some oscillation.
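    For a quick check, here is a minimal FFT sketch in Python. The file name read_rate.csv and the one-minute sampling interval are assumptions; export your read-rate samples however is convenient and adjust accordingly:

    import numpy as np

    # read_rate.csv (assumed name): one column of read bytes/sec, one sample per minute
    samples = np.loadtxt("read_rate.csv")
    sample_interval_min = 1.0

    detrended = samples - samples.mean()              # drop the DC component
    spectrum = np.abs(np.fft.rfft(detrended))
    freqs = np.fft.rfftfreq(len(detrended), d=sample_interval_min)

    peak_freq = freqs[np.argmax(spectrum[1:]) + 1]    # skip the zero-frequency bin
    print("dominant period: %.1f minutes" % (1.0 / peak_freq))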
  • I might have found the problem. I noticed that the log file /opt/vertica/log/agent.log showed activity that coincided with the high disk IO charted above. In particular, there were POSTs in addition to the GETs being logged.

    We don't use MC, so I stopped vertica_agent. There have been no high disk IO incidents for three days since stopping vertica_agent.
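    If anyone wants to reproduce the check, a rough Python sketch for counting GET vs. POST lines in agent.log could look like this (matching on " GET " / " POST " is an assumption about the log format, so adjust if yours differs):

    from collections import Counter

    counts = Counter()
    post_lines = []
    with open("/opt/vertica/log/agent.log") as log:
        for line in log:
            if " POST " in line:
                counts["POST"] += 1
                post_lines.append(line.rstrip())
            elif " GET " in line:
                counts["GET"] += 1

    print(counts)
    # print the last few POSTs to eyeball their timestamps against the IO chart
    for line in post_lines[-20:]:
        print(line)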
  • Still looks good.
    [image: disk IO chart]
  • But is there a downside to disabling vertica_agent? Does vertica_agent perform valuable housekeeping that, if neglected, will degrade performance?
  • Prasanta_Pal (Employee):
    vertica_agent is only used by MC; if you disable it, there should be no issue unless you use MC.

    You can start and stop it to re-verify whether vertica_agent is causing these high disk IO incidents:
    /etc/init.d/vertica_agent start
    /etc/init.d/vertica_agent stop

    What activity was reported in the agent.log file?


  • Mostly GETs, but during incidents many POSTs appear.
  • The high disk IO reemerged. Resuming the hunt.

    [image: disk IO chart]

  • To explore the oscillation theory, I shifted a cron job by 20 minutes; it had been running fifteen minutes before the episodes occurred. It runs a query that selects distinct rows from a large table and takes about 8 minutes. Six days later, no high disk IO events have occurred.

    I'm stumped as to why.
  • Hi!
    When does the automatic audit of database size occur? What time does your cron job run - 18:45? Is the blue color IO reads only?
    SELECT get_audit_time(); 
    An audit does massive IO reads on large databases.

    Some tips (in case it reemerges):
    1. Increase the logging verbosity:
    select set_loglevel(6);    -- level 6 means trace
    2. blktrace - generate traces of the i/o traffic on block devices
    http://www.mimuw.edu.pl/~lichota/09-10/Optymalizacja-open-source/Materialy/10%20-%20Dysk/gelato_ICE0...
    http://prefetch.net/blog/index.php/2009/02/16/tracing-block-io-operations-on-linux-hosts-with-blktra...

    3. iotop - a simple top-like I/O monitor
    iotop should show you which process or thread is doing the massive IO reads (a minimal /proc-based sketch of the same idea follows after this list).
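    If installing iotop is not an option, here is a minimal sketch of the same idea in Python, reading /proc/<pid>/io directly (Linux only; run as root so other users' processes are visible):

    import os, time

    def read_bytes_by_pid():
        """Return {pid: cumulative read_bytes} taken from /proc/<pid>/io."""
        stats = {}
        for pid in filter(str.isdigit, os.listdir("/proc")):
            try:
                with open("/proc/%s/io" % pid) as f:
                    for line in f:
                        if line.startswith("read_bytes:"):
                            stats[pid] = int(line.split()[1])
            except (IOError, OSError):
                pass                     # process exited or permission denied
        return stats

    before = read_bytes_by_pid()
    time.sleep(10)                       # sampling window in seconds
    after = read_bytes_by_pid()

    # top 10 readers during the window
    deltas = sorted(((after[p] - before.get(p, 0), p) for p in after), reverse=True)
    for delta, pid in deltas[:10]:
        try:
            with open("/proc/%s/comm" % pid) as f:
                name = f.read().strip()
        except (IOError, OSError):
            name = "?"
        print("%12d bytes read   pid %-6s %s" % (delta, pid, name))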


    Regards.
  • Vertica is driving the high IO, as seen in this chart, where blue is IO utilization and red is read IO/sec issued by Vertica. We are testing version 7 to see whether this behavior is eliminated.

    [image: chart of IO utilization (blue) vs. Vertica read IO/sec (red)]
  • This query almost always appears before these episodes:
    at epoch latest select distinct d_id, i_id from some_table order by d_id, i_id;
    This query usually finishes in under a minute, but during episodes it runs for over two hours. When an episode occurs, this query is already running.

    The table projections are ordered and segmented by d_id and i_id. The table has a little over 7T rows.
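    To pin down that overlap, here is a rough polling sketch using the vertica_python driver; the connection settings and the v_monitor.sessions / current_statement names are assumptions to verify against your version:

    import time
    import vertica_python

    # assumed connection settings - adjust for your cluster
    conn_info = {"host": "localhost", "port": 5433,
                 "user": "dbadmin", "password": "", "database": "mydb"}

    conn = vertica_python.connect(**conn_info)
    cur = conn.cursor()
    while True:
        # look for the suspect DISTINCT query among the active sessions
        cur.execute("SELECT session_id, current_statement "
                    "FROM v_monitor.sessions "
                    "WHERE current_statement ILIKE '%select distinct d_id, i_id%'")
        rows = cur.fetchall()
        if rows:
            print(time.strftime("%Y-%m-%d %H:%M:%S"),
                  "distinct query active in %d session(s)" % len(rows))
        time.sleep(60)   # log once a minute so the timeline lines up with the IO chart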
  • Hi!
    What about deletes on "some_table"? Delete vectors can slow a query. Per the docs, more than 10% deleted data is not recommended; in my experience the threshold is even smaller.
    select * from delete_vectors ...
  • We don't delete due to the performance impact.
    vertica=> select * from delete_vectors;
     node_name | schema_name | projection_name | storage_type | dv_oid | storage_oid | deleted_row_count | used_bytes | start_epoch | end_epoch | is_sorted
    -----------+-------------+-----------------+--------------+--------+-------------+-------------------+------------+-------------+-----------+-----------
    (0 rows)
  • Are you on RHEL/CentOS 6.x? If yes, you could use the 'perf' tool to review stack traces and see where time is being spent.
