Vertica cluster with unbalanced work load cross nodes
John_Y
✭
We have a vertica cluster with 5 nodes in our company, but one of the nodes (node 5), always has significant higher work load than other nodes, high IO, memory, #number of running queries, and number of threads are all much higher.
Please refer to the following figure, can someone help to explain why?
Thank you
0
Comments
For what I see in the graph you are also running more queries in NODE05 compared to the others. Do you have a load balancer?
With a query like this you can verify how many queries you run in each of the nodes.
select node_name, date_trunc('day',query_start::timestamp) day_query, count(*) from query_profiles group by 1,2 order by 2 desc,1 limit 100;
Check if you do have more in node5 and if so try to rebalance the workload between the nodes.
Hi John,
Emoreno's suggestion is a good way to start investigating. Just to modify it, please add "query_type" too in above SQL to see which type of query is not evenly load balances by your application/load_balancer.
In my experience, it is seen that certain type of load creates like "LOAD" events are evenly distributed but 1 application is only firing SELECT DML's to 1 node only which causes this imbalance.
Try using this SQL
select node_name, date_trunc('day',query_start::timestamp) day_query,query_type,count(*) from query_profiles group by 1,2,3 order by 2 desc,1 limit 100;
it has to be some Application issue which is connecting to database.
Regards,
Raghav Agrawal.
thank you very much for your help, those queries do help!
Glad to hear that.
Is your issue solved?? Have a similar issue, but concluded that a projection change is needed.
we were able to find some queries which hits directly to one of our nodes, and redirect them to the load balance, but we are still having the problem of high IO usage., the IO is almost 100% all the time, although CPU and memory are reasonable. Overall queries speed is super slow. Please take a look at the graph below
Can anybody suggest us some ways to determine the reason of problem?
Thank you.
Have you heard of the script vioperf? I wonder if that would help you diagnose what's going on. https://my.vertica.com/docs/8.1.x/HTML/index.htm#Authoring/InstallationGuide/scripts/vioperf.htm?Highlight=vioperf
Thank you, I just tested and here's the result, it looks like the throughput is really low? any way to dive deep into it?
@John_Y I would suggest, profile the queries the db has. Run a designer to get to see what projections you are missing and then Check the tuple mover operations
@John_Y , if you run vioperf while the database is running the numbers are going to be slow as vioperf simulates work load + the workload of the database. I recommend you to run it with the database down or not workload and/or involve your sysadmin or hardware people to tell you if the IO is slow.
You can monitor the active workload with SAR tools or iostat but there is nothing that you can do in Vertica, if the disk are slow you need to check with infrastructure as seems that you are ioboud.
Does make sense?
@ckotsidimos projection design for a high IO is not the right solution. There is way that you can reduce the storage with some type of encoding so you have less to read but those will pay in CPU cycles. You need to meet the Vertica recommendations in IO, if you don't meet that you will be IO bound. There is some info in the doc : https://my.vertica.com/kb/Redesigning-Projections-for-Query-Optimization/Content/BestPractices/Redesigning-Projections-for-Query-Optimization.htm
I am wondering why is IO, it is right or writes? On writes you can improve the performance if you reduce the number of projections for the same table but for read you may affect the query performance if you change the configuration just to reduce the storage foot print.
Hope that make sense.