Machines not being fully utilized

Hi, We're trying to lower the execution time of some of our queries, but are seeing that the 4 machines in our cluster barely break a sweat: CPU, RAM and disk usage is minimal. What should we check to determine why? ~ Jaka

Comments

  • Hi Jaka, Hm... Just to get a better sense, how big is your cluster? (Is this a 5-node cluster and just one is busy? Is it a 50-node cluster and most are busy?) Also, are these four nodes running different hardware than your other nodes? Vertica is intended to run on a cluster of computers with identical hardware; if these four are much more powerful, then they will be largely idle. Other things to check:
    - Are your nodes configured identically? Even if they have the same hardware, there are many software differences (often in particular hardware drivers and/or RAID configuration) that can cause machines to perform better or worse.
    - Are your tables segmented properly? Do you have a lot of data skew, either due to a custom segmentation expression (other than hash()) or due to having lots and lots of duplicate values? (A quick skew check is sketched below.)
    - Are you using these four nodes as query initiators? (Are you ever connecting to them directly with vsql, or with whatever program you're using to run the queries?) The query-initiator node does more work with some queries.
    - Are you loading a bunch of data? Are the data files that you're loading (or the incoming data streams, if over JDBC/ODBC/COPY LOCAL/etc.) evenly distributed among all of your nodes? Or are these nodes not being asked to parse incoming data?
    Adam
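    For example, a per-node row-count comparison (a sketch; it assumes the projection_storage system table with its anchor_table_name and row_count columns, and a hypothetical table name t):

        -- Rows stored on each node for one table; large differences indicate skew.
        SELECT node_name, SUM(row_count) AS rows_on_node
        FROM v_monitor.projection_storage
        WHERE anchor_table_name = 't'        -- hypothetical table name
        GROUP BY node_name
        ORDER BY node_name;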
  • Hi Adam, This is a cluster of 4 dual-hexacore nodes, each with 128GB of RAM and 4 attached drives. They are exactly the same, installed at the same time, ... I'm doing: SELECT count(*) FROM a JOIN b USING (x, y, z); Segmentation of both tables is by MODULARHASH(x, y), which distributes the rows very evenly (the rough shape of the setup is sketched below). This is a MERGE JOIN with PUSHED GROUPING. I am using these nodes as initiators, yes (connecting to them using vsql). Jaka
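    The rough shape of that setup, as a sketch (only the segmentation expression and the query come from the post; the column types and ORDER BY clauses are placeholders):

        -- Both tables segmented identically on a prefix of the join key,
        -- so the join can run node-locally, with no data movement.
        CREATE TABLE a (x INT NOT NULL, y INT NOT NULL, z INT NOT NULL)
            ORDER BY x, y, z
            SEGMENTED BY MODULARHASH(x, y) ALL NODES;
        CREATE TABLE b (x INT NOT NULL, y INT NOT NULL, z INT NOT NULL)
            ORDER BY x, y, z
            SEGMENTED BY MODULARHASH(x, y) ALL NODES;

        SELECT count(*) FROM a JOIN b USING (x, y, z);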
  • Hi Jaka, Ah, I'm sorry, I misread. It's not that just these four nodes are underutilized while other nodes are fully utilized; it's that the whole cluster isn't maxing out its system resources. In that case there's a fourth system resource that is important: is your network fully utilized? (Either by Vertica or by any other traffic on the network, if it's not a dedicated network.) Also, how long are these queries? Vertica is optimized for longer-running queries; if you're running lots of sub-second queries, Vertica may be spending all of its time planning them, and query planning in Vertica is not currently multi-threaded, so it will sit there using just one of all of those CPU cores. (A quick way to look at recent query durations is sketched below.) Adam
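    One way to see whether the workload is dominated by very short queries (a sketch; it assumes the query_profiles system table and its query_duration_us column are available in your version):

        -- Recent statements with their durations, slowest first.
        SELECT query, query_duration_us / 1000000.0 AS seconds
        FROM v_monitor.query_profiles
        ORDER BY query_duration_us DESC
        LIMIT 20;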
  • Hi Adam, I will check, but given that: 1. the tables are identically segmented by a subset of the join key, so no data needs to be distributed for the join, and 2. the aggregation is simply count(*), meaning each node can count locally and then send just its total to the initiator (I'm assuming this is what PUSHED GROUPING means?), I would assume there would be almost no network traffic, no?
  • Hi Jaka, Yes, in that case, I would expect there to be very little network traffic. (At least, traffic from Vertica. If there is other heavy traffic on the network, that can slow down Vertica's traffic, sometimes dramatically. Note that Vertica's control logic goes over UDP, for various reasons including better broadcast support. If you have other TCP-heavy protocols stressing the same network, some switches will starve UDP traffic as "lower priority", which can slow Vertica dramatically. Part of why we recommend a dedicated back-end network for Vertica -- if you have that, you should be good.) This does seem somewhat unlikely in this scenario, though, as you say. Speaking to the other idea in my last post, are your queries quite cheap / quick to execute? COUNT(*) queries often do run quite quickly. Adam
  • Also, sorry if this was the very first thing you checked, but: is there a bottleneck in the tool or tools issuing all of these queries? This also seems unlikely in your scenario. But we do periodically see cases where someone switched from another DBMS to Vertica; previously the other DBMS was the bottleneck, but Vertica is enough faster that it responds faster than the scripts in question can consume the data, so the bottleneck is now in those scripts or tools.
  • Hi Adam, I tested the query again to check network utilization, and it's not even nearly saturated. Generally, both disk and network are at 10-20% (as reported by iostat and pktstat), with occasional short spikes higher, even to 100%. To answer your questions:
    - This query is not short/quick... for a week of data it takes half a minute; for 4 months, a bit less than 10 minutes (so pretty linear).
    - I'm running this from vsql, as a single query returning 1 row, not from an application.
    Let me know if you have any other ideas for what to check. I'm also happy to give you access to the nodes and the exact query that I run, if you want to check things yourself. Jaka
  • Hi Jaka, Hm... Are you running only one query at a time on the cluster? Also, what version of Vertica are you running? Depending on your version, do you have local segmentation enabled? (And, if it was previously disabled or this is an upgraded DB, have you rebalanced the data to cause it to actually be stored segmented? A quick check is sketched below.) See "Local Data Segmentation" in the Administrator's Guide: https://my.vertica.com/docs/6.1.x/PDF/HP_Vertica_6.1.x_AdminGuide.pdf Certain individual queries within Vertica do not fully utilize the whole system. This is a known issue, and one that we are actively working on, so I would encourage you to make sure that you're running the latest version of Vertica. The typical symptom is that such a query bottlenecks on one or a few fully utilized CPU cores rather than using all of your cores. The workaround, if you can't upgrade and/or have a query that we have not yet addressed, is to run multiple queries at once. This will not make any single query go faster, but it will get more work done per unit time. Adam
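    A quick check of version and elastic-cluster state (a sketch, assuming the elastic_cluster system table described in the Administrator's Guide):

        SELECT version();
        -- Is local segmentation on, what is the scaling factor, and is a rebalance in flight?
        SELECT is_enabled, is_local_segment_enabled, scaling_factor, is_rebalance_running
        FROM elastic_cluster;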
  • Hi Adam, We're running v6.1.1-0. The is_local_segment_enabled column in elastic_cluster seems to be false. We have not added or removed nodes in this cluster, and the data seems to be balanced well. We'll try enabling local data segmentation, on the assumption that this is one of those queries that's not yet optimized, and see if it improves performance.
  • Hi Adam, Sorry for taking so long... We've enabled local segmentation (scaling factor 16), but the machines are still not being fully utilized :( Do you have any other suggestions for troubleshooting this? If you're willing, we can give you access to the cluster and you can "look around". (We've also tried support, but the progress we've made there is light years behind this thread.) Best, Jaka
  • Hi Jaka, Hm, for local segmentation, how long did it take you to enable it? (Seconds or hours?) There is a switch that you can flip that will enable local segmentation going forward. But, well: segmentation, as you probably know, is the mechanism by which we split up a table among all the nodes in a cluster. Local segmentation keeps going -- it further splits the data into a number of segments (16, in your case) on each node. This has a number of advantages; one of them is that, for appropriately parallelizable operations, it's trivial for us to do 16 of them at once on a node. That is, if the data is actually split. If you just flip the setting, that doesn't mean that existing files have been split up. Most likely most of your existing data is still in one monolithic segment. You'd have to explicitly re-segment (roughly the sequence sketched below). If it still doesn't help, then I'm not sure... It might help to see an example of your queries. I appreciate your offer to log into the cluster, but am reluctant to take you up on it -- that's Support's (and Professional Services' -- they might actually be a better fit for this issue?) job; there are many of you, many of them, but just one of me, and I do have a day job too :-) But stuff on the forums is good; others with the same problem may read it too. Adam
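    The rough sequence, as a sketch (it assumes the elastic-cluster functions from the Administrator's Guide; rebalancing rewrites storage, so expect it to take a while on a large database):

        -- Turn on local segmentation and pick a scaling factor...
        SELECT ENABLE_LOCAL_SEGMENTS();
        SELECT SET_SCALING_FACTOR(16);
        -- ...then actually rewrite the existing data into local segments.
        SELECT REBALANCE_CLUSTER();
        -- Confirm afterwards:
        SELECT is_local_segment_enabled, is_rebalance_running FROM elastic_cluster;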
  • Adam, It took several hours. It seems it did actually resegment correctly:

        dbadmin=> select * from elastic_cluster;
        scaling_factor           | 16
        maximum_skew_percent     | 15
        segment_layout           | v_celtra_node0001[25.0%] v_celtra_node0002[25.0%] v_celtra_node0003[25.0%] v_celtra_node0004[25.0%]
        local_segment_layout     | v_celtra_node0001[1.6%] v_celtra_node0001[1.6%] v_celtra_node0001[1.6%] v_celtra_node0001[1.6%] v_celtra_node0001[1.6%] v_celtra_node0001[1.6%] v_celtra_node0001[1.6%] v_celtra_node0001[1.6%] v_celtra_node0001[1.6%] v_celtra_node0001[1.6%] v_celtra_node0001[1.6%] v_celtra_node0001[1.6%] v_celtra_node0001[1.6%] v_celtra_node0001[1.6%] v_celtra_node0001[1.6%] v_celtra_node0001[1.6%]
                                   v_celtra_node0002[1.6%] v_celtra_node0002[1.6%] v_celtra_node0002[1.6%] v_celtra_node0002[1.6%] v_celtra_node0002[1.6%] v_celtra_node0002[1.6%] v_celtra_node0002[1.6%] v_celtra_node0002[1.6%] v_celtra_node0002[1.6%] v_celtra_node0002[1.6%] v_celtra_node0002[1.6%] v_celtra_node0002[1.6%] v_celtra_node0002[1.6%] v_celtra_node0002[1.6%] v_celtra_node0002[1.6%] v_celtra_node0002[1.6%]
                                   v_celtra_node0003[1.6%] v_celtra_node0003[1.6%] v_celtra_node0003[1.6%] v_celtra_node0003[1.6%] v_celtra_node0003[1.6%] v_celtra_node0003[1.6%] v_celtra_node0003[1.6%] v_celtra_node0003[1.6%] v_celtra_node0003[1.6%] v_celtra_node0003[1.6%] v_celtra_node0003[1.6%] v_celtra_node0003[1.6%] v_celtra_node0003[1.6%] v_celtra_node0003[1.6%] v_celtra_node0003[1.6%] v_celtra_node0003[1.6%]
                                   v_celtra_node0004[1.6%] v_celtra_node0004[1.6%] v_celtra_node0004[1.6%] v_celtra_node0004[1.6%] v_celtra_node0004[1.6%] v_celtra_node0004[1.6%] v_celtra_node0004[1.6%] v_celtra_node0004[1.6%] v_celtra_node0004[1.6%] v_celtra_node0004[1.6%] v_celtra_node0004[1.6%] v_celtra_node0004[1.6%] v_celtra_node0004[1.6%] v_celtra_node0004[1.6%] v_celtra_node0004[1.6%] v_celtra_node0004[1.6%]
        version                  | 39
        is_enabled               | t
        is_local_segment_enabled | t
        is_rebalance_running     | f

    We will keep interacting with support. Thanks for your help so far!
  • Hi Adam, Sorry to resurrect this old thread. We've been working with support for more than a month now and haven't been able to figure out why the cluster isn't CPU-, memory-, I/O- or network-bound. We're about to start dealing with Professional Services, but I just thought of a different approach that I wanted to quickly explore here with you. Even if I create the most trivial table and query (a runnable version at a smaller scale is sketched below):
    1. create table a (i int not null) segmented by modularhash(i) all nodes;
    2. fill it with lots of unique integers
    3. select count(*) from a t1 natural join a t2;
    the cluster will still use only ~3 out of 12 cores and negligible memory/disk/network. Doing something other than a join, e.g. select count(*) from a where i%4 = 0; uses all the CPU cores just fine and is also very fast. Here's the idea: are you able to help me create *any* JOIN query that would make the cluster sweat and hit some resource bottleneck? If so, I can then bisect, going step by step from this trivial example to our schema, data and queries, to see where the problem lies. Best, Jaka
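    A runnable version of that repro at a smaller scale (a sketch; the digits helper table, the cross-join trick for generating unique integers, and the direct-load hint are additions, not from the post -- add more cross-join copies for more rows):

        -- 1. The trivial table, segmented the same way as in the post.
        CREATE TABLE a (i INT NOT NULL) SEGMENTED BY MODULARHASH(i) ALL NODES;

        -- 2. Fill it with unique integers: cross-join a 10-row helper table
        --    with itself to generate 10^6 distinct values.
        CREATE TABLE digits (d INT NOT NULL) UNSEGMENTED ALL NODES;
        INSERT INTO digits SELECT 0 UNION ALL SELECT 1 UNION ALL SELECT 2 UNION ALL SELECT 3
            UNION ALL SELECT 4 UNION ALL SELECT 5 UNION ALL SELECT 6 UNION ALL SELECT 7
            UNION ALL SELECT 8 UNION ALL SELECT 9;
        INSERT /*+ direct */ INTO a
        SELECT d1.d + 10*d2.d + 100*d3.d + 1000*d4.d + 10000*d5.d + 100000*d6.d
        FROM digits d1, digits d2, digits d3, digits d4, digits d5, digits d6;
        COMMIT;

        -- 3. The self-join that leaves the cluster mostly idle...
        SELECT COUNT(*) FROM a t1 NATURAL JOIN a t2;
        -- ...versus a join-free query that does use all the cores.
        SELECT COUNT(*) FROM a WHERE i % 4 = 0;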
  • I should add that:
    - if I execute 1 query, I see ~300% CPU use and the query completes in 200s;
    - if I execute 2 identical queries, I see ~600% CPU use and they still complete in 200s.
    This, I think, proves there's no resource contention going on; it really is just inefficient use of the available resources.
  • We discovered a big flaw in our thinking so far. We'd assumed that the query *does* in fact scale to 3 out of the 12 cores of a node: that there are 3 similar threads doing the join, but for some reason not 12. This, however, is not the case: there is one thread pegged at 100%, and dozens of others that together add up to ~150-250%. There is indeed a single-threaded operation. execution_engine_profiles shows many Join operators, but only 4 of them (the number of nodes) and 4 StorageMerges take significant time (the query used to pull these numbers is sketched after the output):
    node_name | user_id | user_name | session_id | transaction_id | statement_id | plan_id | operator_name | operator_id | baseplan_id | path_id | localplan_id | activity_id | resource_id | counter_name | counter_tag | counter_value | is_executing
    v_celtra_node0004 | 45035996273704962 | dbadmin | node01-8669:0x399cf3 | 45035996277940580 | 1 | 430530961133153 | StorageMerge | 9 | 12 | 4 | 9 | 1004 | -1 | execution time (us) |  | 29734827 | f
    v_celtra_node0002 | 45035996273704962 | dbadmin | node01-8669:0x399cf3 | 45035996277940580 | 1 | 430530961133657 | StorageMerge | 9 | 12 | 4 | 9 | 1004 | -1 | execution time (us) |  | 29692425 | f
    v_celtra_node0003 | 45035996273704962 | dbadmin | node01-8669:0x399cf3 | 45035996277940580 | 1 | 430530961133742 | Join | 6 | 9 | 2 | 6 | 1008 | -1 | execution time (us) |  | 29680874 | f
    v_celtra_node0003 | 45035996273704962 | dbadmin | node01-8669:0x399cf3 | 45035996277940580 | 1 | 430530961133742 | StorageMerge | 9 | 12 | 4 | 9 | 1004 | -1 | execution time (us) |  | 29621926 | f
    v_celtra_node0001 | 45035996273704962 | dbadmin | node01-8669:0x399cf3 | 45035996277940580 | 1 | 430530961130567 | Join | 9 | 9 | 2 | 9 | 1008 | -1 | execution time (us) |  | 28242513 | f
    v_celtra_node0001 | 45035996273704962 | dbadmin | node01-8669:0x399cf3 | 45035996277940580 | 1 | 430530961130567 | StorageMerge | 12 | 12 | 4 | 12 | 1004 | -1 | execution time (us) |  | 28135167 | f
    v_celtra_node0002 | 45035996273704962 | dbadmin | node01-8669:0x399cf3 | 45035996277940580 | 1 | 430530961133657 | Join | 6 | 9 | 2 | 6 | 1008 | -1 | execution time (us) |  | 26426603 | f
    v_celtra_node0004 | 45035996273704962 | dbadmin | node01-8669:0x399cf3 | 45035996277940580 | 1 | 430530961133153 | Join | 6 | 9 | 2 | 6 | 1008 | -1 | execution time (us) |  | 25546181 | f
    v_celtra_node0001 | 45035996273704962 | dbadmin | node01-8669:0x399cf3 | 45035996277940580 | 1 | 430530961130567 | GroupByPipe | 11 | 11 | 4 | 11 | 1005 | -1 | execution time (us) |  | 9494058 | f
    v_celtra_node0002 | 45035996273704962 | dbadmin | node01-8669:0x399cf3 | 45035996277940580 | 1 | 430530961133657 | GroupByPipe | 8 | 11 | 4 | 8 | 1005 | -1 | execution time (us) |  | 9313600 | f
    v_celtra_node0003 | 45035996273704962 | dbadmin | node01-8669:0x399cf3 | 45035996277940580 | 1 | 430530961133742 | GroupByPipe | 8 | 11 | 4 | 8 | 1005 | -1 | execution time (us) |  | 9280676 | f
    v_celtra_node0004 | 45035996273704962 | dbadmin | node01-8669:0x399cf3 | 45035996277940580 | 1 | 430530961133153 | GroupByPipe | 8 | 11 | 4 | 8 | 1005 | -1 | execution time (us) |  | 9252188 | f
    v_celtra_node0003 | 45035996273704962 | dbadmin | node01-8669:0x399cf3 | 45035996277940580 | 1 | 430530961133742 | Join | 96 | 9 | 2 | 6 | 1008 | -1 | execution time (us) |  | 6930670 | f
    v_celtra_node0003 | 45035996273704962 | dbadmin | node01-8669:0x399cf3 | 45035996277940580 | 1 | 430530961133742 | Join | 104 | 9 | 2 | 6 | 1008 | -1 | execution time (us) |  | 6882578 | f
    v_celtra_node0003 | 45035996273704962 | dbadmin | node01-8669:0x399cf3 | 45035996277940580 | 1 | 430530961133742 | Join | 128 | 9 | 2 | 6 | 1008 | -1 | execution time (us) |  | 6867377 | f
    v_celtra_node0003 | 45035996273704962 | dbadmin | node01-8669:0x399cf3 | 45035996277940580 | 1 | 430530961133742 | Join | 92 | 9 | 2 | 6 | 1008 | -1 | execution time (us) |  | 6785428 | f
    v_celtra_node0003 | 45035996273704962 | dbadmin | node01-8669:0x399cf3 | 45035996277940580 | 1 | 430530961133742 | Join | 100 | 9 | 2 | 6 | 1008 | -1 | execution time (us) |  | 6739495 | f
    v_celtra_node0003 | 45035996273704962 | dbadmin | node01-8669:0x399cf3 | 45035996277940580 | 1 | 430530961133742 | Join | 132 | 9 | 2 | 6 | 1008 | -1 | execution time (us) |  | 6610228 | f
    v_celtra_node0001 | 45035996273704962 | dbadmin | node01-8669:0x399cf3 | 45035996277940580 | 1 | 430530961130567 | Join | 111 | 9 | 2 | 9 | 1008 | -1 | execution time (us) |  | 6595079 | f
    v_celtra_node0001 | 45035996273704962 | dbadmin | node01-8669:0x399cf3 | 45035996277940580 | 1 | 430530961130567 | Join | 91 | 9 | 2 | 9 | 1008 | -1 | execution time (us) |  | 6594897 | f
    v_celtra_node0004 | 45035996273704962 | dbadmin | node01-8669:0x399cf3 | 45035996277940580 | 1 | 430530961133153 | Join | 132 | 9 | 2 | 6 | 1008 | -1 | execution time (us) |  | 6572497 | f
    v_celtra_node0003 | 45035996273704962 | dbadmin | node01-8669:0x399cf3 | 45035996277940580 | 1 | 430530961133742 | Join | 84 | 9 | 2 | 6 | 1008 | -1 | execution time (us) |  | 6562806 | f
    v_celtra_node0003 | 45035996273704962 | dbadmin | node01-8669:0x399cf3 | 45035996277940580 | 1 | 430530961133742 | Join | 120 | 9 | 2 | 6 | 1008 | -1 | execution time (us) |  | 6542962 | f
    v_celtra_node0001 | 45035996273704962 | dbadmin | node01-8669:0x399cf3 | 45035996277940580 | 1 | 430530961130567 | Join | 99 | 9 | 2 | 9 | 1008 | -1 | execution time (us) |  | 6462549 | f
    v_celtra_node0001 | 45035996273704962 | dbadmin | node01-8669:0x399cf3 | 45035996277940580 | 1 | 430530961130567 | Join | 123 | 9 | 2 | 9 | 1008 | -1 | execution time (us) |  | 6443112 | f
    v_celtra_node0002 | 45035996273704962 | dbadmin | node01-8669:0x399cf3 | 45035996277940580 | 1 | 430530961133657 | Join | 96 | 9 | 2 | 6 | 1008 | -1 | execution time (us) |  | 6407782 | f
    v_celtra_node0001 | 45035996273704962 | dbadmin | node01-8669:0x399cf3 | 45035996277940580 | 1 | 430530961130567 | Join | 103 | 9 | 2 | 9 | 1008 | -1 | execution time (us) |  | 6388187 | f
    v_celtra_node0001 | 45035996273704962 | dbadmin | node01-8669:0x399cf3 | 45035996277940580 | 1 | 430530961130567 | Join | 115 | 9 | 2 | 9 | 1008 | -1 | execution time (us) |  | 6339533 | f
    v_celtra_node0002 | 45035996273704962 | dbadmin | node01-8669:0x399cf3 | 45035996277940580 | 1 | 430530961133657 | Join | 104 | 9 | 2 | 6 | 1008 | -1 | execution time (us) |  | 6329406 | f
    v_celtra_node0002 | 45035996273704962 | dbadmin | node01-8669:0x399cf3 | 45035996277940580 | 1 | 430530961133657 | Join | 128 | 9 | 2 | 6 | 1008 | -1 | execution time (us) |  | 6316513 | f
    v_celtra_node0004 | 45035996273704962 | dbadmin | node01-8669:0x399cf3 | 45035996277940580 | 1 | 430530961133153 | Join | 96 | 9 | 2 | 6 | 1008 | -1 | execution time (us) |  | 6310400 | f
    v_celtra_node0001 | 45035996273704962 | dbadmin | node01-8669:0x399cf3 | 45035996277940580 | 1 | 430530961130567 | Join | 131 | 9 | 2 | 9 | 1008 | -1 | execution time (us) |  | 6294010 | f
    v_celtra_node0002 | 45035996273704962 | dbadmin | node01-8669:0x399cf3 | 45035996277940580 | 1 | 430530961133657 | Join | 92 | 9 | 2 | 6 | 1008 | -1 | execution time (us) |  | 6289111 | f
    ...
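    The kind of query used to pull those numbers, as a sketch (the column names appear in the output above; the transaction and statement IDs are the ones from this particular run):

        -- Slowest operators for one statement, across all nodes.
        SELECT node_name, operator_name, path_id, counter_value AS exec_time_us
        FROM v_monitor.execution_engine_profiles
        WHERE counter_name = 'execution time (us)'
          AND transaction_id = 45035996277940580   -- from the output above
          AND statement_id = 1
        ORDER BY counter_value DESC
        LIMIT 20;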
  • Hey Jaka, Hm, now you've got me curious. There's something here I don't understand :-) So I've done some more digging. Here's another experiment for you: rather than creating just one table, could you create two tables? Same schema (just one integer); pk/fk relationship. But rather than both tables being huge, specifically make the pk table tiny. (I.e., a typical fact/dimension setup -- roughly as sketched below.) I believe you'll learn something from the experiment :-) Adam
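    That experiment, as a sketch (the table names and the choice to replicate the tiny dimension table with UNSEGMENTED are placeholders, not from the post):

        -- Tiny "pk" (dimension) table, replicated on every node.
        CREATE TABLE dim (i INT NOT NULL PRIMARY KEY) UNSEGMENTED ALL NODES;
        -- Huge "fk" (fact) table, segmented across the cluster.
        CREATE TABLE fact (i INT NOT NULL REFERENCES dim (i))
            SEGMENTED BY MODULARHASH(i) ALL NODES;

        -- The join to run while watching per-core CPU usage.
        SELECT COUNT(*) FROM fact JOIN dim USING (i);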
  • Hi Adam, My original self-join was over 10^9 rows. I created 10^8 facts for 10^4 dimensions. It's way faster, but this is to be expected, since there is much less (physical) data to work on (the fact table has few unique values). Then I created 10^10 facts for 10^8 dimensions, which is slower. CPU usage is lower (a single thread is still pegged, but that's it). One difference in execution_engine_profiles is that now the 4 Joins take the longest, no longer Joins *and* StorageMerges. I'm afraid the enlightenment eludes me :)
  • Hi Jaka, Sorry -- you're on the right track with 10^8 facts and 10^4 dimensions. Keep growing the fact table, though; a couple more orders of magnitude should do it :-) Make the data big enough that the query runs long enough for observing its characteristics mid-run to be meaningful. My particular observation is that, in this case, CPU utilization increases. (I would argue that the query also goes faster, but that's not meaningful since, as you say, the data is quite different :-) ) The underlying dilemma is that we parallelize across nodes by segmenting data, and until very recently we didn't provide the option to further segment data within a node. (Even now it's off by default.) Adam
  • Hi Adam, I'm now creating a table of 10^13 facts referencing 10^4 dimensions. This fact table has the same number of distinct values — 10^9 — as the original table I was self-joining. We'll see in a while... However, if you're referring to local segmentation, we've already tried that (and discussed it above in this thread), but it makes no difference.
  • Hi! Very interested in the conclusions of this research / troubleshooting. Were you able to find a way to speed things up?
  • No, but we did figure out the cause in the end: Vertica does not support intra-node parallelization of merge joins.
