Unknown status from service script

Hi everyone,

We have a watchdog process that periodically executes "service verticad status" to determine whether Vertica seems to be running OK. I've noticed that Vertica occasionally responds with the status "unknown". According to the admintools log, it responds with "unknown" exactly 30 seconds after the status request comes in, so this appears to be a timeout. Vertica seems to still be running OK despite the bad status.

Possibly of note: this happened at the same time the vertica.log rolled. Also, the system is (too) heavily loaded and is having trouble responding to even simple queries in a reasonable amount of time, but that's another issue...

Has anyone seen a similar issue? Is there a better way to monitor Vertica's status from our watchdog?

Thanks!


    Hi Mark,

    Actually, the fact that Vertica feels heavily loaded and slow to respond, and the fact that "service verticad status" returns an "unknown" status, are very likely closely related.

    Vertica sends its cluster-status messages over UDP. It also sends its control messages (i.e., COMMIT, etc.) over a custom protocol that uses UDP. (In both cases, this is so that it can leverage UDP broadcast for efficiency.) Regular point-to-point data connections, however, go over TCP. When the network is under heavy load with a mix of traffic types, certain network switches will by default prioritize TCP heavily over UDP. So all our data gets through, but our control messages are dropped by the switch. This is exactly the opposite of what you want with Vertica: the rate of UDP throughput directly correlates with cluster responsiveness.

    So you may be able to fix both problems by tweaking the network settings of your switch. This varies a lot from device to device; I haven't done it much personally, so unfortunately I can't really tell you what to look for. But either your IT folks or your network-hardware vendor ought to know. QoS settings are a good first place to look.

    In the worst case, you can run Spread (the daemon process that actually does the communication over UDP) on a separate network. A separate physical network is best, but a separate VLAN on the same switch is usually almost as good. Either one can yield a dramatic improvement if you're unable to tune your switch otherwise and you are in fact seeing a lot of dropped UDP packets. Setting up Spread on a different network can be a little tricky; for Vertica EE, our Support folks can help you with it if necessary.

    In any case, if you just want a workaround for the timeout issue, you could create a minimally-privileged user account on your cluster, then have a script that runs vsql, logs in as that account, and runs some variant of "SELECT * FROM NODES;". The "NODES" system table lists all the nodes and tells you their status.

    Adam
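    For illustration, the vsql-based check suggested above might be sketched roughly as follows. This is only a sketch under stated assumptions, not a tested watchdog: the account name "watchdog", its password, the node_name and node_state column names, the "UP" state value, and the 30-second timeout are all placeholders that should be checked against your own cluster and Vertica version.

```python
import subprocess

# Hypothetical credentials for the minimally-privileged account; replace with your own.
# -A (unaligned), -t (rows only) and -F (field separator) keep the output easy to parse.
VSQL_CMD = [
    "vsql", "-U", "watchdog", "-w", "changeme",
    "-A", "-t", "-F", "|",
    "-c", "SELECT node_name, node_state FROM nodes;",  # column names are an assumption
]


def parse_node_states(output):
    """Turn 'name|state' lines into a dict mapping node name to upper-cased state."""
    states = {}
    for line in output.splitlines():
        line = line.strip()
        if not line:
            continue
        name, _, state = line.partition("|")
        states[name.strip()] = state.strip().upper()
    return states


def down_nodes(states):
    """Return the sorted names of nodes whose state is anything other than UP."""
    return sorted(name for name, state in states.items() if state != "UP")


def check_cluster(timeout_seconds=30):
    """Run the query; return the node states, or None if vsql failed or timed out."""
    try:
        result = subprocess.run(
            VSQL_CMD, capture_output=True, text=True, timeout=timeout_seconds
        )
    except (subprocess.TimeoutExpired, OSError):
        return None
    if result.returncode != 0:
        return None
    return parse_node_states(result.stdout)
```

    A watchdog could call check_cluster() on each pass, treat None as "could not determine status" (which is not the same as "down"), and raise an alert when down_nodes() returns a non-empty list.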
    Hi Adam,

    Thanks for the details, this is very useful. In the end, we really just need a reliable way of monitoring Vertica for use in our watchdog process. It sounds like we shouldn't base our watchdog on "service verticad status", since what it returns is really a "best effort" status. We can switch to "select * from nodes" as you suggested.

    Thanks!
    Mark
    Hi Mark,

    Glad that helped! For what it's worth, a warning: "service verticad status" is really only a best-effort status insofar as UDP is unreliable. (If you have UDP packets flowing properly, it will work as expected.)

    If UDP packets are being aggressively dropped by your network hardware, Vertica itself doesn't guarantee correct operation. If a node stops responding to control messages, Vertica will eventually assume that the node is bad, and the node will drop out of the cluster. (It'll try to re-join and recover as soon as the network is restored.) If the switch is dropping packets for all nodes, you may eventually see several nodes drop out, and the cluster may then do a safety shut-down while it waits for the network to start responding. (This is a known issue; we're working to address it, but at some level, if the switch is dropping our control messages, there's a limit to what we can do...)

    It sounds like this isn't an issue for you currently? But if you see nodes starting to drop out of the cluster when the cluster is under load, it's probably worth looking into.

    Adam
