Error: 'Could not send data to client: No such file or directory'
Hi all,
Recently we've been running into queries that fail for no apparent reason with the error message 'Could not send data to client: No such file or directory'. The logs show the following:
$ cat vertica.log | grep 501c8bff
2019-01-30 13:23:54.849 Init Session:7f0a70ff9700-a00000501c8bff [Txn] <INFO> Begin Txn: a00000501c8bff 'SELECT article_1,
2019-01-30 13:24:04.916 Init Session:7f0a70ff9700-a00000501c8bff [EE] <INFO> Query runtime exceeds limit, canceling
2019-01-30 13:28:54.924 Init Session:7f0a70ff9700-a00000501c8bff [EE] <INFO> Query runtime exceeds limit, canceling
[a00000501c8bff,1] - Queries:1,Threads:29,File Handles:104,Memory(KB):1102659,
2019-01-30 13:58:54.025 EEThread:7f09a0cca700-a00000501c8bff <LOG> @v_node0001: VX001/2907: Could not send data to client: No such file or directory
2019-01-30 13:58:57.444 Init Session:7f0a70ff9700-a00000501c8bff [EE] <INFO> Query can't be replanned due to partial output from initial execution
2019-01-30 13:58:57.444 Init Session:7f0a70ff9700-a00000501c8bff <FATAL> @v_node001: 08006/2607: Client has disconnected
2019-01-30 13:58:57.445 Init Session:7f0a70ff9700-a00000501c8bff [Txn] <INFO> Rollback Txn: a00000501c8bff 'SELECT article_1,
The second and third lines correspond to the query cascading to the next resource pool. It seems the query failed exactly 30 minutes after the final cascade.
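To double-check the timing, the gaps between the relevant log entries can be computed with a few lines of Python. This is only a small sketch: the timestamps are copied from the log excerpt above, while the event labels and the helper names (`parse_ts`, `gaps`) are made up for illustration.

```python
from datetime import datetime

# Timestamps copied from the vertica.log excerpt; the labels are illustrative.
LOG_EVENTS = [
    ("begin txn", "2019-01-30 13:23:54.849"),
    ("first cascade", "2019-01-30 13:24:04.916"),
    ("second cascade", "2019-01-30 13:28:54.924"),
    ("send failure", "2019-01-30 13:58:54.025"),
]


def parse_ts(text):
    """Parse a vertica.log style timestamp (millisecond precision)."""
    return datetime.strptime(text, "%Y-%m-%d %H:%M:%S.%f")


def gaps(events):
    """Return (label, seconds since the previous event) for every event after the first."""
    result = []
    for (_, prev), (label, cur) in zip(events, events[1:]):
        delta = parse_ts(cur) - parse_ts(prev)
        result.append((label, delta.total_seconds()))
    return result


if __name__ == "__main__":
    for label, seconds in gaps(LOG_EVENTS):
        print(f"{label}: {seconds:.3f} s after the previous event")
```

The last gap comes out at just under 1800 seconds, which is the 30 minutes mentioned above.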
We also see the same issue when executing a number of similar queries directly after each other (same select statement, different time windows). In that case the failure also occurs roughly 30 minutes after the start of the first query, even though the final failed query had been running for only a few minutes at that point.
These queries are all executed by connecting directly to a node in the cluster, so no load balancer timeouts can be involved. Nowhere in our settings have we defined a timeout of 30 minutes, neither on the resource pools nor for individual users.
We're running Vertica v9.0.1-0 on a 4 node cluster. To execute these queries we use pyODBC and vertica_python from Python applications, and we see the same issue with both modules. I have not yet encountered the issue with vsql or JDBC.
Any help on what could cause this error would be much appreciated.
Comments
Hi,
The log explicitly says "Query runtime exceeds limit, canceling".
Check the runtimecap for all resource pools involved.
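One way to review the caps is to list each pool together with the pool it cascades to and then follow the chain. The snippet below is a rough sketch, not a definitive procedure: it assumes the rows come from the RESOURCE_POOLS system table via the query shown (please verify the column names on your version), and the pool names in `EXAMPLE_ROWS` are hypothetical placeholders. `cascade_chain` is just an illustrative helper.

```python
# Assumed query against Vertica's RESOURCE_POOLS system table; check the
# column names against your version before relying on it.
POOLS_QUERY = "SELECT name, runtimecap, cascadeto FROM resource_pools"


def cascade_chain(rows, start):
    """Follow cascadeto links from `start`.

    `rows` is an iterable of (name, runtimecap, cascadeto) tuples, e.g. the
    result of running POOLS_QUERY through any DB-API cursor.
    Returns a list of (pool_name, runtimecap) in cascade order.
    """
    by_name = {name: (cap, nxt) for name, cap, nxt in rows}
    chain = []
    seen = set()
    current = start
    # Stop on an empty/None link, an unknown pool, or a loop.
    while current and current in by_name and current not in seen:
        seen.add(current)
        cap, nxt = by_name[current]
        chain.append((current, cap))
        current = nxt
    return chain


# Hypothetical pool names; the cap values are placeholders for a
# three-step cascade like the one described in this thread.
EXAMPLE_ROWS = [
    ("pool_short", "10s", "pool_medium"),
    ("pool_medium", "5m", "pool_long"),
    ("pool_long", "2h", None),
]

if __name__ == "__main__":
    for pool, cap in cascade_chain(EXAMPLE_ROWS, "pool_short"):
        print(f"{pool}: runtimecap {cap}")
```

With real data you would replace `EXAMPLE_ROWS` with the rows fetched from your cluster and pass the name of the pool the user starts in.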
I can mimic your vertica.log error as follows:
Example:
My log entry matched yours:
Plus if you go a little further in the log, I see this message:
2019-01-31 09:59:53.250 Init Session:7ff798aff700-a000000014ca19 <ERROR> @v_test_db_node0001: 57014/3326: Execution time exceeded run time cap of 00:00:02
Hi Jim,
Thanks for the quick reply:)
I should have been more clear; the messages indicating 'Query runtime exceeds limit, canceling' are expected since we have three cascading resource pools with increasing runtimecaps (10s, 5m, 2h) and they're just part of the cascading process.
The issue is in the next section stating 'Could not send data to client: No such file or directory', and then 'Client has disconnected'. I'm pretty sure the client application does not actually close the connection, since it's not giving any issues for similar queries, but somehow Vertica thinks it does. Applications that use the vertica_python module hang indefinitely, while applications using pyODBC exit with an error status but no message. And for some reason, when this happens, it's always after 30 minutes, yet we have no timeout setting anywhere for that amount of time (we do have longer timeouts: 1 hour for some users, and 2 hours on the final resource pool). Also, if some timeout was hit, I would expect an appropriate message in the logs, not 'Could not send data to client: No such file or directory'.
Currently this only happens for two queries, but it happens consistently. Both queries are pretty straightforward select statements, but they do fetch a lot of data (millions of rows). The fact that the issue never occurred before and other queries are unaffected makes me doubt that it's an issue in our applications.
Hope this clarifies it somewhat. I am still at a loss as to what might cause this, so I'm also not sure I'm explaining it correctly:)
You might be hitting a known issue: while your query is in the initial pool, Vertica starts returning records to the client, but the run time cap is hit before the query finishes. Vertica then tries to cancel the query and queue it on the new pool.
However, the query has already sent output rows to the client! In this case, Vertica does not know how to restart it and skip sending the rows back to the client that had already been sent.
There is an open JIRA on this issue and I will keep you updated on its progress.
Can you please send me your company's name so that I can add it to the JIRA as a client needing a resolution ASAP?
My email is james.knicely@microfocus.com
You might be able to mitigate the issue by adding an explicit ORDER BY on a unique key in the query.
@Jim_Knicely: Can you please share the JIRA ID/link if this has already been taken care of?