Error: 'Could not send data to client: No such file or directory'

Hi all,

Recently we've been running into issues with queries failing for no apparent reason with the error message 'Could not send data to client: No such file or directory'. The logs show the following:

$ cat vertica.log | grep 501c8bff
2019-01-30 13:23:54.849 Init Session:7f0a70ff9700-a00000501c8bff [Txn] <INFO> Begin Txn: a00000501c8bff 'SELECT article_1,
2019-01-30 13:24:04.916 Init Session:7f0a70ff9700-a00000501c8bff [EE] <INFO> Query runtime exceeds limit, canceling
2019-01-30 13:28:54.924 Init Session:7f0a70ff9700-a00000501c8bff [EE] <INFO> Query runtime exceeds limit, canceling


  [a00000501c8bff,1] - Queries:1,Threads:29,File Handles:104,Memory(KB):1102659,


2019-01-30 13:58:54.025 EEThread:7f09a0cca700-a00000501c8bff <LOG> @v_node0001: VX001/2907: Could not send data to client: No such file or directory
2019-01-30 13:58:57.444 Init Session:7f0a70ff9700-a00000501c8bff [EE] <INFO> Query can't be replanned due to partial output from initial execution
2019-01-30 13:58:57.444 Init Session:7f0a70ff9700-a00000501c8bff <FATAL> @v_node001: 08006/2607: Client has disconnected
2019-01-30 13:58:57.445 Init Session:7f0a70ff9700-a00000501c8bff [Txn] <INFO> Rollback Txn: a00000501c8bff 'SELECT article_1,

The second and third lines correspond to the query cascading to the next resource pool. It seems the query failed exactly 30 minutes after the final cascade.
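To double-check that 30-minute gap, the two timestamps can be compared with a few lines of Python. This is only a sketch: the timestamps are copied from the log excerpt above, and the helper name `parse_ts` is made up for illustration.

```python
from datetime import datetime

# Timestamps copied from the vertica.log excerpt above.
FINAL_CASCADE = "2019-01-30 13:28:54.924"  # last 'Query runtime exceeds limit, canceling'
SEND_FAILURE = "2019-01-30 13:58:54.025"   # 'Could not send data to client'


def parse_ts(text):
    """Parse a vertica.log timestamp (hypothetical helper, not part of any library)."""
    return datetime.strptime(text, "%Y-%m-%d %H:%M:%S.%f")


gap = parse_ts(SEND_FAILURE) - parse_ts(FINAL_CASCADE)
gap_seconds = gap.total_seconds()
print(f"Gap between final cascade and failure: {gap_seconds:.3f} s")
```

The gap comes out just under 1800 seconds, so the failure is within a second of 30 minutes after the final cascade.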

We also see the same issue when executing a number of similar queries directly after each other (same select statement, different time windows). In that case the failure also occurs roughly 30 minutes after the start of the first query, even though the final failed query had been running for only a few minutes at that point.

These queries are all executed by connecting directly to a node in the cluster, so there are no load balancer timeouts that might be involved. Nowhere in our settings have we defined a timeout of 30 minutes, neither on the resource pools nor for individual users.

We're running Vertica v9.0.1-0 on a 4 node cluster. To execute these queries we use pyODBC and vertica_python from Python applications, and we see the same issue with both modules. I have not yet encountered the issue with vsql or JDBC.

Any help on what could cause this error would be much appreciated.

Comments

  • Jim_Knicely - Administrator

    Hi,

    The log explicitly says "Query runtime exceeds limit, canceling".

    Check the runtimecap for all resource pools involved.

    I can mimic your vertica.log error as follows:

    Example:

    dbadmin=> ALTER RESOURCE POOL all_users RUNTIMECAP '2 seconds';
    ALTER RESOURCE POOL
    
    dbadmin=> ALTER USER jim resource pool all_users;
    ALTER USER
    
    dbadmin=> SELECT name, runtimecap FROM resource_pools WHERE name IN ('general', 'all_users');
       name    | runtimecap
    -----------+------------
     general   |
     all_users | 00:00:02
    
    dbadmin=> \c - jim
    You are now connected as user "jim".
    
    dbadmin=> SELECT COUNT(*) FROM big_fact2 CROSS JOIN big_fact2 b;
    ERROR 3326:  Execution time exceeded run time cap of 00:00:02
    

    My log entry matched yours:

        2019-01-31 09:55:31.534 Init Session:7ff793fff700-a000000014c9f8 [Session] <INFO> [Query] TX:a000000014c9f8(v_test_db_node0001-264917:0x18bd3b) SELECT COUNT(*) FROM big_fact2 CROSS JOIN big_fact2 b;
        2019-01-31 09:55:33.546 Init Session:7ff793fff700-a000000014c9f8 [EE] <INFO> Query runtime exceeds limit, canceling
    

    Plus, a little further in the log, I see this message:

    2019-01-31 09:59:53.250 Init Session:7ff798aff700-a000000014ca19 <ERROR> @v_test_db_node0001: 57014/3326: Execution time exceeded run time cap of 00:00:02

  • Hi Jim,

    Thanks for the quick reply:)

    I should have been clearer: the messages indicating 'Query runtime exceeds limit, canceling' are expected, since we have three cascading resource pools with increasing runtimecaps (10s, 5m, 2h), and they are simply part of the cascading process.
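To illustrate why those two messages are expected: assuming each runtimecap is counted from the start of the query (an assumption on my part, though it fits the log timestamps), the points at which the caps are hit can be sketched as below. The pool names are placeholders, not our real pool names.

```python
from datetime import datetime, timedelta

# 'Begin Txn' timestamp from the log excerpt in the original post.
QUERY_START = datetime(2019, 1, 30, 13, 23, 54, 849000)

# Runtimecaps of the three cascading pools; the pool names are placeholders.
RUNTIMECAPS = [
    ("pool_short", timedelta(seconds=10)),
    ("pool_medium", timedelta(minutes=5)),
    ("pool_long", timedelta(hours=2)),
]


def cap_deadlines(start, caps):
    """Return (pool, time) pairs at which each runtimecap would be hit.

    Assumes every cap is counted from the query start, not from the
    moment the query entered the pool.
    """
    return [(pool, start + cap) for pool, cap in caps]


deadlines = cap_deadlines(QUERY_START, RUNTIMECAPS)
for pool, deadline in deadlines:
    print(f"{pool}: runtimecap reached at {deadline:%H:%M:%S}")

# Time of the 'Could not send data to client' message in the log.
FAILURE = datetime(2019, 1, 30, 13, 58, 54, 25000)
time_left = deadlines[-1][1] - FAILURE
print(f"Time left before the final cap at failure: {time_left}")
```

Under that assumption the first two deadlines (13:24:04 and 13:28:54) line up with the two 'Query runtime exceeds limit, canceling' entries, while the failure at 13:58:54 comes well over an hour before the 2h cap would be reached.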

    The issue is in the next section, which states 'Could not send data to client: No such file or directory' and then 'Client has disconnected'. I'm pretty sure the client application does not actually close the connection, since it has no problems with similar queries, but somehow Vertica thinks it does. Applications that use the vertica_python module hang indefinitely, while applications using pyODBC exit with an error status but no message. And for some reason, when this happens it's always after 30 minutes, yet we have no timeout setting anywhere for that amount of time (for some users we do have a longer timeout of 1 hour, and 2 hours on the final resource pool). Also, if some timeout were hit, I would expect an appropriate message in the logs, not 'Could not send data to client: No such file or directory'.

    Currently this only happens for two queries, but it happens consistently. Both queries are pretty straightforward select statements, but they do fetch a lot of data (millions of rows). The fact that the issue never occurred before and other queries are unaffected makes me doubt that it's an issue in our applications.

    Hope this clarifies it somewhat. I am still at a loss as to what might cause this, so I'm also not sure I'm explaining it correctly:)

  • Jim_Knicely - Administrator
    edited January 2019

    You might be hitting a known issue: while your query is in the initial pool, Vertica starts returning records to the client, but the run time cap is hit before the query finishes. Vertica then tries to cancel the query and queue it on the next pool.

    However, the query has already sent output rows to the client! In this case, Vertica does not know how to restart the query while skipping the rows that have already been sent.

    There is an open JIRA on this issue and I will keep you updated on its progress.

    Can you please send me your company's name so that I can add it to the JIRA as a client needing a resolution ASAP?

    My email is james.knicely@microfocus.com

    You might be able to mitigate the issue by adding an explicit ORDER BY on a unique key in the query.

  • stiwari - Vertica Customer

    @Jim_Knicely: Can you please share the JIRA ID/link if this has already been taken care of?
