Broken Pipe using HDFS connector with wildcard

Hi - 

 

I am trying to issue a COPY statement that uses the HDFS connector with a wildcard to import all part files from an HDFS directory, for example:

 

copy my_events source hdfs(url='http://my.server.com:50070/webhdfs/v1/path/to/my_events/part-000*', username='hdfs') delimiter E'\t';

 

If I specify just one file, or a handful of files, it works perfectly. However, once the wildcard matches too many files (e.g. 'part-000*'), I get this error, repeated multiple times:

 

ERROR 3399: Failure in UDx RPC call InvokeDestroyUDL(): Broken pipe
ERROR 3399: Failure in UDx RPC call InvokeDestroyUDL(): Broken pipe
ERROR 3399: Failure in UDx RPC call InvokeDestroyUDL(): Broken pipe
ERROR 3399: Failure in UDx RPC call InvokeDestroyUDL(): Broken pipe
ERROR 3399: Failure in UDx RPC call InvokeDestroyUDL(): Broken pipe
ERROR 3399: Failure in UDx RPC call InvokeDestroyUDL(): Broken pipe
ERROR 3399: Failure in UDx RPC call InvokeDestroyUDL(): Broken pipe
ERROR 3399: Failure in UDx RPC call InvokeDestroyUDL(): Broken pipe
ERROR 3399: Failure in UDx RPC call InvokeDestroyUDL(): Broken pipe
ERROR 3399: Failure in UDx RPC call InvokeDestroyUDL(): Broken pipe
ERROR 3399: Failure in UDx RPC call InvokeDestroyUDL(): UDx side process has exited abnormally
ERROR 3399: Failure in UDx RPC call InvokeSetupUDL(): UDx side process has exited abnormally

 

I'm wondering whether this is a problem on the Vertica side or on the WebHDFS side, and what to do about it.

 

My wild guess is that one of these is happening:

 - Vertica is somehow overloading WebHDFS with too many requests (though I see no errors in the HDFS audit logs).

 - Vertica or WebHDFS is hitting some sort of filesystem limit (number of open files, or something else).

 - Some other configuration setting on either side is preventing the load.

 

If anyone has any idea how to debug this, I'd appreciate it!

Comments

  • Hi All - 

     

    After investigating the logs, specifically UDxFencedProcesses.log (very hard to read, by the way, since multiple threads seem to write to it out of order), I found that one of the part files had a file size of 0.

     

    I would first see something like:

     

     

    [UserMessage] [HDFS UDL INFO|src/Hdfs.cpp,251]: File size: 0

     

    Then I saw a message like this:

     

    [UserMessage] Hdfs - UDx canceled.  Throwing cancellation in UDx
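
Since the log is interleaved, a small filter can help pick these two messages out. Below is a minimal Python sketch, assuming the messages look exactly like the two quoted above; the function name `find_suspect_lines` is made up for illustration and the log location depends on your installation.

```python
import re

# Patterns based on the two log messages quoted above (an assumption:
# other Vertica versions may word these messages differently).
ZERO_SIZE = re.compile(r"\[HDFS UDL INFO\|[^\]]*\]: File size: 0\s*$")
CANCELED = re.compile(r"Hdfs - UDx canceled")


def find_suspect_lines(lines):
    """Return (line_number, kind, text) for zero-size and cancellation messages."""
    hits = []
    for number, line in enumerate(lines, start=1):
        if ZERO_SIZE.search(line):
            hits.append((number, "zero-size", line.strip()))
        elif CANCELED.search(line):
            hits.append((number, "canceled", line.strip()))
    return hits
```

Any iterable of lines works, so an open file handle for UDxFencedProcesses.log can be passed directly. Note that the quoted message does not include the file name, so this only shows where in the log to look.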

     

    When I removed the one part file of size 0, all the others (320 of them) loaded properly in one fell swoop.

     

    The odd thing is that loading this file individually works (it simply loads 0 rows), and it also works with comma-separated filenames, but it seems to fail when loaded using glob notation.

     

    Dunno, bug?
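
For anyone stuck on an affected version, one possible workaround is to skip the glob and build the comma-separated file list yourself, leaving out zero-length files. Below is a rough Python sketch, not a tested recipe: it assumes the standard WebHDFS `LISTSTATUS` operation is reachable at the same URL prefix used in the COPY statement, and that the connector accepts the comma-separated URLs inside a single `url` parameter. The helper names (`list_status`, `non_empty_urls`, `build_copy`) are hypothetical.

```python
import fnmatch
import json
import urllib.parse
import urllib.request


def list_status(dir_url, username):
    """Fetch a WebHDFS directory listing (op=LISTSTATUS) as a list of dicts."""
    query = urllib.parse.urlencode({"op": "LISTSTATUS", "user.name": username})
    with urllib.request.urlopen(f"{dir_url}?{query}") as response:
        payload = json.load(response)
    return payload["FileStatuses"]["FileStatus"]


def non_empty_urls(statuses, dir_url, pattern):
    """Keep regular files that match the glob and have a non-zero length."""
    base = dir_url.rstrip("/")
    urls = []
    for status in statuses:
        if status["type"] != "FILE" or status["length"] == 0:
            continue
        if fnmatch.fnmatchcase(status["pathSuffix"], pattern):
            urls.append(f"{base}/{status['pathSuffix']}")
    return sorted(urls)


def build_copy(table, urls, username):
    """Build a COPY statement with an explicit file list (no escaping is done)."""
    url_list = ",".join(urls)
    return (
        f"copy {table} source hdfs(url='{url_list}', "
        f"username='{username}') delimiter E'\\t';"
    )
```

Usage would be along the lines of `build_copy('my_events', non_empty_urls(list_status(base, 'hdfs'), base, 'part-000*'), 'hdfs')`, where `base` is the directory URL without the wildcard. With hundreds of files the statement gets long, so upgrading to a fixed release is the cleaner route if it is available to you.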

  • This is a known issue and has been fixed in 7.1.2-10. See https://my.vertica.com/docs/ReleaseNotes/7.1.x/HP_Vertica_7.1.x_Release_Notes.htm#HP27 for details.

  • I am using vertica_7.2.0-1_amd64.deb, and this is STILL a problem. Maybe the patch has not made it into the 7.2.x series yet, or this is a regression?

  • This issue was fixed in 7.2.1 and backported to 7.1.2-10. You can try to upgrade to 7.2.1 to get this fix. Thanks.
