
Shell_load_package permission error

Hi,

I am trying to use the shell_load_package Extension Package.

I have created a simple Python script, test.py, that transforms each input line into another line.

The script is owned by dbadmin and is executable:

-rwxr-xr-x 1 dbadmin verticadba 5603 May  6 20:31 test.py
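
The script just reads lines from stdin, transforms them, and writes the results to stdout -- roughly the shape of the minimal sketch below (the real transformation is omitted):

#!/usr/bin/env python
# Minimal sketch of a line-transforming ExternalFilter script:
# read each line from stdin, transform it, write it to stdout.
import sys

for line in sys.stdin:
    sys.stdout.write(line.upper())  # stand-in transformation, not the real logic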


I am trying to run the following COPY command:

COPY t FROM '/home/dbadmin/test_data/*' ON my_node0001
WITH filter ExternalFilter(cmd='python /home/dbadmin/test.py')
DELIMITER E'\t' NULL '\N'
EXCEPTIONS '/data/copy_errors/'
REJECTED DATA '/data/copy_errors/';

But I am getting the following error:

[Vertica][VJDBC](3399) ERROR: Failure in UDx RPC call InvokeProcessUDL(): Error calling processUDL() in User Defined Object [ExternalFilter] at [ProcessLaunchingPlugin.cpp:82], error code: 0, message: External process 'python /home/dbadmin/test.py' reported error: python: can't open file '/home/dbadmin/test.py': [Errno 13] Permission denied 

If I run the same command with WITH filter ExternalFilter(cmd='cat'), it works fine.

Questions:
1) Any suggestions on what permissions need to change to make this work?
2) Is there another way to write a Python-based UDF that I can run on a column of a table in Vertica, short of writing it in C++, Java, or R?

Thanks for any suggestions!

Comments

  • Hi Tibor,

    For permissions -- by default (unless you set "NO_SUDO" during compilation), shell_load_package runs commands as the user "nobody".  I suspect that user doesn't have access to the "/home/dbadmin/" directory.

    You could grant access to this user by doing "chmod +rx /home/dbadmin/".  Or you could follow the instructions in the README for shell_load_package to disable running as the "nobody" user and just run as "dbadmin".  Both of these, however, have negative security implications.  Alternatively, you could create a world-readable directory elsewhere on your system and place this script there.

    Regarding Python-based UDFs: at this time, "shell_package" and "shell_load_package" are the only way to do this.  I should point out that we have open-sourced our C++, Java, and R connectors; it would in principle be quite possible (if time-consuming) to add support for other languages based on those reference implementations.

    Adam
  • Thanks Adam, that did the trick!

    Question #3: Is there a way to speed up the process? When I run the COPY on a 16-core machine, it hardly uses any resources -- please see the output of top below. How do I instruct COPY to use multiple instances of the same script?

    Thanks again for your help,
    Tibor

    top - 14:17:10 up 5 days, 20:57,  3 users,  load average: 0.94, 0.53, 0.22
    Tasks: 366 total,   1 running, 365 sleeping,   0 stopped,   0 zombie
    Cpu0  :  0.0%us,  0.3%sy,  0.3%ni, 99.3%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
    Cpu1  :  0.0%us,  0.3%sy,  0.3%ni, 99.3%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
    Cpu2  :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
    Cpu3  : 81.4%us,  0.7%sy,  0.0%ni, 17.9%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
    Cpu4  :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
    Cpu5  :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
    Cpu6  :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
    Cpu7  :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
    Cpu8  :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
    Cpu9  :  0.0%us,  0.3%sy,  0.3%ni, 99.3%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
    Cpu10 :  0.0%us,  0.3%sy,  0.7%ni, 99.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
    Cpu11 :  0.3%us,  0.3%sy,  0.0%ni, 99.3%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
    Cpu12 :  0.3%us,  0.3%sy,  0.0%ni, 99.3%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
    Cpu13 :  1.0%us,  0.7%sy,  0.0%ni, 98.3%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
    Cpu14 :  1.3%us,  2.7%sy,  0.3%ni, 95.3%id,  0.0%wa,  0.0%hi,  0.3%si,  0.0%st
    Cpu15 :  0.0%us,  0.7%sy,  0.0%ni, 99.3%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
    Mem:  30821764k total, 28277800k used,  2543964k free,   245480k buffers
    Swap:        0k total,        0k used,        0k free, 25083168k cached

      PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
    20538 dbadmin   20   0 14.7g 1.8g  32m S  8.0  6.2  13:36.26 vertica
    17833 dbadmin   20   0 1090m 2804 2136 S  3.3  0.0   0:12.85 vertica-udx-C++
    21893 dbadmin   20   0 15280 1464  948 R  0.7  0.0   0:00.63 top
    28250 dbadmin   20   0 1467m  21m 1244 S  0.7  0.1  62:16.25 python
       74 root      20   0     0    0    0 S  0.3  0.0   1:00.66 events/7
        1 root      20   0 19352 1184  876 S  0.0  0.0   0:00.51 init
        2 root      20   0     0    0    0 S  0.0  0.0   0:00.00 kthreadd
        3 root      RT   0     0    0    0 S  0.0  0.0   0:00.24 migration/0
        4 root      20   0     0    0    0 S  0.0  0.0   0:00.42 ksoftirqd/0
        5 root      RT   0     0    0    0 S  0.0  0.0   0:00.00 migration/0

  • Hi Tibor,

    Unfortunately, you can't tell COPY to use multiple copies of the same script because, at the filter stage, it wouldn't know how to split up your data. A filter is just a stream of bytes; for example, you might have a filter that reads in compressed or encrypted data and uncompresses or decrypts it, in which case there's no way to split up the encrypted input. (If there were, it wouldn't be very good encryption. :-) )

    If you are loading multiple files, Vertica can start up one Python process per file, so splitting your data into multiple files would be one solution. (If you've already done this and are seeing these performance bottlenecks, let me know; it's possible that your resource pools are tuned to not give COPY enough resources to run multiple load threads at once.)

    If you have a single file and you want your Python script to use multiple cores, I would suggest modifying the script itself. You'll have to break up the script's input data into chunks; by definition, we don't know how to do that for you. Then, for example, use Python's "multiprocessing" module to have multiple sub-processes each work on a piece of the data, as in the sketch below.

    If you had a UDParser, you could implement a chunker function and Vertica would be able to parallelize your loading operation. Unfortunately, there's no "ExternalParser" package currently on github.
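
    A rough sketch of that multiprocessing approach (assuming the input can safely be split on newline boundaries; transform() below is a stand-in, not your actual logic):

        #!/usr/bin/env python
        # Sketch: fan per-line work out to all cores with multiprocessing.
        # Assumes each input line can be transformed independently.
        import sys
        from multiprocessing import Pool

        def transform(line):
            return line.upper()  # stand-in per-line transformation

        if __name__ == '__main__':
            pool = Pool()  # one worker per CPU core by default
            # imap streams results back in input order, so output rows
            # stay aligned with input rows
            for out in pool.imap(transform, sys.stdin, chunksize=10000):
                sys.stdout.write(out)
            pool.close()
            pool.join()

    Adam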
  • Adam,

    A further issue I found: even when I chunk a big file up into small ones, the COPY executes fast on a few files but gets stuck when run against 200 files with globbing. Any suggestions other than doing multiple copies sequentially?

    Thanks,
    Tibor


  • Hey Tibor,

    I'm curious -- what do you mean by "stuck"?

    Vertica will only load #threads files at a time.  If you give it a large number of files but you only have 16 cores, it will load the first 16 files, then the next 16, and so on.  (That assumes the files are the same size -- if not, Vertica uses bin-packing and file-splitting algorithms to try to keep all CPUs busy and working efficiently as much as possible.)

    200 files is actually not all that large for a single COPY statement; we have users who regularly load thousands of files at a time.  So I'm surprised to hear that you're getting stuck.

    Could you check your script?  Maybe it's buffering lots of data in memory and the system is swapping, or it's running slowly for some other reason.

    Adam
  • Hi Adam,

    The script is the same as before; the only difference is that I am now using more files, not just a few.

    COPY t FROM '/home/dbadmin/test_data/*' ON my_node0001
    WITH filter ExternalFilter(cmd='python /home/dbadmin/test.py')
    DELIMITER E'\t' NULL '\N'
    EXCEPTIONS '/data/copy_errors/'
    REJECTED DATA '/data/copy_errors/';

    While a single file executes in a minute or so, with 200 files the process does not complete even in 200 minutes. At some point it just hangs and the COPY command never returns (I have to Ctrl+C out of it the next day), and I still see the UDx call in the process list even a day later.

    Thanks for your help,
    Tibor
     

  • Hi Tibor,

    Other than the "cmd='python /home/dbadmin/test.py'" bit, the rest of this statement is an entirely standard/typical COPY command. 

    I asked you some questions about your test.py script in my previous response.  Please let me know what you find.

    You could also take a look at the "load_streams" system table; it will tell you the status of files that are currently loading.

    One observation -- if you're loading just one file, it may fit entirely in WOS, so the COPY never has to wait on disk I/O and doesn't have to pre-sort the data.  If you load enough data, eventually it will not fit into the WOS memory buffer and will have to go straight to disk.

    You can force the data to always go straight to disk by adding the "DIRECT" keyword to the end of the COPY statement.  (That might make for a fairer performance comparison in this case, though doing it by default risks generating many small ROS files and making much more work for our Tuple Mover later.)  If the DIRECT load goes much slower, then you probably have a disk I/O bottleneck and should investigate what's going on with your hard disks.  (Other processes writing files?  Degraded RAID array?  Disk is just really slow?)

    Adam
