Shell_load_package permission error
Hi,
I am trying to use the shell_load_package Extension Package.
I have created a simple Python script, called test.py, that transforms one line into another.
The script is owned by dbadmin and is executable:
-rwxr-xr-x 1 dbadmin verticadba 5603 May 6 20:31 test.py
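For context, the script just reads stdin line by line, transforms each line, and writes the result to stdout (that is how ExternalFilter pipes data through the command). A simplified sketch of that kind of filter (not the real test.py; the upper-casing is only a placeholder transform):

#!/usr/bin/env python
# Simplified sketch of a line-transforming ExternalFilter script.
# ExternalFilter feeds the file contents on stdin and reads the
# filtered output from stdout; upper-casing is a placeholder transform.
import sys

for line in sys.stdin:
    sys.stdout.write(line.upper())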
I am trying to run the following copy command:
COPY t FROM '/home/dbadmin/test_data/*' ON my_node0001
WITH filter ExternalFilter(cmd='python /home/dbadmin/test.py')
DELIMITER E'\t' NULL '\N'
EXCEPTIONS '/data/copy_errors/'
REJECTED DATA '/data/copy_errors/';
But I am getting the following error:
[Vertica][VJDBC](3399) ERROR: Failure in UDx RPC call InvokeProcessUDL(): Error calling processUDL() in User Defined Object [ExternalFilter] at [ProcessLaunchingPlugin.cpp:82], error code: 0, message: External process 'python /home/dbadmin/test.py' reported error: python: can't open file '/home/dbadmin/test.py': [Errno 13] Permission denied
If I run the same COPY with WITH filter ExternalFilter(cmd='cat'), it works fine.
Questions:
1) Any suggestions on what permissions need to change to make this work?
2) Is there any other way to write a Python-based UDF that I can execute on a column of a Vertica table, other than writing it in C++, Java, or R?
Thanks for any suggestions!
Comments
For permissions -- by default (unless you set "NO_SUDO" during compilation), shell_load_package runs commands as the user "nobody". I suspect that user doesn't have access to the "/home/dbadmin/" directory.
You could grant access to this user by doing "chmod +rx /home/dbadmin/". Or you could follow the instructions in the README for shell_load_package to disable running as the "nobody" user and just run as "dbadmin". Both of these, however, have negative security implications. Alternatively, you could create a world-readable directory elsewhere on your system and place this script there.
Regarding Python-based UDFs: at this time, "shell_package" and "shell_load_package" are the only way to do this. I should point out that we have open-sourced our C++, Java, and R connectors; it would in principle be very possible (if time-consuming) to add support for other languages based on those reference implementations.
Adam
Question #3:
3) Is there a way to speed up the process? When running the COPY on a 16-core machine, it hardly uses any resources; please see the output of top below.
How do I instruct COPY to use multiple instances of the same script?
Thanks again for your help,
Tibor
top - 14:17:10 up 5 days, 20:57, 3 users, load average: 0.94, 0.53, 0.22
Tasks: 366 total, 1 running, 365 sleeping, 0 stopped, 0 zombie
Cpu0 : 0.0%us, 0.3%sy, 0.3%ni, 99.3%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu1 : 0.0%us, 0.3%sy, 0.3%ni, 99.3%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu2 : 0.0%us, 0.0%sy, 0.0%ni,100.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu3 : 81.4%us, 0.7%sy, 0.0%ni, 17.9%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu4 : 0.0%us, 0.0%sy, 0.0%ni,100.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu5 : 0.0%us, 0.0%sy, 0.0%ni,100.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu6 : 0.0%us, 0.0%sy, 0.0%ni,100.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu7 : 0.0%us, 0.0%sy, 0.0%ni,100.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu8 : 0.0%us, 0.0%sy, 0.0%ni,100.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu9 : 0.0%us, 0.3%sy, 0.3%ni, 99.3%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu10 : 0.0%us, 0.3%sy, 0.7%ni, 99.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu11 : 0.3%us, 0.3%sy, 0.0%ni, 99.3%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu12 : 0.3%us, 0.3%sy, 0.0%ni, 99.3%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu13 : 1.0%us, 0.7%sy, 0.0%ni, 98.3%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu14 : 1.3%us, 2.7%sy, 0.3%ni, 95.3%id, 0.0%wa, 0.0%hi, 0.3%si, 0.0%st
Cpu15 : 0.0%us, 0.7%sy, 0.0%ni, 99.3%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Mem: 30821764k total, 28277800k used, 2543964k free, 245480k buffers
Swap: 0k total, 0k used, 0k free, 25083168k cached
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
20538 dbadmin 20 0 14.7g 1.8g 32m S 8.0 6.2 13:36.26 vertica
17833 dbadmin 20 0 1090m 2804 2136 S 3.3 0.0 0:12.85 vertica-udx-C++
21893 dbadmin 20 0 15280 1464 948 R 0.7 0.0 0:00.63 top
28250 dbadmin 20 0 1467m 21m 1244 S 0.7 0.1 62:16.25 python
74 root 20 0 0 0 0 S 0.3 0.0 1:00.66 events/7
1 root 20 0 19352 1184 876 S 0.0 0.0 0:00.51 init
2 root 20 0 0 0 0 S 0.0 0.0 0:00.00 kthreadd
3 root RT 0 0 0 0 S 0.0 0.0 0:00.24 migration/0
4 root 20 0 0 0 0 S 0.0 0.0 0:00.42 ksoftirqd/0
5 root RT 0 0 0 0 S 0.0 0.0 0:00.00 migration/0
A further issue I found: even if I chunk a big file up into small ones, the COPY executes fast on a few files but gets stuck when run against 200 files using globbing. Any suggestions other than doing multiple COPYs sequentially?
Thanks,
Tibor
I'm curious -- what do you mean by "stuck"?
Vertica will only load #threads files at a time. If you give it a large number of files but you only have 16 cores, it will load the first 16 files, then the next 16 files, and so on. (Assuming the files are the same size -- if not, Vertica uses bin-packing and file-splitting algorithms in an attempt to keep all CPUs busy and working as efficiently as possible.)
200 files is actually not all that large for a single COPY statement; we have users who regularly load thousands of files at a time. So I'm surprised to hear that you're getting stuck.
Could you check your script? Maybe it's buffering lots of data in memory and the system is swapping? Or running slowly for other reasons, etc.
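For example (a hypothetical sketch, since I haven't seen your script), a filter that slurps all of stdin before writing anything holds each whole file in memory, while a streaming filter runs in constant memory:

import sys

# Buffering approach: reads the entire input before writing anything,
# so each file is held in memory in full (risky with many or large files).
# data = sys.stdin.read()
# sys.stdout.write(data.upper())

# Streaming approach: one line in, one line out, constant memory.
for line in sys.stdin:
    sys.stdout.write(line.upper())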
Adam
The script is the same as before, except now I am trying to load more files, not just a few.
COPY t FROM '/home/dbadmin/test_data/*' ON my_node0001
WITH filter ExternalFilter(cmd='python /home/dbadmin/test.py')
DELIMITER E'\t' NULL '\N'
EXCEPTIONS '/data/copy_errors/'
REJECTED DATA '/data/copy_errors/';
While a single file executes in a minute or so, with 200 files the process does not complete even in 200 minutes. At some point it just hangs and the COPY command never returns (I have to Ctrl+C out of it the next day), and I still see the UDx call in the process list a day later.
Thanks for your help,
Tibor
Other than the "cmd='python /home/dbadmin/test.py'" bit, the rest of this statement is an entirely standard/typical COPY command.
I asked you some questions about your test.py script in my previous response. Please let me know what you find.
You could also take a look at the "load_streams" system table; it will tell you the status of files that are currently loading.
One observation -- if you're loading just one file, it may fit entirely in WOS, so the COPY never has to wait on disk I/O and doesn't have to pre-sort the data. If you load enough data, eventually it will not fit into the WOS memory buffer and will have to go straight to disk. You can force the data to always go straight to disk by adding the "DIRECT" keyword to the end of the COPY statement (which might make for a fairer performance comparison in this case, though it risks generating many small ROS files and making much more work for our Tuple Mover later if you do it by default). If this goes much slower, then you probably have a disk I/O bottleneck and should investigate what's going on with your hard disks. (Other processes writing files? Degraded RAID array? Disk is just really slow?)
Adam