ExternalSource() from shell_load_package - problems streaming
Hello
I'm trying to do some preprocessing on a dataset before inserting it into a table.
I create a table with 40 columns of varchar(512), named f1 through f40.
I'm executing:
COPY test SOURCE ExternalSource(cmd='/home/dbadmin/bin/test.py', nodes='v_test_node0001') record terminator E'\n' delimiter E'\t';
This script fails:
#!/usr/bin/python
import os, sys, json

reload(sys)
sys.setdefaultencoding('utf-8')

f = open('/home/dbadmin/ugly_dataset.json', 'r')
for line in f:
    # each line is a JSON object; re-encode latin-1 input as UTF-8 before parsing
    imp = json.loads(line.decode('latin-1').encode('utf-8'))
    msg = imp['message'].split('|:|')
    msg.append(imp['tag'])
    msg.append(imp['time'])
    out = '\t'.join(msg)
    print out
It only inserts around 400-500 of the 900k rows.
This script succeeds:
#!/usr/bin/python
import os, sys, json

reload(sys)
sys.setdefaultencoding('utf-8')

r = []
f = open('/home/dbadmin/bin/2015-09-28-23_5', 'r')
for line in f:
    imp = json.loads(line.decode('latin-1').encode('utf-8'))
    msg = imp['message'].split('|:|')
    msg.append(imp['tag'])
    msg.append(imp['time'])
    out = '\t'.join(msg)
    r.append(out)
print '\n'.join(r)
The only difference is that the succeeding script buffers all of the output in memory and writes it to stdout in one go. That isn't a real solution for me: I won't have enough memory to buffer the real datasets I have in mind.
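For reference, this is roughly the streaming behaviour I'm after. The explicit flush after each record is just a guess on my part that stdout block-buffering might be involved; I haven't confirmed that this is the cause:

#!/usr/bin/python
import sys, json

f = open('/home/dbadmin/ugly_dataset.json', 'r')
for line in f:
    imp = json.loads(line.decode('latin-1').encode('utf-8'))
    msg = imp['message'].split('|:|')
    msg.append(imp['tag'])
    msg.append(imp['time'])
    # emit one record at a time instead of accumulating rows in a list
    sys.stdout.write(('\t'.join(msg) + '\n').encode('utf-8'))
    sys.stdout.flush()  # guess: push each record out immediately in case buffering matters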
I have also tried moving the workload into a bash pipeline of sed/awk commands, with similar results.
Is there a right way of using ExternalSource() to stream a dataset?