ExternalSource() from shell_load_package - problems streaming

Hello

 

I'm trying to do some preprocessing on a dataset before inserting it into a table.

 

I create a table with 40 columns of varchar(512), named f1 -> f40
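
For reference, the DDL amounts to this - I build it with a small Python 2 snippet rather than typing out 40 identical columns (table name "test" as in the COPY below, column names and types as described above):

#!/usr/bin/python
# Sketch: print the CREATE TABLE for 40 varchar(512) columns named f1..f40.
cols = ', '.join('f%d varchar(512)' % i for i in range(1, 41))
print 'CREATE TABLE test (%s);' % cols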

 

I'm executing:

 

COPY test SOURCE ExternalSource(cmd='/home/dbadmin/bin/test.py', nodes='v_test_node0001') record terminator E'\n' delimiter E'\t';

 

This script fails:

 

#!/usr/bin/python
import os, sys, json
reload(sys)
sys.setdefaultencoding('utf-8')

# Stream one tab-delimited record to stdout per input JSON line.
f = open('/home/dbadmin/ugly_dataset.json', 'r')
for line in f:
    imp = json.loads(line.decode('latin-1').encode('utf-8'))
    msg = imp['message'].split('|:|')
    msg.append(imp['tag'])
    msg.append(imp['time'])
    out = '\t'.join(msg)
    print out

It inserts only around 400-500 rows out of the 900k before failing.

 

This script succeeds:

 

#!/usr/bin/python
import os, sys, json
reload(sys)
sys.setdefaultencoding('utf-8')

# Buffer every record in memory, then write them all to stdout at once.
r = []
f = open('/home/dbadmin/bin/2015-09-28-23_5', 'r')
for line in f:
    imp = json.loads(line.decode('latin-1').encode('utf-8'))
    msg = imp['message'].split('|:|')
    msg.append(imp['tag'])
    msg.append(imp['time'])
    out = '\t'.join(msg)
    r.append(out)

print '\n'.join(r)

The difference is that the succeeding script buffers all of the data in memory before printing it to stdout. That isn't a real solution for me, though - I won't have enough memory to buffer the real datasets I have in mind.
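
For clarity, here is a minimal sketch of the kind of streaming I'm after - one record written and flushed per input line, nothing held in memory (whether the explicit flush makes any difference to ExternalSource() is exactly what I don't know):

#!/usr/bin/python
# Sketch only: emit one record at a time and flush stdout explicitly,
# instead of buffering the whole dataset. Same input file as the
# failing script above.
import sys, json
reload(sys)
sys.setdefaultencoding('utf-8')

with open('/home/dbadmin/ugly_dataset.json', 'r') as f:
    for line in f:
        imp = json.loads(line.decode('latin-1').encode('utf-8'))
        msg = imp['message'].split('|:|')
        msg.append(imp['tag'])
        msg.append(imp['time'])
        sys.stdout.write('\t'.join(msg) + '\n')
        sys.stdout.flush()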

 

I have tried moving the workload to a series of pipes in bash with sed/awk and have had similar results.

 

Is there a right way of using ExternalSource() to stream a dataset?

 

 
