
ExternalSource() from shell_load_package - problems streaming

Hello,

I'm trying to do some preprocessing on a dataset before inserting it into a table.

 

I create a table called test with 40 varchar(512) columns, named f1 -> f40.
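
In case it matters, the DDL is nothing special. This little generator shows the shape of it (just a sketch; the table and column names are exactly as above, everything else is boilerplate):

#!/usr/bin/python
# Emit the CREATE TABLE statement for the target table: 40 varchar(512)
# columns named f1 through f40. I paste the output into vsql.
cols = ',\n    '.join('f%d varchar(512)' % i for i in range(1, 41))
print 'CREATE TABLE test (\n    %s\n);' % cols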

 

I'm executing:

 

COPY test SOURCE ExternalSource(cmd='/home/dbadmin/bin/test.py', nodes='v_test_node0001') record terminator E'\n' delimiter E'\t';

 

This script fails:

 

#!/usr/bin/python
import os, sys, json
reload(sys)
sys.setdefaultencoding('utf-8')

f = open('/home/dbadmin/ugly_dataset.json', 'r')
for line in f:
    # re-encode each JSON line, split the message field on '|:|',
    # and emit one tab-separated record per input line
    imp = json.loads(line.decode('latin-1').encode('utf-8'))
    msg = imp['message'].split('|:|')
    msg.append(imp['tag'])
    msg.append(imp['time'])
    out = '\t'.join(msg)
    print out

With this version, the COPY only inserts around 400-500 rows out of roughly 900k.

 

This script succeeds:

 

#!/usr/bin/python
import os, sys, json
reload(sys)
sys.setdefaultencoding('utf-8')

r = []
f = open('/home/dbadmin/bin/2015-09-28-23_5', 'r')
for line in f:
    # same transformation as above, but collect every record in a list
    imp = json.loads(line.decode('latin-1').encode('utf-8'))
    msg = imp['message'].split('|:|')
    msg.append(imp['tag'])
    msg.append(imp['time'])
    out = '\t'.join(msg)
    r.append(out)

# emit the whole dataset to stdout in one shot
print '\n'.join(r)

The difference is that the succeeding script buffers all of the data in memory before printing it to stdout. That isn't a real solution for me, though; I won't have enough memory to buffer the real datasets I have in mind.
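
My guess (and it is only a guess) is that stdout buffering on the Python side interacts badly with the way ExternalSource() consumes the stream, since the only real difference between the two scripts is when the data hits stdout. Here is a sketch of a variant I plan to try, which writes and flushes each record explicitly; I haven't confirmed that it helps:

#!/usr/bin/python
# Sketch only: same per-line transform as the failing script, but each
# record is written to stdout and flushed immediately rather than going
# through print's default block buffering.
import sys, json
reload(sys)
sys.setdefaultencoding('utf-8')

f = open('/home/dbadmin/ugly_dataset.json', 'r')
for line in f:
    imp = json.loads(line.decode('latin-1').encode('utf-8'))
    msg = imp['message'].split('|:|')
    msg.append(imp['tag'])
    msg.append(imp['time'])
    sys.stdout.write('\t'.join(msg) + '\n')
    sys.stdout.flush()
f.close()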

 

I have also tried moving the workload into a series of bash pipes with sed/awk, with similar results.

 

Is there a right way of using ExternalSource() to stream a dataset?

 

 
