We're Moving!

The Vertica Forum is moving to a new OpenText Analytics Database (Vertica) Community.

Join us there to post discussion topics, learn about

product releases, share tips, access the blog, and much more.

Create My New Community Account Now


ExternalSource() from shell_load_package - problems streaming — Vertica Forum

ExternalSource() from shell_load_package - problems streaming

Hello

 

I'm trying to do some preprocessing on a dataset before inserting it in to a table.

 

I create a table with 40 columns of varchar(512), named f1 -> f40

 

I'm executing:

 

COPY test SOURCE ExternalSource(cmd='/home/dbadmin/bin/test.py', nodes='v_test_node0001') record terminator E'\n' delimiter E'\t';

 

This script fails:

 

#!/usr/bin/python
import os, sys, json
reload(sys)
sys.setdefaultencoding('utf-8')
f = open('/home/dbadmin/ugly_dataset.json','r')
for line in f:
imp = json.loads(line.decode('latin-1').encode('utf-8'))
msg = imp['message'].split('|:|')
msg.append(imp['tag'])
msg.append(imp['time'])
out = '\t'.join(msg)

print out

It will insert around 400-500 rows of 900k.

 

This script succeeds:

 

#!/usr/bin/python
import os, sys, json
reload(sys)
sys.setdefaultencoding('utf-8')
r = []
f = open('/home/dbadmin/bin/2015-09-28-23_5','r')
for line in f:
imp = json.loads(line.decode('latin-1').encode('utf-8'))
msg = imp['message'].split('|:|')
msg.append(imp['tag'])
msg.append(imp['time'])
out = '\t'.join(msg)

r.append(out)

print '\n'.join(r)

The difference being, the script that succeeds is buffering all of the data in memory before printing to stdout.  But this isn't a real solution for me - I won't have enough memory to buffer the real datasets I have in mind.

 

I have tried moving the workload to a series of pipes in bash with sed/awk and have had similar results.

 

Is there a right way of using ExternalSource() to stream a dataset?

 

 

Leave a Comment

BoldItalicStrikethroughOrdered listUnordered list
Emoji
Image
Align leftAlign centerAlign rightToggle HTML viewToggle full pageToggle lights
Drop image/file