One common task when working with Hadoop and HP Vertica is loading text files from the Hadoop Distributed File System (HDFS) into an HP Vertica table. You can load these files using Hadoop streaming, saving yourself the trouble of having to write custom map and reduce classes.
Note: Hadoop streaming is less efficient than a Java map/reduce Hadoop job, since it passes data through several different interfaces. Streaming is best used for smaller, one-time loads. If you need to load large amounts of data on a regular basis, you should create a standard Hadoop map/reduce job in Java or a script in Pig.
For example, suppose the text file in HDFS that you want to load contains values delimited by pipe characters (|), with each line of the file terminated by a carriage return:
# $HADOOP_HOME/bin/hadoop dfs -cat /tmp/textdata.txt
1|1.0|ONE
2|2.0|TWO
3|3.0|THREE
In this case, the line delimiter poses a problem. You can easily include the column delimiter in the Hadoop command line arguments. However, it is hard to specify a carriage return in the Hadoop command line. To get around this issue, you can write a mapper script to strip the carriage return and replace it with some other character that is easy to enter in the command line and also does not occur in the data.
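Before settling on a replacement character, it can help to confirm that it really is absent from the data. The following is a minimal sketch of such a check, not part of the HP Vertica connector; the function name `pick_terminator` and the candidate list are hypothetical and chosen only for illustration.

```python
def pick_terminator(lines, candidates="~^@"):
    """Return the first candidate character that never appears in the data.

    Hypothetical helper: the candidate characters are an assumption,
    picked because they are easy to type on a command line.
    """
    used = set()
    for line in lines:
        used.update(line)  # collect every character seen in the data
    for ch in candidates:
        if ch not in used:
            return ch
    raise ValueError("every candidate character occurs in the data")


# Sample rows matching the example file contents.
sample = ["1|1.0|ONE", "2|2.0|TWO", "3|3.0|THREE"]
print(pick_terminator(sample))
```

For the sample rows, the tilde (~) does not occur in the data, so it is a safe choice of record terminator.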
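Below is an example of a mapper script written in Python. It performs two tasks: it strips the line terminator from each input line and appends a tilde (~) as the new record terminator, and it prefixes each line with a key (here, the string "streaming") followed by a tab character.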
#!/usr/bin/python
import sys

for line in sys.stdin.readlines():
    # Get rid of carriage returns.
    # CR is used as the record terminator by Streaming.jar
    line = line.strip()
    # Add a key. The key value can be anything.
    # The convention is to use the name of the
    # target table, as shown here.
    sys.stdout.write("streaming\t%s~\n" % line)
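To check what the mapper emits without running a Hadoop job, the per-line transformation can be pulled into a small function and applied to sample input. This is only a sketch of the same logic; the function name `format_record` and its parameters are hypothetical and do not appear in the connector.

```python
import sys


def format_record(line, key="streaming", terminator="~"):
    """Apply the mapper's per-line transformation to a single input line.

    Hypothetical helper for local testing: strips surrounding whitespace
    (including the line terminator), appends the replacement terminator,
    and prefixes the key followed by a tab.
    """
    return "%s\t%s%s\n" % (key, line.strip(), terminator)


# Sample lines standing in for the contents of /tmp/textdata.txt.
sample = ["1|1.0|ONE\r\n", "2|2.0|TWO\r\n", "3|3.0|THREE\r\n"]
for raw in sample:
    sys.stdout.write(format_record(raw))
```

Each output line consists of the key, a tab, the original record, and a trailing tilde. Note that `strip()` removes all leading and trailing whitespace, not only the line terminator; if the data can begin or end with significant spaces or tabs, `rstrip("\r\n")` would remove just the line ending.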
The Hadoop command to stream text files from the HDFS into HP Vertica using the above mapper script appears below.
hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-streaming-*.jar \
    -libjars $HADOOP_HOME/lib/hadoop-vertica.jar \
    -Dmapred.reduce.tasks=0 \
    -Dmapred.vertica.output.table.name=streaming \
    -Dmapred.vertica.output.table.def="intcol integer, floatcol float, varcharcol varchar" \
    -Dmapred.vertica.hostnames=VerticaHost01,VerticaHost02,VerticaHost03 \
    -Dmapred.vertica.port=5433 \
    -Dmapred.vertica.username=ExampleUser \
    -Dmapred.vertica.password=password123 \
    -Dmapred.vertica.database=ExampleDB \
    -Dmapred.vertica.output.delimiter="|" \
    -Dmapred.vertica.output.terminator="~" \
    -input /tmp/textdata.txt \
    -output output \
    -mapper "python path-to-script/mapper.py" \
    -outputformat com.vertica.hadoop.deprecated.VerticaStreamingOutput
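The delimiter and terminator options in the command must match what the mapper writes. The sketch below illustrates how those two characters divide the record text into rows and column values. It is an illustration only and makes no claim about how the connector or HP Vertica parses data internally; the function name `split_records` is hypothetical.

```python
def split_records(payload, delimiter="|", terminator="~"):
    """Split terminated record text into lists of column values.

    Hypothetical illustration: the defaults mirror the values passed to
    mapred.vertica.output.delimiter and mapred.vertica.output.terminator
    in the example command.
    """
    rows = []
    for record in payload.split(terminator):
        record = record.strip()  # drop the newline left between records
        if record:
            rows.append(record.split(delimiter))
    return rows


# Record text as the example mapper would produce it, without the key.
sample = "1|1.0|ONE~\n2|2.0|TWO~\n3|3.0|THREE~\n"
for row in split_records(sample):
    print(row)
```

If the terminator passed on the command line differed from the character the mapper appends, the record boundaries would no longer line up with the three columns in the table definition.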
The result of running the command is a new table in the HP Vertica database:
=> SELECT * FROM streaming;
 intcol | floatcol | varcharcol
--------+----------+------------
      3 |        3 | THREE
      1 |        1 | ONE
      2 |        2 | TWO
(3 rows)