hadoop - how to write data from HDFS to Vertica using Hadoop streaming

Ravi_Ranjan_Kum · November 2013

Nimmi_gupta · January 2014

Loading a Text File From HDFS into HP Vertica

One common task when working with Hadoop and HP Vertica is loading text files from the Hadoop Distributed File System (HDFS) into an HP Vertica table. You can load these files using Hadoop streaming, saving yourself the trouble of having to write custom map and reduce classes.

Note: Hadoop streaming is less efficient than a Java map/reduce Hadoop job, since it passes data through several different interfaces. Streaming is best used for smaller, one-time loads. If you need to load large amounts of data on a regular basis, you should create a standard Hadoop map/reduce job in Java or a script in Pig.

For example, suppose the text file in HDFS that you want to load contains values delimited by pipe characters (|), with each line of the file terminated by a carriage return:

# $HADOOP_HOME/bin/hadoop dfs -cat /tmp/textdata.txt
1|1.0|ONE
2|2.0|TWO
3|3.0|THREE

In this case, the line delimiter poses a problem. You can easily include the column delimiter in the Hadoop command line arguments. However, it is hard to specify a carriage return in the Hadoop command line. To get around this issue, you can write a mapper script to strip the carriage return and replace it with some other character that is easy to enter in the command line and also does not occur in the data.
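Before settling on a replacement terminator, it can help to confirm that the character really is absent from the data. The following is a minimal sketch, not part of the HP Vertica documentation: the function name `safe_terminators` and the candidate list are hypothetical, and the sample rows are the three lines shown above.

```python
def safe_terminators(lines, candidates=("~", "^", "`")):
    """Return the candidate characters that never appear in the given lines."""
    used = set()
    for line in lines:
        # Collect every character seen anywhere in the data.
        used.update(line)
    return [c for c in candidates if c not in used]


sample = ["1|1.0|ONE", "2|2.0|TWO", "3|3.0|THREE"]
print(safe_terminators(sample))
```

For a large file you would feed the function a representative sample, or the whole file read line by line, rather than a hard-coded list.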

Below is an example of a mapper script written in Python. It performs two tasks:

Strips the carriage returns from the input text and terminates each line with a tilde (~).
Adds a key value (the string "streaming") followed by a tab character at the start of each line of the text file. The mapper script needs to do this because the streaming job that reads text files skips the reducer stage. The reducer isn't necessary, since all of the data read from the text file should be stored in the HP Vertica table. However, the VerticaStreamingOutput class requires key and value pairs, so the mapper script adds the key.

#!/usr/bin/python
import sys

for line in sys.stdin.readlines():
    # Get rid of carriage returns.
    # CR is used as the record terminator by Streaming.jar
    line = line.strip()
    # Add a key. The key value can be anything.
    # The convention is to use the name of the
    # target table, as shown here.
    sys.stdout.write("streaming\t%s~\n" % line)
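The mapper's per-line transformation can be tried locally before a job is submitted. The sketch below re-implements that logic as a function so it can be checked against the sample rows. The name `format_record` and its parameters are hypothetical, introduced only for this check, and the `\r\n` line endings in the sample are an assumption about how the file is terminated.

```python
def format_record(line, key="streaming", terminator="~"):
    """Mimic the mapper: strip the line ending, prefix key + tab, append the terminator."""
    return "%s\t%s%s\n" % (key, line.strip(), terminator)


# Sample rows from the text file, with assumed CR/LF line endings.
sample = ["1|1.0|ONE\r\n", "2|2.0|TWO\r\n", "3|3.0|THREE\r\n"]
for raw in sample:
    # repr() makes the tab, tilde and newline visible.
    print(repr(format_record(raw)))
```

Each printed record should start with the key and a tab and end with the tilde terminator, which is the shape the notes in this section say VerticaStreamingOutput expects.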

The Hadoop command to stream text files from the HDFS into HP Vertica using the above mapper script appears below.

hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-streaming-*.jar \
    -libjars $HADOOP_HOME/lib/hadoop-vertica.jar \
    -Dmapred.reduce.tasks=0 \
    -Dmapred.vertica.output.table.name=streaming \
    -Dmapred.vertica.output.table.def="intcol integer, floatcol float, varcharcol varchar" \
    -Dmapred.vertica.hostnames=VerticaHost01,VerticaHost02,VerticaHost03 \
    -Dmapred.vertica.port=5433 \
    -Dmapred.vertica.username=ExampleUser \
    -Dmapred.vertica.password=password123 \
    -Dmapred.vertica.database=ExampleDB \
    -Dmapred.vertica.output.delimiter="|" \
    -Dmapred.vertica.output.terminator="~" \
    -input /tmp/textdata.txt \
    -output output \
    -mapper "python path-to-script/mapper.py" \
    -outputformat com.vertica.hadoop.deprecated.VerticaStreamingOutput
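If the same load is repeated with different tables or hosts, the long option list is easy to mistype. The following is a minimal sketch, assuming you drive the command from Python. The helper `build_streaming_args` is hypothetical and only assembles an argument list in the same order as the command above; it does not run anything. The jar file names and the mapper path are placeholders, and only a subset of the properties is shown.

```python
def build_streaming_args(streaming_jar, connector_jar, props, input_path,
                         output_dir, mapper, output_format):
    """Assemble the argument list for a `hadoop jar` streaming invocation."""
    args = ["hadoop", "jar", streaming_jar, "-libjars", connector_jar]
    # Keep the -D properties ahead of the streaming-specific options,
    # matching the order used in the command above.
    for name, value in props.items():
        args.append("-D%s=%s" % (name, value))
    args += ["-input", input_path, "-output", output_dir,
             "-mapper", mapper, "-outputformat", output_format]
    return args


props = {
    "mapred.reduce.tasks": "0",
    "mapred.vertica.output.table.name": "streaming",
    "mapred.vertica.output.delimiter": "|",
    "mapred.vertica.output.terminator": "~",
}
args = build_streaming_args(
    "hadoop-streaming.jar",   # placeholder jar name
    "hadoop-vertica.jar",     # placeholder jar name
    props,
    "/tmp/textdata.txt",
    "output",
    "python mapper.py",       # placeholder mapper path
    "com.vertica.hadoop.deprecated.VerticaStreamingOutput",
)
print(args)
```

Passing a list like this to `subprocess.run(args)` bypasses the shell, so characters such as `|` and `~` need no extra quoting.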

Notes

The -Dmapred.reduce.tasks=0 parameter disables the streaming job's reducer stage. The job does not need a reducer, since the mapper script processes the data into the format that the VerticaStreamingOutput class expects.
Even though the VerticaStreamingOutput class is handling the output from the mapper, you need to supply a valid output directory to the Hadoop command.

The result of running the command is a new table in the HP Vertica database:

=> SELECT * FROM streaming;
 intcol | floatcol | varcharcol
--------+----------+------------
      3 |        3 | THREE
      1 |        1 | ONE
      2 |        2 | TWO
(3 rows)

Rahul_Choudhary · January 2014

Hi, check this link:
https://my.vertica.com/docs/6.1.x/HTML/index.htm#15552.htm
You have to install both Hadoop and Vertica, and then install the connector. I will check it out and tell you; in the meantime, check the link out.
