One common task when working with Hadoop and HP Vertica is loading text files from the Hadoop Distributed File System (HDFS) into an HP Vertica table. You can load these files using Hadoop streaming, saving yourself the trouble of having to write custom map and reduce classes.
Note: Hadoop streaming is less efficient than a Java map/reduce Hadoop job, since it passes data through several different interfaces. Streaming is best used for smaller, one-time loads. If you need to load large amounts of data on a regular basis, you should create a standard Hadoop map/reduce job in Java or a script in Pig.
For example, suppose the text file in HDFS that you want to load contains values delimited by pipe characters (|), with each line of the file terminated by a carriage return:
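The sample file itself is not reproduced here; based on the query results at the end of this section, a matching three-row file would look something like this:

1|1.0|ONE
2|2.0|TWO
3|3.0|THREE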
In this case, the line delimiter poses a problem. You can easily include the column delimiter in the Hadoop command line arguments. However, it is hard to specify a carriage return in the Hadoop command line. To get around this issue, you can write a mapper script to strip the carriage return and replace it with some other character that is easy to enter in the command line and also does not occur in the data.
Below is an example of a mapper script written in Python. It performs two tasks:
Strips the carriage returns from the input text and terminates each line with a tilde (~).
Adds a key value (the string "streaming") followed by a tab character at the start of each line of the text file. The mapper script needs to add this key because the streaming job skips the reducer stage. The reducer is not necessary, since all of the data read from the text file is simply stored in the HP Vertica table. However, the VerticaStreamingOutput class requires key/value pairs, so the mapper script supplies the key.
#!/usr/bin/python
import sys

for line in sys.stdin.readlines():
    # Get rid of carriage returns.
    # CR is used as the record terminator by Streaming.jar
    line = line.strip()
    # Add a key. The key value can be anything.
    # The convention is to use the name of the
    # target table, as shown here.
    sys.stdout.write("streaming\t%s~\n" % line)
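Before submitting the job, you can sanity-check the mapper locally from the shell. This assumes the script has been saved as streaming.py (the file name here is illustrative):

$ echo "1|1.0|ONE" | python streaming.py
streaming	1|1.0|ONE~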
The Hadoop command to stream text files from the HDFS into HP Vertica using the above mapper script appears below.
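The original command is not reproduced here, so the following is a sketch of its general shape. The host name, credentials, file paths, and jar locations are placeholders, and the mapred.vertica.* parameter names and the output format's package name should be verified against the documentation for your version of the Vertica Hadoop Connector:

hadoop jar /usr/lib/hadoop/contrib/streaming/hadoop-streaming.jar \
    -libjars /usr/lib/hadoop/lib/hadoop-vertica.jar \
    -Dmapred.reduce.tasks=0 \
    -Dmapred.vertica.hostnames=VerticaHost \
    -Dmapred.vertica.port=5433 \
    -Dmapred.vertica.database=ExampleDB \
    -Dmapred.vertica.username=ExampleUser \
    -Dmapred.vertica.password=password123 \
    -Dmapred.vertica.output.table.name=streaming \
    -Dmapred.vertica.output.table.def='intcol integer, floatcol float, varcharcol varchar' \
    -Dmapred.vertica.output.delimiters.record='~' \
    -Dmapred.vertica.output.delimiters.field='|' \
    -input /tmp/textdata.txt \
    -output output \
    -mapper "python streaming.py" \
    -outputformat com.vertica.hadoop.VerticaStreamingOutput

The two delimiter parameters tell the connector that records end with the tilde the mapper script appends and that fields are separated by pipes, matching the format of the source file.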
The -Dmapred.reduce.tasks=0 parameter disables the streaming job's reducer stage. The job does not need a reducer, since the mapper script already formats the data as the VerticaStreamingOutput class expects.
Even though the VerticaStreamingOutput class handles the output from the mapper, you must still supply a valid output directory to the Hadoop command.
The result of running the command is a new table in the HP Vertica database:
=> SELECT * FROM streaming;
 intcol | floatcol | varcharcol
--------+----------+------------
      3 |        3 | THREE
      1 |        1 | ONE
      2 |        2 | TWO
(3 rows)