Copy Hive part files (ctrl-A delimited, '\001') to Vertica

Amit_1 · July 2014

Hi,

I am trying to load a few part files produced from a Hive table. These files are in an HDFS directory and are ctrl-A ('\001') delimited.

I am using the Vertica HDFS connector to copy this data.

copy schema.temp Hdfs(url='http://78.146.185.212:60070/webhdfs/v1/user/temp/*', username='abc') DELIMITER E'\001' ;

This command does not seem to recognize the ctrl-A delimiter ('\001'), so the records are getting rejected. Can you please let me know the correct way to load this data?
The COPY command works for '|'-delimited files, though.

Regards,
-Amit

Amit_1 · July 2014

I found the issue: the Vertica COPY command apparently has a problem with wildcard matching (like *).

copy schema.temp Hdfs(url='http://78.146.185.212:60070/webhdfs/v1/user/temp/*', username='abc') DELIMITER E'\001' ;

This does not work: data gets loaded, but lots of records are rejected.

But when I give the file names individually, like

copy schema.temp Hdfs(url='http://78.146.185.212:60070/webhdfs/v1/user/temp/file_1', username='abc') DELIMITER E'\001' ;

copy schema.temp Hdfs(url='http://78.146.185.212:60070/webhdfs/v1/user/temp/file_2', username='abc') DELIMITER E'\001' ;

This works.

But the issue is that I have a lot of part files. Is there a way to load all of them without mentioning them individually?

Amit_1 · July 2014

Can someone please reply?

[Deleted User] · July 2014

Hi Amit,

It sounds to me like some of your files are formatted differently, or like there is some other issue in your environment.

Unfortunately, I don't see enough here to help sort things out.

Have you verified that this glob is matching the set of files that you think it is matching? Have you verified that each file (not just the first two) loads individually?
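
One way to check what the glob could be matching is to ask WebHDFS for the directory listing and compare it with the file names you expect. Below is a minimal Python sketch, assuming the NameNode answers the standard WebHDFS LISTSTATUS call at the same host and port used in the COPY statement. The function names are made up for illustration, and fnmatch only approximates whatever globbing the connector itself applies.

```python
import fnmatch
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def liststatus_url(base_dir_url, username):
    """Build a WebHDFS LISTSTATUS URL for a directory URL (without the trailing glob)."""
    query = urlencode({"op": "LISTSTATUS", "user.name": username})
    return base_dir_url.rstrip("/") + "?" + query


def matching_entries(liststatus_json, pattern="*"):
    """Return (name, type, length) for every directory entry whose name matches pattern."""
    statuses = json.loads(liststatus_json)["FileStatuses"]["FileStatus"]
    return [
        (s["pathSuffix"], s["type"], s["length"])
        for s in statuses
        if fnmatch.fnmatchcase(s["pathSuffix"], pattern)
    ]


def fetch_entries(base_dir_url, username, pattern="*"):
    """Query the NameNode (needs network access) and list what pattern matches."""
    with urlopen(liststatus_url(base_dir_url, username)) as resp:
        return matching_entries(resp.read().decode("utf-8"), pattern)
```

Calling fetch_entries('http://78.146.185.212:60070/webhdfs/v1/user/temp', 'abc') would show every entry that '*' can pick up, including anything in the directory that is not one of the part files, such as subdirectories or empty files.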

Also -- you say that data does get loaded but lots of records are rejected. Which rows are rejected? Which files do those rows come from? Why are they rejected? (Vertica gives a reason for rejection for each rejected row.)
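
Independently of the rejection reasons Vertica reports, you can sanity-check the files yourself by counting ctrl-A separated fields per line. This is only a rough Python sketch: it assumes a part file has been copied to the local disk (for example with hdfs dfs -get), that you know how many columns schema.temp has, and that no field contains an embedded newline. The function names are hypothetical.

```python
from collections import Counter

CTRL_A = b"\x01"  # Hive's default field delimiter, written '\001' in the COPY statement


def field_count_histogram(path, delimiter=CTRL_A):
    """Count how many lines of a local file have each number of fields."""
    counts = Counter()
    with open(path, "rb") as fh:
        for line in fh:
            counts[line.rstrip(b"\r\n").count(delimiter) + 1] += 1
    return counts


def suspicious_lines(path, expected_fields, delimiter=CTRL_A):
    """Yield (line_number, field_count) for lines whose field count is unexpected."""
    with open(path, "rb") as fh:
        for lineno, line in enumerate(fh, start=1):
            n = line.rstrip(b"\r\n").count(delimiter) + 1
            if n != expected_fields:
                yield lineno, n
```

If every part file shows a single field count equal to the table's column count, the data itself is probably not the cause, and the set of files matched by the glob becomes the more likely suspect.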

If you investigate your data set a bit more and narrow down the problem as much as you can with regard to your own data and environment, it is more likely that people here will see what's going on with Vertica, so, more likely that you will get a response.

Adam

Amit_1 · July 2014

Hi Adam,

My source files are created by Hive, which uses '\001' as the delimiter by default.

The files are written to HDFS by Hive.
Sample file names:
-----------------------
/user/temp/000000_0
/user/temp/000001_0
/user/temp/000002_0
/user/temp/000003_0
/user/temp/000004_0
/user/temp/000005_0
/user/temp/000006_0
/user/temp/000007_0

My concern is that I can load all the files correctly by giving the file names individually:

copy schema.temp Hdfs(url='http://78.146.185.212:60070/webhdfs/v1/user/temp/000000_0', username='abc') DELIMITER E'\001' ;

copy schema.temp Hdfs(url='http://78.146.185.212:60070/webhdfs/v1/user/temp/000001_0', username='abc') DELIMITER E'\001' ;

etc.
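
As a stopgap, per-file statements like the ones above can be generated instead of typed by hand. This is a minimal Python sketch, not a feature of the connector: the function name copy_statements is made up, the list of file names has to be supplied by the caller, and the only safety check is a refusal to embed a value containing a single quote.

```python
def copy_statements(table, base_url, file_names, username, delimiter_escape=r"\001"):
    """Build one COPY statement per file, in the same form used in this thread."""
    statements = []
    for name in file_names:
        url = base_url.rstrip("/") + "/" + name
        for value in (table, url, username):
            if "'" in value:
                # No escaping is attempted in this sketch, so refuse such values.
                raise ValueError("single quote not supported: %r" % value)
        statements.append(
            "copy {table} Hdfs(url='{url}', username='{user}') DELIMITER E'{delim}' ;".format(
                table=table, url=url, user=username, delim=delimiter_escape
            )
        )
    return statements


if __name__ == "__main__":
    # The first two sample file names from this post.
    names = ["000000_0", "000001_0"]
    base = "http://78.146.185.212:60070/webhdfs/v1/user/temp"
    for stmt in copy_statements("schema.temp", base, names, "abc"):
        print(stmt)
```

The printed statements could then be saved to a file and run through vsql, which still means one COPY per part file rather than a single wildcard load.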

If there were a data problem, I would not have been able to load the data files individually either.

But when I use a wildcard to select all the files:

copy schema.temp Hdfs(url='http://78.146.185.212:60070/webhdfs/v1/user/temp/*', username='abc') DELIMITER E'\001' ;

Many rows get rejected. Does this have anything to do with using the wildcard (*) to select all the files?

Regards,
-Amit
