Copy Hive part files (ctrl-A delimited, '\001') to Vertica

 Hi,

I am trying to load a few part files produced from a Hive table. These files are in an HDFS directory and are ctrl-A ('\001') delimited.

I am using the Vertica HDFS connector to copy this data.

  copy schema.temp Hdfs(url='http://78.146.185.212:60070/webhdfs/v1/user/temp/*', username='abc') DELIMITER E'\001' ;

This command does not seem to recognize the ctrl-A delimiter ('\001'), so records are getting rejected. Can you please let me know the correct way to load this data?
The COPY command does work for '|' delimited files, though.

Regards,
-Amit

Comments

  • I found the issue: the Vertica COPY command apparently has a problem with wildcards (like *).

     copy schema.temp Hdfs(url='http://78.146.185.212:60070/webhdfs/v1/user/temp/*', username='abc') DELIMITER E'\001' ;

    This does not work: data gets loaded, but lots of records are rejected.

    But when I give the file names individually, like

     copy schema.temp Hdfs(url='http://78.146.185.212:60070/webhdfs/v1/user/temp/file_1', username='abc') DELIMITER E'\001' ;

     copy schema.temp Hdfs(url='http://78.146.185.212:60070/webhdfs/v1/user/temp/file_2', username='abc') DELIMITER E'\001' ;

    This works.

    But the issue is that I have a lot of part files. Is there a way to load all of them without listing them individually?


  • Can someone please reply?

  • Hi Amit,

    It sounds to me like some of your files are formatted differently, or like there is some other issue in your environment.

    Unfortunately, I don't see enough here to help sort things out.

    Have you verified that this glob is matching the set of files that you think it is matching?  Have you verified that each file (not just the first two) loads individually?

    Also -- you say that data does get loaded but lots of records are rejected.  Which rows are rejected?  Which files do those rows come from?  Why are they rejected?  (Vertica gives a reason for rejection for each rejected row.)

    If you investigate your data set a bit more and narrow down the problem as much as you can with regard to your own data and environment, it is more likely that people here will see what's going on with Vertica, so, more likely that you will get a response.

    Adam
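
    The questions above about rejected rows (which rows, and why) can be answered from inside Vertica by sending rejections to a table. What follows is only a minimal sketch: schema.temp_rejects is a hypothetical table name, the SOURCE keyword follows the HDFS connector's documented COPY form rather than the commands as posted in this thread, and whether file_name is populated for connector loads is an assumption worth checking on your version.

```sql
-- Sketch only: schema.temp_rejects is a hypothetical name for the rejects table.
-- REJECTED DATA AS TABLE stores each rejected row together with its reason.
COPY schema.temp
SOURCE Hdfs(url='http://78.146.185.212:60070/webhdfs/v1/user/temp/*', username='abc')
DELIMITER E'\001'
REJECTED DATA AS TABLE schema.temp_rejects;

-- Count rejections per reason to see whether a single cause dominates.
SELECT rejected_reason, COUNT(*) AS rejected_rows
FROM schema.temp_rejects
GROUP BY rejected_reason
ORDER BY rejected_rows DESC;

-- Look at a sample of the raw rejected rows and, if recorded, their source.
SELECT file_name, rejected_reason, rejected_data
FROM schema.temp_rejects
LIMIT 20;
```

    If most rejected rows share one reason, or do not look like table rows at all, that would suggest the wildcard is picking up something other than the eight part files.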
  • Hi Adam,

    My source files are created from Hive, which uses '\001' as the delimiter by default.

    My files are created in HDFS from Hive:
    Sample File names:
    -----------------------
    /user/temp/000000_0
    /user/temp/000001_0
    /user/temp/000002_0
    /user/temp/000003_0
    /user/temp/000004_0
    /user/temp/000005_0
    /user/temp/000006_0
    /user/temp/000007_0

     My concern is this: I can load all the files correctly by giving the file names individually:

    copy schema.temp Hdfs(url='http://78.146.185.212:60070/webhdfs/v1/user/temp/000000_0', username='abc') DELIMITER E'\001' ;

     copy schema.temp Hdfs(url='http://78.146.185.212:60070/webhdfs/v1/user/temp/000001_0', username='abc') DELIMITER E'\001' ;

    etc.

    If there were a data problem, I would not have been able to load the data files individually either.

    But when I use a wildcard to select all the files:

     copy schema.temp Hdfs(url='http://78.146.185.212:60070/webhdfs/v1/user/temp/*', username='abc') DELIMITER E'\001' ;


    Many rows get rejected. Does this have anything to do with my using a wildcard (*) to select all the files?

    Regards,
    -Amit
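
    One thing that could be tried, since every sample file name listed above starts with 0, is a narrower pattern, so that the load cannot match any other entry that may be sitting in /user/temp. This is a sketch under assumptions: that the directory holds entries besides the part files (not confirmed anywhere in this thread), and that the connector accepts a partial glob such as 0* in the last path component, which should be checked against the connector documentation for your version.

```sql
-- Sketch only: the 0* pattern is an assumption based on the sample file names
-- (000000_0 ... 000007_0); adjust it if the real part files are named differently.
COPY schema.temp
SOURCE Hdfs(url='http://78.146.185.212:60070/webhdfs/v1/user/temp/0*', username='abc')
DELIMITER E'\001';
```

    If this loads cleanly while the plain * pattern still rejects rows, the rejected rows most likely come from an extra entry in the directory rather than from the '\001' delimiter.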
