Complete file getting rejected while using Copy command using webhdfs

Hi,

I am trying to copy part files from HDFS into Vertica using the Vertica WebHDFS connector, and I am facing a strange issue. It would be great if someone could help me with this:

There are four files in HDFS:

/vertica/data/7th_aug/000000_0
/vertica/data/7th_aug/000000_1
/vertica/data/7th_aug/000000_2
/vertica/data/7th_aug/000000_3


Each file has 110000 records, for a total of 440000 records.

Below is the copy command

copy abc.temp source Hdfs(url='http://xxmn301301.corp.abc.net:50070/webhdfs/v1/vertica/data/7th_aug/*', username='vertica') DELIMITER E'\001';
 Rows Loaded
-------------
      220000
(1 row)

Out of the 4 files, 2 are loaded completely and the other 2 are rejected completely.



vertica=> select count(*) from abc.temp  where country_id = 000000;
 count
--------
 110000
(1 row)

vertica=> select count(*) from abc.temp  where country_id = 000001;
 count
-------
     0
(1 row)

vertica=> select count(*) from abc.temp  where country_id = 000002;
 count
--------
 110000
(1 row)

vertica=> select count(*) from abc.temp  where country_id = 000003
vertica-> ;
 count
-------
     0
(1 row)



The files, including the ones that are getting rejected, are Control-A delimited. Below are the errors I am getting for the load:



 2014-08-10 20:03:12.945102-07 |            1 | ERROR       | Invalid integer format '</html>' for column 1 (country_id)

 2014-08-10 20:03:12.945065-07 |            1 | ERROR       | Invalid integer format '</body>' for column 1 (country_id)

 2014-08-10 20:03:12.945027-07 |            1 | ERROR       | Invalid integer format '<br/>                                                ' for column 1 (country_id)

 2014-08-10 20:03:12.944987-07 |            1 | ERROR       | Invalid integer format '<br/>                                                ' for column 1 (country_id)

 2014-08-10 20:03:12.944271-07 |            1 | ERROR       | Invalid integer format '<pre>    </pre></p><hr /><i><small>Powered by Jetty://</small></i><br/>                                                ' for column 1 (country_id)

 2014-08-10 20:03:12.944229-07 |            1 | ERROR       | Invalid integer format '<p>Problem accessing /webhdfs/v1/vertica/data/7th_aug/000000_3. Reason:' for column 1 (country_id)

 2014-08-10 20:03:12.944191-07 |            1 | ERROR       | Invalid integer format '<body><h2>HTTP ERROR 401</h2>' for column 1 (country_id)

 2014-08-10 20:03:12.944153-07 |            1 | ERROR       | Invalid integer format '</head>' for column 1 (country_id)

 2014-08-10 20:03:12.944116-07 |            1 | ERROR       | Invalid integer format '<title>Error 401 </title>' for column 1 (country_id)

 2014-08-10 20:03:12.944077-07 |            1 | ERROR       | Invalid integer format '<meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1"/>' for column 1 (country_id)

 2014-08-10 20:03:12.944037-07 |            1 | ERROR       | Invalid integer format '<head>' for column 1 (country_id)

 2014-08-10 20:03:12.943978-07 |            1 | ERROR       | Invalid integer format '<html>' for column 1 (country_id)

 2014-08-10 20:03:12.934761-07 |            1 | ERROR       | Invalid integer format '</html>' for column 1 (country_id)

 2014-08-10 20:03:12.934725-07 |            1 | ERROR       | Invalid integer format '</body>' for column 1 (country_id)

 2014-08-10 20:03:12.934687-07 |            1 | ERROR       | Invalid integer format '<br/>                                                ' for column 1 (country_id)

 2014-08-10 20:03:12.933925-07 |            1 | ERROR       | Invalid integer format '<pre>    </pre></p><hr /><i><small>Powered by Jetty://</small></i><br/>                                                ' for column 1 (country_id)

 2014-08-10 20:03:12.93388-07  |            1 | ERROR       | Invalid integer format '<p>Problem accessing /webhdfs/v1/vertica/data/7th_aug/000000_1. Reason:' for column 1 (country_id)

 2014-08-10 20:03:12.933842-07 |            1 | ERROR       | Invalid integer format '<body><h2>HTTP ERROR 401</h2>' for column 1 (country_id)

 2014-08-10 20:03:12.933805-07 |            1 | ERROR       | Invalid integer format '</head>' for column 1 (country_id)

 2014-08-10 20:03:12.933766-07 |            1 | ERROR       | Invalid integer format '<title>Error 401 </title>' for column 1 (country_id)

 2014-08-10 20:03:12.93372-07  |            1 | ERROR       | Invalid integer format '<meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1"/>' for column 1 (country_id)

 2014-08-10 20:03:12.933649-07 |            1 | ERROR       | Invalid integer format '<head>' for column 1 (country_id)

 2014-08-10 20:03:12.933566-07 |            1 | ERROR       | Invalid integer format '<html>' for column 1 (country_id)

I cannot understand why some of these files load while others, which are exactly alike, are rejected. Let me know if you need any further details.

Regards,
-Amit

Comments

  • Can someone please update?
  • Prasanta_Pal (Employee)
    If you are a customer, you can open a support case, and a technical support engineer will work closely with you.

    Try loading the files one by one and see if that works. If it does, we then need to find out why loading all the files at once fails.

  • Hi Amit,

    If you look at the contents of your rejected data, it looks like WebHDFS isn't serving your file at all; it's instead serving an error page indicating that you don't have permission to access the file in question.

    I'd suggest checking your HDFS server's configuration.  Do you have permission to access the file?  Is your HDFS installation behind a proxy or other host that restricts the number of files that you are allowed to download simultaneously?

    Adam
  • Hey Adam,
    Thanks for your reply. I'm working with Nimmi Gupta in support on this issue.
    It seems that WebHDFS is rejecting half of the part files we're trying to load. What's strange is that if I copy the part files one at a time, it works fine.
    Yes, I have access to those files.


  • Nimmi_gupta (Employee)
    Hi Amit, we are trying to reproduce this, but so far we don't see any problem. I have updated the case with detailed information. Please check.

    There are 10 files in total, with 1100000 records to copy.
    wc -l 00000*
    110000 000000_0
    110000 000001_0
    110000 000002_0
    110000 000003_0
    110000 000004_0
    110000 000005_0
    110000 000006_0
    110000 000007_0
    110000 000008_0
    110000 000009_0
    1100000 total

    dbadmin=> select count(*) from sample_test_client;
    count
    -------
    0
    (1 row)

    dbadmin=> COPY sample_test_client SOURCE Hdfs(url='http://10.50.54.94:50070/webhdfs/v1/tmp/00000*', username='root');
    Rows Loaded
    -------------
    1100000
    (1 row)

    dbadmin=> select count(*) from sample_test_client where sbm_country_id = 000001;
    count
    --------
    110000
    (1 row)

    dbadmin=> select count(*) from sample_test_client where sbm_country_id = 000002;
    count
    --------
    110000
    (1 row)

    dbadmin=> select count(*) from sample_test_client where sbm_country_id = 000003;
    count
    --------
    110000
    (1 row)


    thanks
    Nimmi
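
Adam's point, that the rejected rows are a WebHDFS error page rather than file data, can be checked outside Vertica by asking WebHDFS about each part file directly and looking at the HTTP status it returns. The following is only a rough sketch under assumptions: the helper names (`webhdfs_url`, `http_error_in`, `failed_path`, `probe`) are made up for this illustration, it assumes the cluster uses simple `user.name` authentication as the COPY statement in the question suggests, and it says nothing about why the 401 occurs only when several files are requested at once.

```python
import re
import urllib.error
import urllib.parse
import urllib.request


def webhdfs_url(namenode, path, user, op="GETFILESTATUS"):
    """Build a WebHDFS REST URL for an absolute HDFS path.

    `namenode` is "host:port"; `user` is passed as the user.name
    query parameter (simple authentication, an assumption here).
    """
    query = urllib.parse.urlencode({"op": op, "user.name": user})
    return f"http://{namenode}/webhdfs/v1{urllib.parse.quote(path)}?{query}"


def http_error_in(text):
    """Return the status code if text contains a Jetty-style
    'HTTP ERROR nnn' marker, otherwise None."""
    match = re.search(r"HTTP ERROR (\d{3})", text)
    return int(match.group(1)) if match else None


def failed_path(text):
    """Return the path named in a 'Problem accessing ... Reason:'
    line of the error page, otherwise None."""
    match = re.search(r"Problem accessing (\S+)\. Reason:", text)
    return match.group(1) if match else None


def probe(namenode, path, user, timeout=10):
    """Ask WebHDFS for the file status and return the HTTP status code.

    HTTP-level failures (such as 401) are returned as codes; connection
    problems (DNS, refused connection) still raise URLError.
    """
    url = webhdfs_url(namenode, path, user)
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        return err.code


# Example use against the files from the question (needs network access):
#   for i in range(4):
#       p = f"/vertica/data/7th_aug/000000_{i}"
#       print(p, probe("xxmn301301.corp.abc.net:50070", p, "vertica"))
```

If `probe` returns 200 for every file when called one at a time, but the rejected rows from the wildcard COPY still contain `HTTP ERROR 401`, that would be consistent with the one-at-a-time loads working and would point toward something that only shows up under concurrent requests. Running `http_error_in` and `failed_path` over the rejected values shows which files were answered with an error page.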

