Entire files rejected when using COPY with the WebHDFS connector
Hi,
I am trying to copy part files from HDFS to Vertica using the Vertica WebHDFS connector, and I am facing a weird issue. It would be great if someone could help me with this.
There are four files in HDFS:
/vertica/data/7th_aug/000000_0
/vertica/data/7th_aug/000000_1
/vertica/data/7th_aug/000000_2
/vertica/data/7th_aug/000000_3
Each file has 110000 records, so there are 440000 records in total.
Below is the COPY command:
copy abc.temp source Hdfs(url='http://xxmn301301.corp.abc.net:50070/webhdfs/v1/vertica/data/7th_aug/*', username='vertica') DELIMITER E'\001';
Rows Loaded
-------------
220000
(1 row)
Out of the 4 files, 2 are loaded completely and 2 are rejected completely.
vertica=> select count(*) from abc.temp where country_id = 000000;
count
--------
110000
(1 row)
vertica=> select count(*) from abc.temp where country_id = 000001;
count
-------
0
(1 row)
vertica=> select count(*) from abc.temp where country_id = 000002;
count
--------
110000
(1 row)
vertica=> select count(*) from abc.temp where country_id = 000003
vertica-> ;
count
-------
0
(1 row)
All of the files, including the ones that are rejected, are Control-A delimited. Below are the errors I am getting for the load:
2014-08-10 20:03:12.945102-07 | 1 | ERROR | Invalid integer format '</html>' for column 1 (country_id)
2014-08-10 20:03:12.945065-07 | 1 | ERROR | Invalid integer format '</body>' for column 1 (country_id)
2014-08-10 20:03:12.945027-07 | 1 | ERROR | Invalid integer format '<br/> ' for column 1 (country_id)
2014-08-10 20:03:12.944987-07 | 1 | ERROR | Invalid integer format '<br/> ' for column 1 (country_id)
2014-08-10 20:03:12.944271-07 | 1 | ERROR | Invalid integer format '<pre> </pre></p><hr /><i><small>Powered by Jetty://</small></i><br/> ' for column 1 (country_id)
2014-08-10 20:03:12.944229-07 | 1 | ERROR | Invalid integer format '<p>Problem accessing /webhdfs/v1/vertica/data/7th_aug/000000_3. Reason:' for column 1 (country_id)
2014-08-10 20:03:12.944191-07 | 1 | ERROR | Invalid integer format '<body><h2>HTTP ERROR 401</h2>' for column 1 (country_id)
2014-08-10 20:03:12.944153-07 | 1 | ERROR | Invalid integer format '</head>' for column 1 (country_id)
2014-08-10 20:03:12.944116-07 | 1 | ERROR | Invalid integer format '<title>Error 401 </title>' for column 1 (country_id)
2014-08-10 20:03:12.944077-07 | 1 | ERROR | Invalid integer format '<meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1"/>' for column 1 (country_id)
2014-08-10 20:03:12.944037-07 | 1 | ERROR | Invalid integer format '<head>' for column 1 (country_id)
2014-08-10 20:03:12.943978-07 | 1 | ERROR | Invalid integer format '<html>' for column 1 (country_id)
2014-08-10 20:03:12.934761-07 | 1 | ERROR | Invalid integer format '</html>' for column 1 (country_id)
2014-08-10 20:03:12.934725-07 | 1 | ERROR | Invalid integer format '</body>' for column 1 (country_id)
2014-08-10 20:03:12.934687-07 | 1 | ERROR | Invalid integer format '<br/> ' for column 1 (country_id)
2014-08-10 20:03:12.933925-07 | 1 | ERROR | Invalid integer format '<pre> </pre></p><hr /><i><small>Powered by Jetty://</small></i><br/> ' for column 1 (country_id)
2014-08-10 20:03:12.93388-07 | 1 | ERROR | Invalid integer format '<p>Problem accessing /webhdfs/v1/vertica/data/7th_aug/000000_1. Reason:' for column 1 (country_id)
2014-08-10 20:03:12.933842-07 | 1 | ERROR | Invalid integer format '<body><h2>HTTP ERROR 401</h2>' for column 1 (country_id)
2014-08-10 20:03:12.933805-07 | 1 | ERROR | Invalid integer format '</head>' for column 1 (country_id)
2014-08-10 20:03:12.933766-07 | 1 | ERROR | Invalid integer format '<title>Error 401 </title>' for column 1 (country_id)
2014-08-10 20:03:12.93372-07 | 1 | ERROR | Invalid integer format '<meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1"/>' for column 1 (country_id)
2014-08-10 20:03:12.933649-07 | 1 | ERROR | Invalid integer format '<head>' for column 1 (country_id)
2014-08-10 20:03:12.933566-07 | 1 | ERROR | Invalid integer format '<html>' for column 1 (country_id)
I cannot understand how files of exactly the same kind are loaded in some cases and rejected in others. Let me know if you need any further details.
Regards,
-Amit
Comments
Try loading the files one by one and see if that works. If it does, then we have to find out why the load fails when all the files are loaded at the same time.
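One way to try the one-by-one load is to generate a separate COPY statement per part file instead of using the wildcard. The snippet below is only a sketch: the helper name copy_statements is made up for illustration, and the host, table, username and paths are simply the ones quoted in the question.

```python
# Hypothetical helper: build one COPY statement per HDFS part file so that
# each file is loaded (and can fail) on its own instead of via the wildcard.
BASE_URL = "http://xxmn301301.corp.abc.net:50070/webhdfs/v1"  # host/port from the post


def copy_statements(table, paths, username, delimiter=r"E'\001'"):
    """Return one COPY ... SOURCE Hdfs(...) statement per absolute HDFS path."""
    statements = []
    for path in paths:
        statements.append(
            f"COPY {table} SOURCE Hdfs(url='{BASE_URL}{path}', "
            f"username='{username}') DELIMITER {delimiter};"
        )
    return statements


# The four part files named in the question.
paths = [f"/vertica/data/7th_aug/000000_{i}" for i in range(4)]
stmts = copy_statements("abc.temp", paths, "vertica")
for stmt in stmts:
    print(stmt)
```

Running the printed statements one at a time in vsql should show whether each file loads on its own.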
If you look at the contents of your rejected data, it looks like WebHDFS isn't serving your file at all; it's instead serving an error page indicating that you don't have permission to access the file in question.
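To make that reading of the rejected data concrete, here is a small sketch that pulls the HTTP status and the affected paths out of the rejection messages. The sample lines are copied from the question; the function name summarize and the two regular expressions are illustrative assumptions that match only this Jetty error-page wording.

```python
import re

# A few rejection messages copied from the question above.
REJECTIONS = [
    "Invalid integer format '<title>Error 401 </title>' for column 1 (country_id)",
    "Invalid integer format '<body><h2>HTTP ERROR 401</h2>' for column 1 (country_id)",
    "Invalid integer format '<p>Problem accessing /webhdfs/v1/vertica/data/7th_aug/000000_3. Reason:' for column 1 (country_id)",
    "Invalid integer format '<p>Problem accessing /webhdfs/v1/vertica/data/7th_aug/000000_1. Reason:' for column 1 (country_id)",
]

# Patterns for the Jetty error page text seen in the rejected rows.
STATUS_RE = re.compile(r"HTTP ERROR (\d{3})")
PATH_RE = re.compile(r"Problem accessing (\S+)\. Reason:")


def summarize(rejections):
    """Return (set of HTTP status codes, sorted list of affected paths)."""
    statuses, paths = set(), set()
    for line in rejections:
        statuses.update(int(code) for code in STATUS_RE.findall(line))
        paths.update(PATH_RE.findall(line))
    return statuses, sorted(paths)


statuses, paths = summarize(REJECTIONS)
print(statuses)
print(paths)
```

For the sample lines this reports status 401 for 000000_1 and 000000_3, which are the two files with zero rows loaded.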
I'd suggest checking your HDFS server's configuration. Do you have permission to access the file? Is your HDFS installation behind a proxy or other host that restricts the number of files that you are allowed to download simultaneously?
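A rough way to check both questions outside Vertica is to request the files directly through the WebHDFS REST API, first one at a time and then in parallel. The sketch below assumes the cluster accepts simple authentication through the user.name query parameter; the function names are hypothetical, and the host is the one from the question. urllib follows the namenode's redirect to a datanode on its own.

```python
import urllib.error
import urllib.request
from concurrent.futures import ThreadPoolExecutor
from urllib.parse import quote, urlencode

NAMENODE = "http://xxmn301301.corp.abc.net:50070"  # host and port from the question


def open_url(path, user):
    """Build the WebHDFS OPEN URL for an absolute HDFS path."""
    query = urlencode({"op": "OPEN", "user.name": user})
    return f"{NAMENODE}/webhdfs/v1{quote(path)}?{query}"


def http_status(url):
    """Return the HTTP status of a GET on url (error statuses included)."""
    try:
        with urllib.request.urlopen(url, timeout=30) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        return err.code


def probe(paths, user, workers, fetch=http_status):
    """Request all paths using `workers` parallel threads; map path -> status."""
    urls = [open_url(p, user) for p in paths]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return dict(zip(paths, pool.map(fetch, urls)))


# Only prints the URL that would be requested; no network access happens here.
print(open_url("/vertica/data/7th_aug/000000_1", "vertica"))
```

Calling probe(paths, "vertica", workers=1) and then probe(paths, "vertica", workers=len(paths)) against the real cluster would show whether 401 responses appear only when several files are requested at once.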
Adam
Thanks for your reply. I'm working with Nimmi Gupta in support on this issue.
It seems that WebHDFS is rejecting half of the part files we're trying to load. What's strange is that if I copy the part files one at a time, it works fine.
Yes, I have access to those files.
There are 10 files in total, with 1100000 records to copy.
wc -l 00000*
110000 000000_0
110000 000001_0
110000 000002_0
110000 000003_0
110000 000004_0
110000 000005_0
110000 000006_0
110000 000007_0
110000 000008_0
110000 000009_0
1100000 total
dbadmin=> select count(*) from sample_test_client;
count
-------
0
(1 row)
dbadmin=> COPY sample_test_client SOURCE Hdfs(url='http://10.50.54.94:50070/webhdfs/v1/tmp/00000*', username='root');
Rows Loaded
-------------
1100000
(1 row)
dbadmin=> select count(*) from sample_test_client where sbm_country_id = 000001;
count
--------
110000
(1 row)
dbadmin=> select count(*) from sample_test_client where sbm_country_id = 000002;
count
--------
110000
(1 row)
dbadmin=> select count(*) from sample_test_client where sbm_country_id = 000003;
count
--------
110000
(1 row)
thanks
Nimmi