COPY from S3 fails
COPY from S3 fails when I use a glob (*); however, when I try a similar command naming each file explicitly, it succeeds.
dbadmin=> COPY poc.hrly_s3 FROM 's3://bucket/data/*' PARQUET;
ERROR 6764: Error opening file [s3://s3://bucket/data/1a4701af353ea0f2-773898e30000000f_415672635_data.0.parq] for read: Access Denied
ERROR 6764: Error opening file [s3://bucket/data/3/1a4701af353ea0f2-773898e300000016_386770378_data.0.parq] for read: S3 operation (open) was canceled
ERROR 7253: Error reading from Parquet parser input stream [s3://bucket/data/1a4701af353ea0f2-773898e300000010_437174618_data.0.parq]: S3 operation (read) was canceled
Comments
Can you run
select set_debug_log('SAL', 'ALL');
and rerun the query that hits the error, then post the relevant logs?
2018-02-12 16:47:04.551 Init Session:7f2d0ffff700-a0000000042f88 [Session] [Query] TX:a0000000042f88(v_verticadb_node0001-321247:0x54744) COPY poc.hrly_s3 FROM 's3://bucket/data/' PARQUET;
2018-02-12 16:47:04.553 Init Session:7f2d0ffff700-a0000000042f88 [SAL] Adding (txn=a0000000042f88,statement=4) to HDFS cancel callback monitoring
2018-02-12 16:47:04.553 Init Session:7f2d0ffff700-a0000000042f88 [SAL] [S3UDFS] [s3://bucket/data/]: virtual void SAL::S3FileSystem::glob(const string&, std::vector<std::pair<std::basic_string, stat> >&, SAL::GlobPruner&, bool) const, Glob
2018-02-12 16:47:04.610 Init Session:7f2d0ffff700-a0000000042f88 [SAL] Disposing cURL handle 0x7f28a0243a00
2018-02-12 16:47:04.610 Init Session:7f2d0ffff700-a0000000042f88 [SAL] Disposing cURL handle 0x7f28a0243a00
2018-02-12 16:47:04.610 Init Session:7f2d0ffff700-a0000000042f88 [SAL] Disposing cURL handle 0x7f28a0243a00
2018-02-12 16:47:04.610 Init Session:7f2d0ffff700-a0000000042f88 [SAL] Disposing cURL handle 0x7f28a0243a00
2018-02-12 16:47:04.610 Init Session:7f2d0ffff700-a0000000042f88 [SAL] Disposing cURL handle 0x7f28a0243a00
2018-02-12 16:47:04.610 Init Session:7f2d0ffff700-a0000000042f88 [SAL] Disposing cURL handle 0x7f28a0243a00
2018-02-12 16:47:04.610 Init Session:7f2d0ffff700-a0000000042f88 [SAL] Disposing cURL handle 0x7f28a0243a00
2018-02-12 16:47:04.610 Init Session:7f2d0ffff700-a0000000042f88 [SAL] Disposing cURL handle 0x7f28a0243a00
2018-02-12 16:47:04.610 Init Session:7f2d0ffff700-a0000000042f88 [SAL] Removing (txn=a0000000042f88,statement=4( from HDFS cancel callback monitoring
2018-02-12 16:47:04.610 Init Session:7f2d0ffff700-a0000000042f88 [SAL] Adding (txn=a0000000042f88,statement=4) to HDFS cancel callback monitoring
2018-02-12 16:47:04.610 Init Session:7f2d0ffff700-a0000000042f88 @v_verticadb_node0001: 00000/6610: 0 percent of data matched to local nodes (min 0, max 0 percent per node). 0 percent of data can be loaded within same racks
2018-02-12 16:47:04.610 Init Session:7f2d0ffff700-a0000000042f88 [SAL] Removing (txn=a0000000042f88,statement=4( from HDFS cancel callback monitoring
2018-02-12 16:47:04.619 Init Session:7f2d0ffff700-a0000000042f88 [Opt] Dump SendRecvNodes Map :
2018-02-12 16:47:04.625 Init Session:7f2d0ffff700-a0000000042f88 [Opt] Load is using default ExprEval, not LoadExprEval
2018-02-12 16:47:04.646 Init Session:7f2d0ffff700-a0000000042f88 [SAL] Adding (txn=a0000000042f88,statement=4) to HDFS cancel callback monitoring
2018-02-12 16:47:04.646 Init Session:7f2d0ffff700-a0000000042f88 [SAL] Removing (txn=a0000000042f88,statement=4( from HDFS cancel callback monitoring
2018-02-12 16:47:04.646 Init Session:7f2d0ffff700-a0000000042f88 [SAL] Adding (txn=a0000000042f88,statement=4) to HDFS cancel callback monitoring
2018-02-12 16:47:04.646 PlanParquetFileQueue:7f2d24ffa700 [Util] ActiveQueue: beginning processing
2018-02-12 16:47:04.646 PlanParquetFileQueue:7f2d0effd700 [Util] ActiveQueue: beginning processing
2018-02-12 16:47:04.646 PlanParquetFileQueue:7f2d0f7fe700 [Util] ActiveQueue: beginning processing
2018-02-12 16:47:04.646 PlanParquetFileQueue:7f2d2665e700 [Util] ActiveQueue: beginning processing
2018-02-12 16:47:04.646 PlanParquetFileQueue:7f2d0e7fc700 [Util] ActiveQueue: beginning processing
2018-02-12 16:47:04.646 PlanParquetFileQueue:7f2d0dffb700 [Util] ActiveQueue: beginning processing
2018-02-12 16:47:04.646 PlanParquetFileQueue:7f2d05ffe700 [Util] ActiveQueue: beginning processing
2018-02-12 16:47:04.646 PlanParquetFileQueue:7f2d057fd700 [Util] ActiveQueue: beginning processing
2018-02-12 16:47:04.646 PlanParquetFileQueue:7f2cfe7fc700 [Util] ActiveQueue: beginning processing
2018-02-12 16:47:04.646 PlanParquetFileQueue:7f2cfd7fa700 [Util] ActiveQueue: beginning processing
2018-02-12 16:47:04.646 PlanParquetFileQueue:7f2d24ffa700 [SAL] [S3UDFS] File opened: s3://bucket/data/1a4701af353ea0f2-773898e300000016_386770378_data.0.parq
2018-02-12 16:47:04.660 PlanParquetFileQueue:7f2d24ffa700 [SAL] [S3UDFS] S3FileCancelHandle :: Adding (txn=a0000000042f88,statement=4) to S3FS cancel callback monitoring
2018-02-12 16:47:04.660 PlanParquetFileQueue:7f2d24ffa700 [SAL] [S3UDFS] [s3://bucket/data/1a4701af353ea0f2-773898e300000016_386770378_data.0.parq]: virtual Vertica::UDFileOperator* SAL::S3FileSystem::open(const char*, int, mode_t, bool) const, Opened file
2018-02-12 16:47:04.660 PlanParquetFileQueue:7f2d24ffa700 [SAL] [S3UDFS] [s3://bucket/data/1a4701af353ea0f2-773898e300000016_386770378_data.0.parq]: seek=123843460 (0)
2018-02-12 16:47:04.660 PlanParquetFileQueue:7f2d24ffa700 [SAL] [S3UDFS] [s3://bucket/data/1a4701af353ea0f2-773898e300000016_386770378_data.0.parq]: virtual size_t SAL::S3FileOperator::read(void*, size_t), 8
2018-02-12 16:47:04.703 Spread Client:7f2d04ff9700 [VMPI] Hurry: processing for session v_verticadb_node0001-321247:0x54744, call 18 of type 8
2018-02-12 16:47:04.703 Spread Client:7f2d04ff9700 [Dist] Dist::cancelPlan: canceling state NO EXECUTION(0): a0000000042fa7
2018-02-12 16:47:04.703 Spread Client:7f2d04ff9700 [Dist] Dist::cancelPlan: canceled state NO EXECUTION(0): a0000000042fa7
2018-02-12 16:47:04.703 Spread Client:7f2d04ff9700 [Dist] Dist::cancelPlan: canceling state CANCELED(4): a0000000042fa7
2018-02-12 16:47:04.703 Spread Client:7f2d04ff9700 [Dist] Dist::cancelPlan: already canceled a0000000042fa7
2018-02-12 16:47:04.703 Spread Client:7f2d04ff9700 [Dist] Dist::cancelPlan: canceled state CANCELED(4): a0000000042fa7
2018-02-12 16:47:04.703 Spread Client:7f2d04ff9700 [SAL] HDFSWatchdog is canceling HDFS operations in (txn=a0000000042f88,statement=4)
2018-02-12 16:47:04.703 Spread Client:7f2d04ff9700 [SAL] An HDFS s3://bucket/data/1a4701af353ea0f2-773898e300000016_386770378_data.0.parq operation in (txn=a0000000042f88,statement=4) is going to be canceled
2018-02-12 16:47:04.703 Spread Client:7f2d04ff9700 [SAL] [S3UDFS] Initiating cancel for [s3://bucket/data/1a4701af353ea0f2-773898e300000016_386770378_data.0.parq]
2018-02-12 16:47:04.703 Spread Client:7f2d04ff9700 [SAL] [S3UDFS] S3FileCancelHandle:: cancelling (txn=45035996273979272,statement=4)
2018-02-12 16:47:04.703 Spread Client:7f2d04ff9700 [SAL] An HDFS read : s3://bucket/data/1a4701af353ea0f2-773898e300000016_386770378_data.0.parq operation in (txn=a0000000042f88,statement=4) is going to be canceled
2018-02-12 16:47:04.703 Spread Client:7f2d04ff9700 [SAL] HDFSWatchdog is done canceling HDFS operations in (txn=a0000000042f88,statement=4)
2018-02-12 16:47:04.749 PlanParquetFileQueue:7f2d24ffa700 [SAL] [S3UDFS] [SDK] [ERROR] 2018-02-12 16:47:04 AWSClient [139831870990080] No response body. Response code: 4294967295
2018-02-12 16:47:04.749 PlanParquetFileQueue:7f2d24ffa700 [SAL] [S3UDFS] [s3://bucket/data/1a4701af353ea0f2-773898e300000016_386770378_data.0.parq]: virtual void SAL::S3FileSystem::close(Vertica::UDFileOperator*) const, Closed file
2018-02-12 16:47:04.749 PlanParquetFileQueue:7f2d24ffa700 [SAL] [S3UDFS] [s3://bucket/data/1a4701af353ea0f2-773898e300000016_386770378_data.0.parq]: void SAL::S3FileOperator::finalize(),
2018-02-12 16:47:04.749 PlanParquetFileQueue:7f2d24ffa700 [Util] ActiveQueue: processing done
2018-02-12 16:47:04.749 PlanParquetFileQueue:7f2d2665e700 [Util] ActiveQueue: processing done
2018-02-12 16:47:04.749 PlanParquetFileQueue:7f2d0f7fe700 [Util] ActiveQueue: processing done
2018-02-12 16:47:04.749 PlanParquetFileQueue:7f2d0effd700 [Util] ActiveQueue: processing done
2018-02-12 16:47:04.749 PlanParquetFileQueue:7f2d0e7fc700 [Util] ActiveQueue: processing done
2018-02-12 16:47:04.749 PlanParquetFileQueue:7f2d0dffb700 [Util] ActiveQueue: processing done
2018-02-12 16:47:04.749 PlanParquetFileQueue:7f2d05ffe700 [Util] ActiveQueue: processing done
2018-02-12 16:47:04.749 PlanParquetFileQueue:7f2d057fd700 [Util] ActiveQueue: processing done
2018-02-12 16:47:04.749 PlanParquetFileQueue:7f2cfe7fc700 [Util] ActiveQueue: processing done
2018-02-12 16:47:04.749 PlanParquetFileQueue:7f2cfd7fa700 [Util] ActiveQueue: processing done
2018-02-12 16:47:04.749 Init Session:7f2d0ffff700-a0000000042f88 [SAL] Removing (txn=a0000000042f88,statement=4( from HDFS cancel callback monitoring
2018-02-12 16:47:04.774 Init Session:7f2d0ffff700-a0000000042f88 @v_verticadb_node0002: 42501/6764: Error opening file [s3://bucket/data/1a4701af353ea0f2-773898e30000000f_415672635_data.0.parq] for read: Access Denied
LOCATION: initialize, /scratch_a/release/svrtar6887/vbuild/vertica/Util/ParquetInterface.cpp:173
2018-02-12 16:47:04.774 Init Session:7f2d0ffff700-a0000000042f88 [SAL] Adding (txn=a0000000042f88,statement=4) to HDFS cancel callback monitoring
2018-02-12 16:47:04.774 Init Session:7f2d0ffff700-a0000000042f88 [SAL] Removing (txn=a0000000042f88,statement=4( from HDFS cancel callback monitoring
2018-02-12 16:47:04.774 Init Session:7f2d0ffff700-a0000000042f88 [SAL] Adding (txn=a0000000042f88,statement=4) to HDFS cancel callback monitoring
2018-02-12 16:47:04.774 Init Session:7f2d0ffff700-a0000000042f88 [SAL] Removing (txn=a0000000042f88,statement=4( from HDFS cancel callback monitoring
2018-02-12 16:47:04.776 Init Session:7f2d0ffff700-a0000000042f88 @v_verticadb_node0001: VX001/7253: Error reading from Parquet parser input stream [s3://bucket/data/1a4701af353ea0f2-773898e300000016_386770378_data.0.parq]: S3 operation (read) was canceled
LOCATION: ReadUdfo,
Thanks. So when you copy just 's3://bucket/data/1a4701af353ea0f2-773898e300000016_386770378_data.0.parq' explicitly, it works without error? Could you attach those logs?
Hi!
Maybe it's just a log message and nothing more, but the protocol is listed twice. The file path is:
s3://s3://bucket....
That was a typo when I redacted the path; I can't show my production bucket name. It's not related.
mtracy, here are the logs:
2018-02-12 17:31:47.140 EEThread:7f2cfe7fc700-a0000000043107 [SAL] serviced from reservation: 24 requested 3706984 used 4194304 reserved
2018-02-12 17:31:47.140 EEThread:7f2cfe7fc700-a0000000043107 [SAL] serviced from reservation: 24 requested 3707008 used 4194304 reserved
2018-02-12 17:31:47.140 EEThread:7f2cfe7fc700-a0000000043107 [SAL] set savepoint for 23201 statement 2 -> block 9054, offset 11
2018-02-12 17:31:47.156 Init Session:7f2d05ffe700-a0000000043107 [SAL] Removing (txn=a0000000043107,statement=2( from HDFS cancel callback monitoring
2018-02-12 17:31:47.174 Init Session:7f2d05ffe700-a0000000043107 [SAL] Adding (txn=a0000000043107,statement=2) to HDFS cancel callback monitoring
2018-02-12 17:31:47.175 Init Session:7f2d05ffe700-a0000000043107 [SAL] Removing (txn=a0000000043107,statement=2( from HDFS cancel callback monitoring
2018-02-12 17:31:47.175 Init Session:7f2d05ffe700-a0000000043107 [VMPI] [abandon plan] - New synced txn catalog version : 188 (Old : 0)
2018-02-12 17:31:47.180 Init Session:7f2d05ffe700-a0000000043107 [Txn] Starting Commit: Txn: a0000000043107 'COPY poc.hrly_s3 FROM 's3://tableau-360-production/v_hrly_flash_dtl_s3/1a4701af353ea0f2-773898e30000001b_1645706297_data.0.parq' PARQUET;' 1804
2018-02-12 17:31:47.180 Init Session:7f2d05ffe700-a0000000043107 [SAL] Adding (txn=a0000000043107,statement=2) to HDFS cancel callback monitoring exclusion list
2018-02-12 17:31:47.180 Init Session:7f2d05ffe700-a0000000043107 [TM] updateCPE: Updating projection hrly_s3_b1 CPE to 0x108 from 0xffffffffffffffff for wos 0xa000000036f926
2018-02-12 17:31:47.182 Init Session:7f2d05ffe700-a0000000043107 [TM] updateCPE: Updating projection hrly_s3_b0 CPE to 0x108 from 0xffffffffffffffff for wos 0xa000000036f866
2018-02-12 17:31:47.185 Init Session:7f2d05ffe700-a0000000043107 [SAL] Removing (txn=a0000000043107,statement=2) from HDFS cancel callback monitoring exclusion list
2018-02-12 17:31:47.186 Init Session:7f2d05ffe700 [Txn] Commit Complete: Txn: a0000000043107 at epoch 0x109 and new global catalog version 1805
2018-02-12 17:31:47.186 Init Session:7f2d05ffe700 [EE] [Bulkload] AUTO copy targeting hrly_s3 from pipe committed at version 1805.
2018-02-12 17:31:49.000 AnalyzeRowCount:7f2cf6ffd700 [Command] TMTask: row count analyze - start
2018-02-12 17:31:49.002 AnalyzeRowCount:7f2cf6ffd700-a000000004310a [Txn] Begin Txn: a000000004310a 'getRowCountsForProj'
2018-02-12 17:31:49.078 AnalyzeRowCount:7f2cf6ffd700-a000000004310a [Txn] Rollback Txn: a000000004310a 'getRowCountsForProj'
2018-02-12 17:31:49.107 AnalyzeRowCount:7f2cf6ffd700-a000000004310b [Txn] Begin Txn: a000000004310b 'analyze_row_count'
2018-02-12 17:31:49.109 AnalyzeRowCount:7f2cf6ffd700-a000000004310b [VMPI] [commit transaction] - New synced txn catalog version : 1 (Old : 0)
2018-02-12 17:31:49.110 AnalyzeRowCount:7f2cf6ffd700-a000000004310b [Txn] Starting Commit: Txn: a000000004310b 'analyze_row_count' 1805
2018-02-12 17:31:49.110 AnalyzeRowCount:7f2cf6ffd700-a000000004310b [SAL] Adding (txn=a000000004310b,statement=1) to HDFS cancel callback monitoring exclusion list
2018-02-12 17:31:49.111 AnalyzeRowCount:7f2cf6ffd700-a000000004310b [SAL] Removing (txn=a000000004310b,statement=1) from HDFS cancel callback monitoring exclusion list
2018-02-12 17:31:49.111 AnalyzeRowCount:7f2cf6ffd700 [Txn] Commit Complete: Txn: a000000004310b at epoch 0x10a and new global catalog version 1806
2018-02-12 17:31:49.113 AnalyzeRowCount:7f2cf6ffd700 [Command] TMTask: row count analyze - Done
2018-02-12 17:31:49.114 AnalyzeRowCount:7f2cf6ffd700 [Util] Task 'AnalyzeRowCount' enabled
2018-02-12 17:31:50.000 DiskSpaceRefresher:7f2d057fd700 [Util] Task 'DiskSpaceRefresher' enabled
2018-02-12 17:31:50.078 nameless:7f2da1fcb700 [SAL] Running runWatchdog
2018-02-12 17:31:50.078 nameless:7f2da1fcb700 [SAL] Watchdog is checking for idle connections
Thanks Ankit. One more quick question. What are the permissions of the files with that path prefix? Just trying to get enough information to reproduce this on our end.
Well, we use IAM accounts and I have full access to the bucket.
That is why the individual COPY works; I'm not sure why * fails.
Tracy, any update? Were you able to reproduce it?
Not yet. It appears as though we have similar setups in many of our tests, so we are trying to identify what could be making your configuration different, as well as how to best proceed.
Could you check whether your IAM policy includes list permission? Missing list permission is the one case where a single-file read works but a glob does not: for a glob we have to send a list request to S3 first.
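A quick way to separate the two permissions, assuming the awscli on the node uses the same IAM credentials as Vertica (the bucket and key names below are placeholders, not your real ones):
# needs only s3:GetObject - mirrors the single-file COPY
aws s3api get-object --bucket my-bucket --key data/somefile.parq /tmp/somefile.parq
# needs s3:ListBucket - this is what glob expansion requires
aws s3api list-objects-v2 --bucket my-bucket --prefix data/
If the second command fails while the first succeeds, the policy is missing the list permission.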
I checked; I have full access to all the buckets in the system.
Could you use awscli to list the same bucket from the same node? Does that work?
Here you go:
aws s3 ls s3://bucket/data/
2018-02-13 14:40:31 79724994 a44372205c100f5b-4e719f8800000010_1273429912_data.0.parq
2018-02-13 14:40:31 60918026 a44372205c100f5b-4e719f8800000012_835150974_data.0.parq
2018-02-13 14:40:29 41914935 a44372205c100f5b-4e719f8800000015_28682028_data.0.parq
2018-02-13 14:40:29 51045058 a44372205c100f5b-4e719f8800000016_1842451061_data.0.parq
2018-02-13 14:40:29 10284619 a44372205c100f5b-4e719f8800000017_312849041_data.0.parq
2018-02-13 14:40:29 43433695 a44372205c100f5b-4e719f8800000019_726107042_data.0.parq
2018-02-13 14:40:30 73811774 a44372205c100f5b-4e719f880000001b_312700690_data.0.parq
How many nodes do you have in your cluster? Here is how Vertica works when you load from S3 with a glob pattern: we list the files on the initiator and then send the list of files to the executors to read in parallel, which is slightly different from a single-node read.
The next thing I would check is that all the nodes have read permission. The error is Access Denied and it is returned directly from the S3 API, so there must be a permission issue on at least one of the nodes.
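A rough way to check that, assuming awscli is installed and configured with the same credentials on every node (the hostnames here are placeholders for your actual node names):
# run from any node; every node should be able to list the prefix
for node in v_verticadb_node0001 v_verticadb_node0002 v_verticadb_node0003; do
  echo "== $node =="
  ssh $node "aws s3 ls s3://bucket/data/ | head -2"
done
If any node returns an error here, that node's credentials or policy are the likely culprit.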
Hey Ankit - @ankit0007smart, I am new to Vertica. I have a use case to load a .csv file from an AWS S3 bucket into one of my Vertica tables. Is that technically doable? Can we load .csv data into Vertica from AWS S3?
The command I was trying to use is:
copy . SOURCE S3(url='s3:///<filename.csv>');
Please let me know if you have done something similar in the past or are doing it now.
Sure, this is possible, but your comment doesn't include the complete vsql command. Here is an example that I use for testing:
SELECT SET_CONFIG_PARAMETER('AWSRegion','us-east-1');
SELECT AWS_SET_CONFIG('aws_id', 'X');
SELECT AWS_SET_CONFIG('aws_secret', 'X');
CREATE SCHEMA bryanh;
-- create table and copy data into table
CREATE TABLE bryanh.testcsv (testID int, testDate datetime, testInt int, testChar varchar(16));
COPY bryanh.testcsv WITH SOURCE S3(url='S3://bryanh_bucket/bryan_s3_test.csv') DELIMITER ',';
-- create external table: access data stored in S3
CREATE EXTERNAL TABLE bryanh.tests3csv (testID int, testDate datetime, testInt int, testChar varchar(16)) AS COPY FROM 'S3://bryanh_bucket/bryan_s3_test.csv' DELIMITER ',';
If you still have issues, please open a new thread and post the COPY or CREATE EXTERNAL TABLE command as well as any errors that appear in vertica.log on the node where you submit the query. Thanks!
Depending on the version of Vertica you use, you have one or two ways to load data from S3:
1. If you run Vertica 9.1+, you may use the S3 file system, which is the recommended option and the fastest way to load from S3.
2. Use the AWS UDSource as Bryan described above. Just note that in that case you need to set your region by calling aws_set_config('aws_region', 'X'), not SET_CONFIG_PARAMETER('AWSRegion','us-east-1'); the latter sets the region for the S3 file system (option 1). A short sketch of both options follows below.
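For reference, a minimal sketch of the two options side by side, reusing only commands shown earlier in this thread (bucket names, credentials, and tables are placeholders):
-- Option 1: S3 file system (Vertica 9.1+); region is a configuration parameter
SELECT SET_CONFIG_PARAMETER('AWSRegion', 'us-east-1');
COPY poc.hrly_s3 FROM 's3://bucket/data/*' PARQUET;

-- Option 2: AWS UDSource; region and credentials go through aws_set_config
SELECT AWS_SET_CONFIG('aws_region', 'us-east-1');
SELECT AWS_SET_CONFIG('aws_id', 'X');
SELECT AWS_SET_CONFIG('aws_secret', 'X');
COPY bryanh.testcsv WITH SOURCE S3(url='S3://bryanh_bucket/bryan_s3_test.csv') DELIMITER ',';
Note that option 1 also needs AWS credentials configured for the cluster; that part is omitted from this sketch.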