How to specify a rejected-rows/exceptions table while saving a DataFrame to Vertica?
Hi,
I am trying to capture the erroneous rows for further analysis. Is there a way in the Vertica Spark connector to specify a rejected-rows table or exceptions table? If not, what would be the approach if I want to capture the bad records?
I am using Vertica 9.2
Spark 2.4.0
Scala 2.11.12
Best Answers
Jim_Knicely
Administrator
I will check around, but I do not think there is a way to modify the process.
The purpose of that temp table is to evaluate the number of rejected rows; the job fails if a specified threshold (set with the failed_rows_percent_tolerance parameter) is exceeded.
failed_rows_percent_tolerance = The tolerance level for failed rows, as a percentage. For example, to specify that the job fail if greater than 10% of the rows are rejected, specify this value as 0.10 for 10% tolerance. Default Value: 0.00
I wonder what would happen if you set failed_rows_percent_tolerance = 1?
Note, a sample of the rejected rows should be printed to the Spark log file on the relevant Spark node when applicable.
And you can also check the DC_ERRORS table after a load. Any errors should appear in there. You can search by transaction and statement ID.
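For instance, a lookup along these lines (a sketch; the transaction_id and statement_id values are placeholders you would take from your own load):

```sql
-- Find rejection details for a specific load; substitute the
-- transaction_id and statement_id reported for your COPY.
SELECT time, node_name, log_message
FROM dc_errors
WHERE transaction_id = 45035996273705000  -- placeholder
  AND statement_id = 1                    -- placeholder
ORDER BY time DESC;
```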
mt25
Hi Jim,
I tried setting failed_rows_percent_tolerance=1, expecting to see the erroneous row in the Spark log, but unfortunately the errored row does not appear there.
2021-09-06 08:25:02,796 [ERROR] com.reports.common.functions$ - failed to execute Job: loadPlanHeader Error Trace: S2V: FATAL ERROR for job S2V_job7834802174759143388. Job status information is available in the Vertica table schema.S2V_JOB_STATUS_USER_DBADMIN. Unable to create/insert into target table: schema.table_name with SaveMode: Append. ERROR MESSAGE: ERROR: java.sql.SQLDataException: [Vertica]VJDBC ERROR: Datatype mismatch: column 4 in the orc source [hdfs://hostname:8020/user/hadoop/planstream/S2V_job7834802174759143388/part-00005-387744d1-2366-4fe7-80f0-15b3ea555164-c000.snappy.orc] has type STRING, expected timestamp.
Df.write
.format(verticaDataSource)
.options(verticaProperties)
.option("table", "table_name")
.option("failed_rows_percent_tolerance", 1.0)
.mode("append")
.save()

It says the cause of the error but not the actual row where the column mismatch occurred.
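Since the mismatch is between an ORC STRING column and a Vertica TIMESTAMP column, one option on the Spark side (a sketch; "modifiedTime" is a placeholder for whatever column 4 actually is) is to cast the column before writing, so the intermediate ORC file already carries a timestamp:

```scala
import org.apache.spark.sql.functions.col

// "modifiedTime" is a placeholder for the mismatched column; casting
// the string to Spark's TimestampType means the ORC file the connector
// writes matches the Vertica column type, avoiding the coercion error.
val fixedDf = Df.withColumn("modifiedTime", col("modifiedTime").cast("timestamp"))

fixedDf.write
  .format(verticaDataSource)
  .options(verticaProperties)
  .option("table", "table_name")
  .mode("append")
  .save()
```

Rows where the cast fails become NULL in the DataFrame, so you can also filter for them (e.g. where the cast is null but the original string is not) to capture the bad records before the save.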
Jim_Knicely
Administrator
You are going to have to change the data type of the table's column to VARCHAR, or manually transform the ORC STRING into a Vertica TIMESTAMP during the load.
Example...
verticademos=> EXPORT TO ORC (directory='/home/dbadmin/orc_data') AS SELECT '2021/09/06 12:00PM' date_string;
 Rows Exported
---------------
             1
(1 row)

verticademos=> \! ls /home/dbadmin/orc_data
9f7b9578-v_verticademos_node0001-140026658150144-0.orc

verticademos=> CREATE TABLE load_orc_data (date_timestamp TIMESTAMP);
CREATE TABLE

verticademos=> COPY load_orc_data FROM '/home/dbadmin/orc_data/*.orc' ORC REJECTED DATA TABLE load_orc_data_bad;
ERROR 9699: Datatype mismatch: column "date_timestamp" in the ORC source [/home/dbadmin/orc_data/9f7b9578-v_verticademos_node0001-140026658150144-0.orc] has type STRING, expected timestamp

This type of error (a data-coercion error, where ALL rows will fail to load) is not added to the rejected data table.
But they show up in DC_ERRORS and DC_LOAD_EVENTS:
verticademos=> SELECT log_message FROM dc_errors ORDER BY time DESC LIMIT 1;
 log_message
-------------
 Datatype mismatch: column "date_timestamp" in the ORC source [/home/dbadmin/orc_data/9f7b9578-v_verticademos_node0001-140026658150144-0.orc] has type STRING, expected timestamp
(1 row)

verticademos=> SELECT event_description FROM dc_load_events WHERE event_type = 'SOURCEABORT' ORDER BY time DESC LIMIT 1;
 event_description
-------------------
 ERROR 9699: Datatype mismatch: column "date_timestamp" in the ORC source [/home/dbadmin/orc_data/9f7b9578-v_verticademos_node0001-140026658150144-0.orc] has type STRING, expected timestamp
(1 row)

You can transform the STRING to a TIMESTAMP using a FILLER:
verticademos=> COPY load_orc_data (date_timestamp_f FILLER VARCHAR, date_timestamp AS date_timestamp_f::TIMESTAMP) FROM '/home/dbadmin/orc_data/*.orc' ORC REJECTED DATA TABLE load_orc_data_bad;
 Rows Loaded
-------------
           1
(1 row)

verticademos=> SELECT * FROM load_orc_data;
   date_timestamp
---------------------
 2021-09-06 12:00:00
(1 row)
Answers
After analyzing vertica.log, it looks like the connector creates a temporary rejected-data table, but what is the use if these tables are not available for debugging? Is there anything I am missing?
Entry from the log file:
COPY "schemaname"."table_name" ("oid","versionId","name","maxLineId","maxDigitalLineId","maxLinkId","primary","secondary","deleted","totalDollars","lastModifiedBy","lastModifiedDate","modifiedTime","tenant","topic","offsets","partition","opCode","batchid","etl_job_id","etl_timestamp") FROM 'hdfs://hostname:8020/user/hadoop/S2V_job5293504704155335249/*.orc' ON ANY NODE orc DIRECT REJECTED DATA AS TABLE "plan_header_version_S2V_job5293504704155335249_REJECTS" NO COMMIT
It's the NO COMMIT on the COPY command that is causing Vertica to create a local temporary rejection table vs. a permanent external table.
Example:
With NO COMMIT, the temporary rejection table no longer exists after the session ends:
verticademos=> EXPORT TO ORC (directory='/home/dbadmin/orc_load') AS SELECT 'ABC' c1, NULL c2 FROM dual;
 Rows Exported
---------------
             1
(1 row)

verticademos=> CREATE TABLE orc_data (c1 VARCHAR, c2 VARCHAR NOT NULL);
CREATE TABLE

verticademos=> COPY orc_data FROM '/home/dbadmin/orc_load/*.orc' ON ANY NODE ORC REJECTED DATA TABLE orc_data_bad NO COMMIT;
NOTICE 5795: COPY is not going to commit. Creating a TEMPORARY rejection table, which will be dropped at end-of-session
WARNING 9972: Some rows were rejected by the orc parser
 Rows Loaded
-------------
           0
(1 row)

verticademos=> SELECT file_name, rejected_data, rejected_reason FROM orc_data_bad;
                                   file_name                                   | rejected_data |                    rejected_reason
-------------------------------------------------------------------------------+---------------+-------------------------------------------------------
 /home/dbadmin/orc_load/24252569-v_verticademos_node0001-140026805483264-0.orc | NULL          | In column 2: Cannot set NULL value in NOT NULL column
(1 row)

verticademos=> \q
[dbadmin@vertica ~]$ vsql
Welcome to vsql, the Vertica Analytic Database interactive terminal.
Type:  \h or \? for help with vsql commands
       \g or terminate with semicolon to execute query
       \q to quit
verticademos=> SELECT file_name, rejected_data, rejected_reason FROM orc_data_bad;
ERROR 4566: Relation "orc_data_bad" does not exist

Without the NO COMMIT, a permanent external table is created instead, and it'll be around until it's dropped!
[dbadmin@vertica ~]$ vsql
Welcome to vsql, the Vertica Analytic Database interactive terminal.
Type:  \h or \? for help with vsql commands
       \g or terminate with semicolon to execute query
       \q to quit
verticademos=> COPY orc_data FROM '/home/dbadmin/orc_load/*.orc' ON ANY NODE ORC REJECTED DATA TABLE orc_data_bad;
WARNING 9972: Some rows were rejected by the orc parser
 Rows Loaded
-------------
           0
(1 row)

verticademos=> SELECT file_name, rejected_data, rejected_reason FROM orc_data_bad;
                                   file_name                                   | rejected_data |                    rejected_reason
-------------------------------------------------------------------------------+---------------+-------------------------------------------------------
 /home/dbadmin/orc_load/24252569-v_verticademos_node0001-140026805483264-0.orc | NULL          | In column 2: Cannot set NULL value in NOT NULL column
(1 row)

verticademos=> \q
[dbadmin@vertica ~]$ vsql
Welcome to vsql, the Vertica Analytic Database interactive terminal.
Type:  \h or \? for help with vsql commands
       \g or terminate with semicolon to execute query
       \q to quit
verticademos=> SELECT file_name, rejected_data, rejected_reason FROM orc_data_bad;
                                   file_name                                   | rejected_data |                    rejected_reason
-------------------------------------------------------------------------------+---------------+-------------------------------------------------------
 /home/dbadmin/orc_load/24252569-v_verticademos_node0001-140026805483264-0.orc | NULL          | In column 2: Cannot set NULL value in NOT NULL column
(1 row)

Thanks Jim for your input. Since these COPY commands are being prepared by the Vertica Spark connector, is there any way I can turn the NO COMMIT option off?
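One workaround (a sketch, not a connector feature) is to re-run the connector's COPY from vertica.log yourself without NO COMMIT, assuming the intermediate ORC files are still on HDFS; the rejected-data table then survives the session. The rejects table name below is a placeholder:

```sql
-- Manual re-run of the connector's COPY (statement taken from vertica.log,
-- column list omitted here for brevity; keep it in the real re-run).
-- Without NO COMMIT, the rejected-data table is permanent.
COPY "schemaname"."table_name"
FROM 'hdfs://hostname:8020/user/hadoop/S2V_job5293504704155335249/*.orc' ON ANY NODE
ORC DIRECT
REJECTED DATA AS TABLE "plan_header_rejects";
```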
Hi Jim,
@Jim_Knicely I have tried increasing the failed rows percentage from 0 to 1 (tried .1/.3/.33), but I am still not able to see the erroneous row in the Spark log. This is for the scenario where a few rows loaded and a few failed:
Number of rows_rejected=1. rows_copied=2. failedRowsPercent=0.3333333333333333. user's failed_rows_percent_tolerance=1. passedFaultToleranceTest=true...PASSED. OK to commit to database.

Thanks