How to specify a rejected rows / exceptions table while saving a DataFrame to Vertica?
Hi,
I am trying to capture the erroneous rows for further analysis. Is there a way in the Vertica Spark connector to specify a rejected rows table or an exceptions table? If not, what approach could I take to capture the bad records?
I am using Vertica 9.2, Spark 2.4.0, and Scala 2.11.12.
Best Answers
Jim_Knicely - Administrator
I will check around, but I do not think there is a way to modify the process.
The purpose of that temp table is to evaluate the number of rejected rows; the job fails if a specified threshold (set with the failed_rows_percent_tolerance parameter) is reached.
failed_rows_percent_tolerance: The tolerance level for failed rows, as a percentage. For example, to specify that the job fail if greater than 10% of the rows are rejected, set this value to 0.10. Default value: 0.00.
I wonder what would happen if you set failed_rows_percent_tolerance = 1?
Note, a sample of the rejected rows should be printed to the Spark log file on the relevant Spark node when applicable.
And you can also check the DC_ERRORS table after a load. Any errors should appear in there. You can search by transaction and statement id.
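A minimal sketch of that DC_ERRORS lookup (the transaction and statement ids below are illustrative; substitute the ones for your load, e.g. from vertica.log):

SELECT time, node_name, log_message
FROM dc_errors
WHERE transaction_id = 45035996273705000  -- illustrative id
  AND statement_id = 1                    -- illustrative id
ORDER BY time DESC;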
mt25
Hi Jim,
I tried setting failed_rows_percent_tolerance=1, expecting to see the erroneous row in the Spark log, but unfortunately the errored row does not show up there.
2021-09-06 08:25:02,796 [ERROR] com.reports.common.functions$ - failed to execute Job: loadPlanHeader Error Trace: S2V: FATAL ERROR for job S2V_job7834802174759143388. Job status information is available in the Vertica table schema.S2V_JOB_STATUS_USER_DBADMIN. Unable to create/insert into target table: schema.table_name with SaveMode: Append. ERROR MESSAGE: ERROR: java.sql.SQLDataException: [Vertica]VJDBC ERROR: Datatype mismatch: column 4 in the orc source [hdfs://hostname:8020/user/hadoop/planstream/S2V_job7834802174759143388/part-00005-387744d1-2366-4fe7-80f0-15b3ea555164-c000.snappy.orc] has type STRING, expected timestamp.
Df.write
  .format(verticaDataSource)
  .options(verticaProperties)
  .option("table", "table_name")
  .option("failed_rows_percent_tolerance", 1.0)
  .mode("append")
  .save()

It reports the cause of the error, but not the actual row where the column mismatch occurred.
Jim_Knicely - Administrator
You are going to have to change the data type of the table's column to VARCHAR, or manually transform the ORC STRING to a Vertica TIMESTAMP during the load.
Example...
verticademos=> EXPORT TO ORC(directory='/home/dbadmin/orc_data') AS SELECT '2021/09/06 12:00PM' date_string;
 Rows Exported
---------------
             1
(1 row)

verticademos=> \! ls /home/dbadmin/orc_data
9f7b9578-v_verticademos_node0001-140026658150144-0.orc

verticademos=> CREATE TABLE load_orc_data (date_timestamp TIMESTAMP);
CREATE TABLE

verticademos=> COPY load_orc_data FROM '/home/dbadmin/orc_data/*.orc' ORC REJECTED DATA TABLE load_orc_data_bad;
ERROR 9699: Datatype mismatch: column "date_timestamp" in the ORC source [/home/dbadmin/orc_data/9f7b9578-v_verticademos_node0001-140026658150144-0.orc] has type STRING, expected timestamp
This type of error (a data coercion failure, where ALL rows would fail to load) is not added to the rejected data table.
But it does show up in DC_ERRORS and DC_LOAD_EVENTS:
verticademos=> SELECT log_message FROM dc_errors ORDER BY time DESC LIMIT 1;
 log_message
----------------------------------------------------------------------------------------------------
 Datatype mismatch: column "date_timestamp" in the ORC source [/home/dbadmin/orc_data/9f7b9578-v_verticademos_node0001-140026658150144-0.orc] has type STRING, expected timestamp
(1 rows)

verticademos=> SELECT event_description FROM dc_load_events WHERE event_type = 'SOURCEABORT' ORDER BY time DESC LIMIT 1;
 event_description
----------------------------------------------------------------------------------------------------
 ERROR 9699: Datatype mismatch: column "date_timestamp" in the ORC source [/home/dbadmin/orc_data/9f7b9578-v_verticademos_node0001-140026658150144-0.orc] has type STRING, expected timestamp
(1 row)
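To tie those DC_ERRORS rows back to a particular Spark connector job, one option (a sketch, relying on the S2V job name being embedded in the source file path inside log_message, as in the error trace earlier in this thread):

SELECT time, node_name, log_message
FROM dc_errors
WHERE log_message ILIKE '%S2V_job7834802174759143388%'  -- job name from your Spark log
ORDER BY time DESC;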
You can transform the STRING to a TIMESTAMP using a FILLER:
verticademos=> COPY load_orc_data (date_timestamp_f FILLER VARCHAR, date_timestamp AS date_timestamp_f::TIMESTAMP) FROM '/home/dbadmin/orc_data/*.orc' ORC REJECTED DATA TABLE load_orc_data_bad;
 Rows Loaded
-------------
           1
(1 row)

verticademos=> SELECT * FROM load_orc_data;
   date_timestamp
---------------------
 2021-09-06 12:00:00
(1 row)
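Once rows do land in a rejected data table, they can be inspected directly. A minimal sketch (assuming the standard rejected-data-table columns, including rejected_data and rejected_reason; check your table's actual definition):

SELECT rejected_data, rejected_reason
FROM load_orc_data_bad;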
Answers
After analyzing the vertica.log, it looks like the connector creates a temporary rejects table, but what is the use if these tables are not available for debugging? Is there anything that I am missing?
Entry from the log file:
COPY "schemaname"."table_name" ("oid","versionId","name","maxLineId","maxDigitalLineId","maxLinkId","primary","secondary","deleted","totalDollars","lastModifiedBy","lastModifiedDate","modifiedTime","tenant","topic","offsets","partition","opCode","batchid","etl_job_id","etl_timestamp") FROM 'hdfs://hostname:8020/user/hadoop/S2V_job5293504704155335249/*.orc' ON ANY NODE orc DIRECT REJECTED DATA AS TABLE "plan_header_version_S2V_job5293504704155335249_REJECTS" NO COMMIT
It's the NO COMMIT on the COPY command that is causing Vertica to create a local external temp table vs. a permanent external table.
Example:
With the NO COMMIT, the local external temp table doesn't exist after the session ends. Without the NO COMMIT, a permanent external table is created instead, and it'll be around until it's dropped! See the sketch below.
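A minimal sketch of the difference (illustrative; it reuses the load_orc_data example from above, and the point is the rejects table's lifetime rather than its contents):

-- Session 1: with NO COMMIT (what the connector generates), the rejects table
-- is created as a local temp table
COPY load_orc_data (date_timestamp_f FILLER VARCHAR, date_timestamp AS date_timestamp_f::TIMESTAMP)
  FROM '/home/dbadmin/orc_data/*.orc' ORC
  REJECTED DATA AS TABLE load_orc_data_bad NO COMMIT;
SELECT * FROM load_orc_data_bad;  -- works here, inside the same session
-- ...disconnect and reconnect...
SELECT * FROM load_orc_data_bad;  -- now fails: the temp table is gone

-- Session 2: without NO COMMIT, the rejects table is permanent
COPY load_orc_data (date_timestamp_f FILLER VARCHAR, date_timestamp AS date_timestamp_f::TIMESTAMP)
  FROM '/home/dbadmin/orc_data/*.orc' ORC
  REJECTED DATA AS TABLE load_orc_data_bad;
SELECT * FROM load_orc_data_bad;  -- still there after reconnecting, until dropped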
Thanks, Jim, for your input. Since these COPY commands are being prepared by the Vertica Spark connector, is there any way I can turn the NO COMMIT option off?
Hi Jim,
@Jim_Knicely I have tried increasing the failed rows percentage from 0 to 1 (tried with .1/.3/.33), but I am still not able to see the erroneous row in the Spark log. This is for the scenario where a few rows loaded and a few failed:
Number of rows_rejected=1. rows_copied=2. failedRowsPercent=0.3333333333333333. user's failed_rows_percent_tolerance=1. passedFaultToleranceTest=true...PASSED. OK to commit to database.
Thanks.