How to specify a rejected rows / exceptions table while saving a DataFrame to Vertica

mt25
edited September 2021 in General Discussion

Hi,

I am trying to capture the erroneous rows for further analysis. Is there a way in the Vertica Spark connector to specify a rejected rows table or exceptions table? If not, what would be the approach if I want to capture the bad records?

I am using Vertica 9.2.
Spark 2.4.0
Scala 2.11.12

Best Answers

  • Jim_Knicely - Administrator
    edited September 2021 Answer ✓

    I will check around, but I do not think there is a way to modify the process.

    The purpose of that temp table is to evaluate the number of rejected rows; the load fails if a specified threshold (set with the failed_rows_percent_tolerance parameter) is reached.

    failed_rows_percent_tolerance = The tolerance level for failed rows, as a percentage. For example, to specify that the job fail if greater than 10% of the rows are rejected, specify this value as 0.10 for 10% tolerance. Default Value: 0.00

    I wonder what would happen if you set failed_rows_percent_tolerance = 1?

    Note that a sample of the rejected rows should be printed to the Spark log file on the relevant Spark node, when applicable.

    And you can also check the DC_ERRORS table after a load. Any errors should appear in there. You can search by transaction and statement ID.
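
    For example, something along these lines (the transaction and statement IDs are placeholders; substitute the ones from your own load):

    SELECT time, transaction_id, statement_id, log_message
    FROM dc_errors
    WHERE transaction_id = 45035996273704976  -- placeholder transaction ID
      AND statement_id = 1                    -- placeholder statement ID
    ORDER BY time DESC;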

  • mt25
    edited September 2021 Answer ✓

    Hi Jim,

    I tried setting failed_rows_percent_tolerance=1, expecting to see the erroneous row in the Spark log, but unfortunately the errored row does not appear there.

    2021-09-06 08:25:02,796 [ERROR] com.reports.common.functions$ - failed to execute Job: loadPlanHeader Error Trace: S2V: FATAL ERROR for job S2V_job7834802174759143388. Job status information is available in the Vertica table schema.S2V_JOB_STATUS_USER_DBADMIN. Unable to create/insert into target table: schema.table_name with SaveMode: Append. ERROR MESSAGE: ERROR: java.sql.SQLDataException: [Vertica]VJDBC ERROR: Datatype mismatch: column 4 in the orc source [hdfs://hostname:8020/user/hadoop/planstream/S2V_job7834802174759143388/part-00005-387744d1-2366-4fe7-80f0-15b3ea555164-c000.snappy.orc] has type STRING, expected timestamp.

    Df.write
      .format(verticaDataSource)
      .options(verticaProperties)
      .option("table", "table_name")
      .option("failed_rows_percent_tolerance", 1.0)
      .mode("append")
      .save()

    It states the cause of the error but not the actual row where the column mismatch occurred.

  • Jim_Knicely - Administrator
    edited September 2021 Answer ✓

    You are going to have to change the data type of the table's column to VARCHAR, or manually transform the ORC STRING to a Vertica TIMESTAMP during the load.

    Example...

    verticademos=> EXPORT TO ORC(directory='/home/dbadmin/orc_data') AS SELECT '2021/09/06 12:00PM' date_string;
     Rows Exported
    ---------------
                 1
    (1 row)
    
    verticademos=> \! ls /home/dbadmin/orc_data
    9f7b9578-v_verticademos_node0001-140026658150144-0.orc
    
    verticademos=> CREATE TABLE load_orc_data (date_timestamp TIMESTAMP);
    CREATE TABLE
    
    verticademos=> COPY load_orc_data FROM '/home/dbadmin/orc_data/*.orc' ORC REJECTED DATA TABLE load_orc_data_bad;
    ERROR 9699:  Datatype mismatch: column "date_timestamp" in the ORC source [/home/dbadmin/orc_data/9f7b9578-v_verticademos_node0001-140026658150144-0.orc] has type STRING, expected timestamp
    

    This type of error (a data coercion failure, where ALL rows fail to load) is not added to the rejected data table.

    But they show up in DC_ERRORS and DC_LOAD_EVENTS:

    verticademos=> SELECT log_message FROM dc_errors ORDER BY time DESC LIMIT 1;
                                                                                       log_message
    ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
     Datatype mismatch: column "date_timestamp" in the ORC source [/home/dbadmin/orc_data/9f7b9578-v_verticademos_node0001-140026658150144-0.orc] has type STRING, expected timestamp
    (1 row)
    
    verticademos=> SELECT event_description FROM dc_load_events WHERE event_type = 'SOURCEABORT' ORDER BY time DESC LIMIT 1;
                                                                                           event_description                                                                           
    -----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
     ERROR 9699:  Datatype mismatch: column "date_timestamp" in the ORC source [/home/dbadmin/orc_data/9f7b9578-v_verticademos_node0001-140026658150144-0.orc] has type STRING, expected timestamp
    (1 row)
    

    You can transform the STRING to a TIMESTAMP using a FILLER:

    verticademos=> COPY load_orc_data (date_timestamp_f FILLER VARCHAR, date_timestamp AS date_timestamp_f::TIMESTAMP) FROM '/home/dbadmin/orc_data/*.orc' ORC REJECTED DATA TABLE load_orc_data_bad;
     Rows Loaded
    -------------
               1
    (1 row)
    
    verticademos=> SELECT * FROM load_orc_data;
       date_timestamp
    ---------------------
     2021-09-06 12:00:00
    (1 row)
    

    See: Transforming Data During Loads
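
    On the Spark side, the equivalent fix would be to cast the string column to a timestamp in the DataFrame before writing, so the ORC files the connector stages already carry a timestamp type. A rough sketch (df is your source DataFrame; the column name and format pattern are only for illustration):

    import org.apache.spark.sql.functions.{col, to_timestamp}

    // Cast the offending string column to TimestampType before the write.
    // "modified_time" and the pattern below are illustrative placeholders.
    val fixedDf = df.withColumn("modified_time",
      to_timestamp(col("modified_time"), "yyyy/MM/dd hh:mma"))

    fixedDf.write
      .format(verticaDataSource)
      .options(verticaProperties)
      .option("table", "table_name")
      .mode("append")
      .save()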

Answers

  • After analyzing the vertica.log, it looks like the connector creates the temporary rejects table, but what is the use of these tables if they are not available to debug? Is there anything that I am missing?

    Entry from the log file:
    COPY "schemaname"."table_name" ("oid","versionId","name","maxLineId","maxDigitalLineId","maxLinkId","primary","secondary","deleted","totalDollars","lastModifiedBy","lastModifiedDate","modifiedTime","tenant","topic","offsets","partition","opCode","batchid","etl_job_id","etl_timestamp") FROM 'hdfs://hostname:8020/user/hadoop/S2V_job5293504704155335249/*.orc' ON ANY NODE orc DIRECT REJECTED DATA AS TABLE "plan_header_version_S2V_job5293504704155335249_REJECTS" NO COMMIT

  • Jim_Knicely - Administrator

    It's the NO COMMIT on the COPY command that is causing Vertica to create a local external temp table vs. a permanent external table.

    Example:

    With the NO COMMIT, the local external temp table doesn't exist after the session ends:

    verticademos=> EXPORT TO ORC (directory='/home/dbadmin/orc_load') AS SELECT 'ABC' c1, NULL c2 FROM dual;
     Rows Exported
    ---------------
                 1
    (1 row)
    
    verticademos=> CREATE TABLE orc_data (c1 VARCHAR, c2 VARCHAR NOT NULL);
    CREATE TABLE
    
    verticademos=> COPY orc_data FROM '/home/dbadmin/orc_load/*.orc' ON ANY NODE ORC REJECTED DATA TABLE orc_data_bad NO COMMIT;
    NOTICE 5795:  COPY is not going to commit.  Creating a TEMPORARY rejection table, which will be dropped at end-of-session
    WARNING 9972:  Some rows were rejected by the orc parser
     Rows Loaded
    -------------
               0
    (1 row)
    
    verticademos=> SELECT file_name, rejected_data, rejected_reason FROM orc_data_bad;
                                       file_name                                   | rejected_data |                    rejected_reason
    -------------------------------------------------------------------------------+---------------+-------------------------------------------------------
     /home/dbadmin/orc_load/24252569-v_verticademos_node0001-140026805483264-0.orc | NULL          | In column 2: Cannot set NULL value in NOT NULL column
    (1 row)
    
    verticademos=> \q
    
    [dbadmin@vertica ~]$ vsql
    Welcome to vsql, the Vertica Analytic Database interactive terminal.
    
    Type:  \h or \? for help with vsql commands
           \g or terminate with semicolon to execute query
           \q to quit
    
    verticademos=> SELECT file_name, rejected_data, rejected_reason FROM orc_data_bad;
    ERROR 4566:  Relation "orc_data_bad" does not exist
    
    

    Without the NO COMMIT, a permanent external table is created instead, and it'll be around until it's dropped!

    [dbadmin@vertica ~]$ vsql
    Welcome to vsql, the Vertica Analytic Database interactive terminal.
    
    Type:  \h or \? for help with vsql commands
           \g or terminate with semicolon to execute query
           \q to quit
    
    verticademos=> COPY orc_data FROM '/home/dbadmin/orc_load/*.orc' ON ANY NODE ORC REJECTED DATA TABLE orc_data_bad;
    WARNING 9972:  Some rows were rejected by the orc parser
     Rows Loaded
    -------------
               0
    (1 row)
    
    verticademos=> SELECT file_name, rejected_data, rejected_reason FROM orc_data_bad;
                                       file_name                                   | rejected_data |                    rejected_reason
    -------------------------------------------------------------------------------+---------------+-------------------------------------------------------
     /home/dbadmin/orc_load/24252569-v_verticademos_node0001-140026805483264-0.orc | NULL          | In column 2: Cannot set NULL value in NOT NULL column
    (1 row)
    
    verticademos=> \q
    
    [dbadmin@vertica ~]$ vsql
    Welcome to vsql, the Vertica Analytic Database interactive terminal.
    
    Type:  \h or \? for help with vsql commands
           \g or terminate with semicolon to execute query
           \q to quit
    
    verticademos=> SELECT file_name, rejected_data, rejected_reason FROM orc_data_bad;
                                       file_name                                   | rejected_data |                    rejected_reason
    -------------------------------------------------------------------------------+---------------+-------------------------------------------------------
     /home/dbadmin/orc_load/24252569-v_verticademos_node0001-140026805483264-0.orc | NULL          | In column 2: Cannot set NULL value in NOT NULL column
    (1 row)
    
  • Thanks, Jim, for your input. Since these COPY commands are being prepared by the Vertica Spark connector, is there any way I can turn the NO COMMIT option off?

  • Hi Jim,

    @Jim_Knicely I have tried increasing the failed rows percentage from 0 to 1 (tried with .1/.3/.33), but I am still not able to see the erroneous row in the Spark log. This is for the scenario where a few rows loaded and a few failed:

    Number of rows_rejected=1. rows_copied=2. failedRowsPercent=0.3333333333333333. user's failed_rows_percent_tolerance=1. passedFaultToleranceTest=true...PASSED. OK to commit to database.
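
    In the meantime, a rough workaround might be to read DC_ERRORS back into Spark over JDBC right after the write and inspect the rejection details there. A sketch (the connection details and the one-hour window are placeholders):

    // Pull recent rejection details from DC_ERRORS for analysis.
    // URL, credentials, and the time window are placeholders.
    val dcErrors = spark.read
      .format("jdbc")
      .option("driver", "com.vertica.jdbc.Driver")
      .option("url", "jdbc:vertica://hostname:5433/dbname")  // placeholder
      .option("user", "dbadmin")                             // placeholder
      .option("password", "*****")                           // placeholder
      .option("query",
        "SELECT time, transaction_id, statement_id, log_message " +
        "FROM dc_errors WHERE time > now() - INTERVAL '1 hour'")
      .load()
    dcErrors.show(false)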

    thanks
