VerticaCopyStream and "REJECTED DATA AS TABLE"
Hello.
I use JDBC VerticaCopyStream to transfer data to the database.
I would like to keep the problem data in the database instead of going back to the server when I upload it.
It is look like "REJECTED DATA AS TABLE" COPY option is good variant for me.
My SQL statement for COPY looks something like
COPY "my_schema"."my_table" (k,v,d) FROM STDIN UNCOMPRESSED REJECTED DATA AS TABLE loader_rejects DELIMITER AS E'\002' NULL AS E'\004' ESCAPE AS E'\001' RECORD TERMINATOR E'\003' DIRECT
Unfortunetely I rejected error Cannot do LOCAL and REJECTED DATA AS TABLE
All stack trace look like here
Cause: com.vertica.support.exceptions.DataException: [Vertica][VJDBC](5767) ERROR: Cannot do LOCAL and REJECTED DATA AS TABLE in the same query; rejected records can only be saved to one location at com.vertica.util.ServerErrorData.buildException(Unknown Source) at com.vertica.io.ProtocolStream.readExpectedMessage(Unknown Source) at com.vertica.io.ProtocolStream.readExpectedMessage(Unknown Source) at com.vertica.dataengine.VQueryExecutor.readCopyStartResponse(Unknown Source) at com.vertica.dataengine.VQueryExecutor.handleExecuteResponse(Unknown Source) at com.vertica.dataengine.VQueryExecutor.execute(Unknown Source) at com.vertica.jdbc.VerticaCopyStream.execute(Unknown Source) at com.vertica.jdbc.VerticaCopyStream.start(Unknown Source)
It's weird, because in my expression for Copy there is no option "LOCAL".
Are there any ways to use JDCB VerticaCopyStream and save bad data to the database?
Best Answers
-
ThePebble Vertica Employee Employee
VerticaCopyStream rewrites the query and adds FROM LOCAL and RETURNREJECTED so that it can support getRejected(). I guess the original designers didn't think anyone would want the rejected rows sent to two places. Plus, VerticaCopyStream was designed before FROM LOCAL and REJECTED DATA AS TABLE could work together. https://www.vertica.com/docs/9.2.x/HTML/JDBC/com/vertica/jdbc/VerticaCopyStream.html
What you could do is create a named pipe (mkfifo) and execute the statement without VerticaCopyStream.
5 -
twall Vertica Employee Employee
So copy local has a few different mechanisms for handling rejection/exception data, and most of them are mutually exclusive to one another. VerticaCopyStream lets you run a COPY t FROM LOCAL STDIN while abstracting some details away, some of that abstraction is a bit leaky unfortunately, which makes things a bit confusing.
COPY t FROM LOCAL 'file.txt' REJECTED DATA 'rejections.txt' EXCEPTIONS 'exceptions.txt'
This writes the rejection/exceptions files to paths on the client.COPY t FROM 'file.txt' REJECTED DATA 'rejections.txt' EXCEPTIONS 'exceptions.txt'
This writes the rejection/exceptions files to paths on the vertica nodes. You may need special storage location permissions to be able to write these files.COPY t FROM [LOCAL] 'file.txt' REJECTED DATA AS TABLE t_rejects
This writes the rejection/exceptions data to a table called t_rejects. It isn't a true vertica table, it is actually an external table (https://www.vertica.com/docs/9.2.x/HTML/Content/Authoring/AdministratorsGuide/Tables/ExternalTables/CreatingExternalTables.htm). Under the covers it is still writing exceptions/rejection files similar to the previous case, except they are in a different format that is more amenable to being used in an external table copy statement definition. For a long time, the LOCAL variant of COPY did not support this method; support for that was added in 9.1.x.When you use STDIN instead of a file path, the client application must provide a data feed to act as the STDIN. For vsql, we can literally read from standard input to provide that information. For a library like JDBC, we instead use VerticaCopyStream, a JDBC extension that lets you provide a java InputStream to act as stdin. JDBC also supports the standard batching APIs, when you write series of batched INSERT INTO t VALUES (?), the driver automatically converts it into an equivalent COPY t FROM LOCAL STDIN, where the ?'s are serialized into a data stream.
There is another option that clients use that is helpful for implementing client API's, but not very useful for applications: RETURNREJECTED
COPY t FROM LOCAL STDIN RETURNREJECTED
This returns changes the format of the rejection data returned to the client. Instead of the rejection/exception data, only the record numbers of the rows in the batch that failed to load are returned.Why would you willingly give up that extra information? Because that is the only information we are capable of serving to applications in a standard format -- the return value of JDBC's PreparedStatement.executeBatch() is an int [] of status codes for each row in the batch. VerticaCopyStream reuses that same infrastructure in order to implement the getRejects() method. It will inject the RETURNJRECTED keyword if no exceptions/rejections clauses are found since that is better than nothing. This is why the javadocs state that this API returns no answers if you provide your own REJECTED DATA/EXCEPTIONS clauses.
Of course, this all means that VerticaCopyStream can only do client-side rejection/exception files or row numbers via getRejects(). Currently it is not aware that servers can do REJECTED DATA AS TABLE with the LOCAL keyword. The client supports all versions of Vertica since 7.1 and not all have that feature. It should be updated to become aware of that, but we haven't done so yet. I've opened a feature request ticket to do so.
So your best path forward is probably one of the following options:
1) If your data is already in java objects, it is usually best to let the driver's batch insert interface do its thing. With that, you'd write an INSERT VALUES(?), bind with PreparedStatement.setXXX(), and the driver will automatically convert it to COPY t FROM LOCAL STDIN and present you rejected record numbers each batch. The driver knows how to serialize java objects properly, so you don't need to worry much about rejections/exceptions. The only reason something could be rejected is for application semantics -- the only I know of is if the data would be truncated because it is too long for the column, or if there is a null constraint violation. 2) If you already have the data in a java InputStream, it is your job to ensure the format of the inputstream parses correctly a. Use VerticaCopyStream, write your own COPY FROM LOCAL STDIN with the right parser settings that align with your InputStream format, and parse client-side files after the operation to handle exceptions b. Ditch VerticaCopyStream entirely, instead writing your InputStream to a named pipe file and then leverage a regular execute API with files, i.e. something like Statement.execute("COPY t from LOCAL '/my/named/pipe' REJECTED DATA AS TABLE t_rejects"). This gives you more control over the syntax, but you aren't allowed to use STDIN since there's no standard way to provide that with the JDBC interface. You have to be careful to avoid deadlocks when doing this -- the driver will read that pipe from the same thread that called execute(), so another thread has to be feeding the pipe data.
5
Answers
ThePebble, thank you for quick answer.
What do you mean?
Is it possible to upload input data via Copy protocol from Java-based application without VerticaCopyStream?
In fact, I don't really need any records in "rejected data" or "exceptions" files. What matters to me is the error message. Is there any way to get them when using VerticaCopyStream without parsing the exceptions file.
So COPY LOCAL has a few different mechanisms for handling rejection/exception data, and most of them are mutually exclusive to one another. VerticaCopyStream lets you run a COPY t FROM LOCAL STDIN while abstracting some details away, but that abstraction is a bit leaky unfortunately, which makes things a little confusing.
COPY t FROM LOCAL 'file.txt' REJECTED DATA 'rejections.txt' EXCEPTIONS 'exceptions.txt'
This writes the rejection/exceptions files to paths on the client.
COPY t FROM 'file.txt' REJECTED DATA 'rejections.txt' EXCEPTIONS 'exceptions.txt'
This writes the rejection/exceptions files to paths on the vertica nodes. You may need special storage location permissions to be able to write these files.
COPY t FROM [LOCAL] 'file.txt' REJECTED DATA AS TABLE t_rejects
This writes the rejection/exceptions data to a table called t_rejects. It isn't a true vertica table, it is actually an external table (https://www.vertica.com/docs/9.2.x/HTML/Content/Authoring/AdministratorsGuide/Tables/ExternalTables/CreatingExternalTables.htm). Under the covers it is still writing exceptions/rejection files similar to the previous case, except they are in a different format that is more amenable to being used in an external table copy statement definition. For a long time, the LOCAL variant of COPY did not support this method; support for that was added in 9.1.x.
When you use STDIN instead of a file path, the client application must provide a data feed to act as the STDIN. For vsql, we can literally read from standard input to provide that information. For a library like JDBC, we instead use VerticaCopyStream, a JDBC extension that lets you provide a java InputStream to act as stdin. JDBC also supports the standard batching APIs, when you write series of batched INSERT INTO t VALUES (?), the driver automatically converts it into an equivalent COPY t FROM LOCAL STDIN, where the ?'s are serialized into a data stream.
There is another option that clients use that is helpful for implementing client API's, but not very useful for applications: RETURNREJECTED
COPY t FROM LOCAL STDIN RETURNREJECTED
This returns changes the format of the rejection data returned to the client. Instead of the rejection/exception data, only the record numbers of the rows in the batch that failed to load are returned.
Why would you willingly give up that extra information? Because that is the only information we are capable of serving to applications in a standard format -- the return value of JDBC's PreparedStatement.executeBatch() is an int [] of status codes for each row in the batch. VerticaCopyStream reuses that same infrastructure in order to implement the getRejects() method. It will inject the RETURNJRECTED keyword if no exceptions/rejections clauses are found since that is better than nothing. This is why the javadocs state that this API returns no answers if you provide your own REJECTED DATA/EXCEPTIONS clauses.
Of course, this all means that VerticaCopyStream can only do client-side rejection/exception files or row numbers via getRejects(). Currently it is not aware that servers can do REJECTED DATA AS TABLE with the LOCAL keyword. The client supports all versions of Vertica since 7.1 and not all have that feature. It should be updated to become aware of that, but we haven't done so yet. I've opened a feature request ticket to do so.
So your best path forward is probably one of the following options:
Thank you so much. I understand the situation and the possibilities that the driver