Vertica S3Export - Data Quality issues with 'PARTITION' clause
We are trying to export the data of few huge Vertica tables into files on AWS S3.
S3Export with Partition (BEST / or any other column) turned out to export the files successfully. Though it isn't super fast, considering the volume of data, we felt ok with the performance. However, we realized the files had data quality issues, specifically with 'date' columns. We noticed many records with incorrect values for date columns.
Post some self research, found that - ‘PARTITION BEST’ or Partition on columns isn’t advised to be used on UDTs that aren’t thread safe. 'PARTITION NODES' is the one to be used.
'PARTITION NODES' seem to export all the data accurately, however it only generated one file per node and took much longer time. This also resulted in huge sized files.
As we have further processing requirements (to COPY onto Redshift database) on these exported files, we need them to be much smaller.
Any suggestions on if we could get the S3Export provide the accurate data in multiple smaller sized files?
Best Answer
-
Jim_Knicely - Select Field - Administrator
There is a fix in Vertica 9.1.1-1 (Release on 08/27/2018). VER-63693 - S3. Before this release, S3Export was not thread safe when the data contains time/date values. This means S3Export should not be used with PARTITION BEST when exporting time/date values before this fix.
See:
https://www.vertica.com/docs/ReleaseNotes/9.1.x/Vertica_9.1.x_Release_Notes.htm5
Answers
@Srivalli - Which version of Vertica are you running?
@Srivalli Did you try the offset clause to select small number of rows? You can use it with S3Export.
https://www.vertica.com/docs/9.3.x/HTML/Content/Authoring/SQLReferenceManual/Statements/SELECT/OFFSETClause.htm?zoom_highlight=offset
@Jim_Knicely ,
We have Vertica 9.1.0-1
@SruthiA ,
The issue isn't just when exporting huge volumes of data.
I even tried exporting a small subset of data (6000 records) using (PARTITION BEST) and could still the issue with date type columns.
Note that I did give a try using OFFSET and LIMIT on the small subset of table, the results exported are still inaccurate.