Vertica S3Export - Data Quality issues with 'PARTITION' clause
We are trying to export the data of a few huge Vertica tables to files on AWS S3.
S3Export with PARTITION BEST (or a partition on any other column) exported the files successfully. It isn't super fast, but considering the volume of data we were OK with the performance. However, we then found that the files had data quality issues, specifically in 'date' columns: many records had incorrect date values.
After some research of our own, we found that 'PARTITION BEST', or partitioning on columns, isn't advised for UDTs that aren't thread safe; 'PARTITION NODES' is the one to use.
'PARTITION NODES' seems to export all the data accurately, but it generated only one file per node and took much longer. It also resulted in huge files.
As we have further processing to do on these exported files (a COPY into a Redshift database), we need them to be much smaller.
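One possible fallback is to split the large files after the export, outside of Vertica. A minimal sketch in Python, assuming the exported files use newline-terminated records and have been copied to local disk; the function name `split_by_lines`, the part naming scheme, and the size limit are hypothetical and not part of S3Export:

```python
import os


def split_by_lines(src_path, out_dir, max_bytes):
    """Split src_path into parts of at most max_bytes each, never cutting a record.

    A single record longer than max_bytes is written to its own part.
    Returns the list of part file paths in order.
    """
    os.makedirs(out_dir, exist_ok=True)
    base = os.path.basename(src_path)
    parts = []
    out = None
    written = 0
    with open(src_path, "rb") as src:
        for line in src:
            # Open a new part when none is open yet, or when adding this
            # record would push the current part past the size limit.
            if out is None or (written > 0 and written + len(line) > max_bytes):
                if out is not None:
                    out.close()
                part_path = os.path.join(out_dir, f"{base}.part{len(parts):05d}")
                out = open(part_path, "wb")
                parts.append(part_path)
                written = 0
            out.write(line)
            written += len(line)
    if out is not None:
        out.close()
    return parts
```

This adds an extra download, split and upload pass over data that is already huge, so it is a workaround rather than a fix.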
Any suggestions on how we could get S3Export to produce accurate data in multiple smaller files?