Corrupt PARQUET dump when dumping over 2^31 rows
nicolaerosia
Vertica Customer
Vertica v10.1.1-7 and v11.0.0-0 (the versions I have tried) have a bug when exporting a Parquet file with more than 2^31/2^32 rows: the num_rows value written to the file is invalid.
When I try to load the parquet file, Vertica reports that 0 rows were loaded.
If I inspect the parquet with parquet-tools, I get a negative number of rows, see https://github.com/ktrueda/parquet-tools/issues/18
Output for a v10.1.1-7 dump:
~/.local/bin/parquet-tools inspect ./dump-1.parquet
############ file meta data ############
created_by: parquet-cpp version 1.0.0-v10.1.1 (build Vertica Analytic Database)
num_columns: 3
num_rows: -1803805480
num_row_groups: 1
format_version: 1.0
serialized_size: 329
Output for a v11.0.0-0 dump:
~/.local/bin/parquet-tools inspect ./dump-2parquet
############ file meta data ############
created_by: parquet-cpp version 1.0.0-v11.0.0 (build Vertica Analytic Database)
num_columns: 3
num_rows: -1766273362
num_row_groups: 1
format_version: 1.0
serialized_size: 328
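For readers trying to interpret these negative values: if the count wrapped around a signed 32-bit integer (an assumption; nothing in the output itself proves it), the real row count would be the reported value plus some multiple of 2^32. A minimal Python sketch of that arithmetic, where the helper name candidate_row_counts is purely illustrative:

```python
# Assumption: num_rows was truncated to a signed 32-bit integer on write.
# Under that assumption the true count is reported + k * 2**32 for some k >= 1.
INT32_SPAN = 2**32


def candidate_row_counts(reported, wraps=(1, 2)):
    """Return row counts that would wrap to `reported` as a signed int32."""
    return [reported + k * INT32_SPAN for k in wraps]


for reported in (-1803805480, -1766273362):
    print(reported, candidate_row_counts(reported))
```

Comparing the candidates against SELECT COUNT(*) on the source table would show which one (if any) matches.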
Seems like you're using a really old parquet-cpp, maybe bump to a newer version? https://github.com/apache/parquet-cpp/releases
Comments
@nicolaerosia : Thank you for the update. Did you run the export to Parquet against HDFS or a local directory? Could you please share your EXPORT TO PARQUET SQL statement?
Hello, sorry for the delay.
I ran the export to local, e.g. EXPORT TO PARQUET (directory='/mnt/aaa', compression='ZSTD') AS SELECT * FROM mytable;
Any feedback?
@nicolaerosia : I tried to run the test. It created 3 Parquet files for me. Vertica provides a built-in function called GET_METADATA, and all 3 files show positive row counts. I was able to copy the data back into a new table. However, the copied row count is not the same as the source table's. I will check on that part.
dbadmin=> select version();
version
Vertica Analytic Database v10.1.1-6
(1 row)
dbadmin=>
dbadmin=> EXPORT TO PARQUET (directory='/home/dbadmin/dumpdata', compression='ZSTD') AS SELECT * FROM test_export;
Rows Exported
(1 row)
cd dumpdata/
[dbadmin@anamula1 dumpdata]$ ls
73df61a8-v_perf_gooddata_node0001-140694953912064-0.parquet 8099ace5-v_perf_gooddata_node0001-140693523654400-0.parquet d48d831d-v_perf_gooddata_node0001-140688549205760-0.parquet
dbadmin=> SELECT GET_METADATA('/home/dbadmin/dumpdata/73df61a8-v_perf_gooddata_node0001-140694953912064-0.parquet');
GET_METADATA
schema:
message schema {
optional int64 i (Int(bitWidth=64, isSigned=true));
}
metadata:
{
"FileName": "/home/dbadmin/dumpdata/73df61a8-v_perf_gooddata_node0001-140694953912064-0.parquet",
"FileFormat": "Parquet",
"Version": "1.0",
"CreatedBy": "parquet-cpp version 1.0.0-v10.1.1 (build Vertica Analytic Database)",
"TotalRows": "474794880",
"NumberOfRowGroups": "1",
"NumberOfRealColumns": "1",
"NumberOfColumns": "1",
"Columns": [
{ "Id": "0", "Name": "i", "PhysicalType": "INT64", "ConvertedType": "INT_64", "LogicalType": {"Type": "Int", "bitWidth": 64, "isSigned": true} }
],
"RowGroups": [
{
"Id": "0", "TotalBytes": "4633135", "Rows": "474794880",
"ColumnChunks": [
{"Id": "0", "Values": "474794880", "StatsSet": "True", "Stats": {"NumNulls": "0", "DistinctValues": "0", "Max": "1048576", "Min": "24" },
"Compression": "ZSTD", "Encodings": "PLAIN_DICTIONARY PLAIN RLE PLAIN ", "UncompressedSize": "3797901894", "CompressedSize": "4633135" }
]
}
]
}
(1 row)
dbadmin=> SELECT GET_METADATA('/home/dbadmin/dumpdata/8099ace5-v_perf_gooddata_node0001-140693523654400-0.parquet');
GET_METADATA
schema:
message schema {
optional int64 i (Int(bitWidth=64, isSigned=true));
}
metadata:
{
"FileName": "/home/dbadmin/dumpdata/8099ace5-v_perf_gooddata_node0001-140693523654400-0.parquet",
"FileFormat": "Parquet",
"Version": "1.0",
"CreatedBy": "parquet-cpp version 1.0.0-v10.1.1 (build Vertica Analytic Database)",
"TotalRows": "480386817",
"NumberOfRowGroups": "1",
"NumberOfRealColumns": "1",
"NumberOfColumns": "1",
"Columns": [
{ "Id": "0", "Name": "i", "PhysicalType": "INT64", "ConvertedType": "INT_64", "LogicalType": {"Type": "Int", "bitWidth": 64, "isSigned": true} }
],
"RowGroups": [
{
"Id": "0", "TotalBytes": "4667222", "Rows": "480386817",
"ColumnChunks": [
{"Id": "0", "Values": "480386817", "StatsSet": "True", "Stats": {"NumNulls": "0", "DistinctValues": "0", "Max": "1048576", "Min": "0" },
"Compression": "ZSTD", "Encodings": "PLAIN_DICTIONARY PLAIN RLE PLAIN ", "UncompressedSize": "3842589855", "CompressedSize": "4667222" }
]
}
]
}
(1 row)
dbadmin=> SELECT GET_METADATA('/home/dbadmin/dumpdata/d48d831d-v_perf_gooddata_node0001-140688549205760-0.parquet');
GET_METADATA
schema:
message schema {
optional int64 i (Int(bitWidth=64, isSigned=true));
}
metadata:
{
"FileName": "/home/dbadmin/dumpdata/d48d831d-v_perf_gooddata_node0001-140688549205760-0.parquet",
"FileFormat": "Parquet",
"Version": "1.0",
"CreatedBy": "parquet-cpp version 1.0.0-v10.1.1 (build Vertica Analytic Database)",
"TotalRows": "474813823",
"NumberOfRowGroups": "1",
"NumberOfRealColumns": "1",
"NumberOfColumns": "1",
"Columns": [
{ "Id": "0", "Name": "i", "PhysicalType": "INT64", "ConvertedType": "INT_64", "LogicalType": {"Type": "Int", "bitWidth": 64, "isSigned": true} }
],
"RowGroups": [
{
"Id": "0", "TotalBytes": "4554914", "Rows": "474813823",
"ColumnChunks": [
{"Id": "0", "Values": "474813823", "StatsSet": "True", "Stats": {"NumNulls": "0", "DistinctValues": "0", "Max": "1048576", "Min": "18" },
"Compression": "ZSTD", "Encodings": "PLAIN_DICTIONARY PLAIN RLE PLAIN ", "UncompressedSize": "3798335893", "CompressedSize": "4554914" }
]
}
]
}
dbadmin=> create table test_export_load(i int);
CREATE TABLE
dbadmin=> copy test_export_load from '/home/dbadmin/dumpdata/*' parquet;
Rows Loaded
1429995520
(1 row)
dbadmin=>
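A quick cross-check of the numbers above, as a hedged Python sketch: the three TotalRows values reported by GET_METADATA add up to the Rows Loaded figure from COPY, and each file individually stays below the signed 32-bit maximum. The dictionary keys are just the filename prefixes from the listing, and the 2^31 - 1 limit is an assumption about where the problem starts.

```python
# TotalRows per exported file, copied from the GET_METADATA output above.
total_rows = {
    "73df61a8": 474794880,
    "8099ace5": 480386817,
    "d48d831d": 474813823,
}
rows_loaded = 1429995520      # "Rows Loaded" reported by COPY above
INT32_MAX = 2**31 - 1         # assumed threshold for the bad row count

total = sum(total_rows.values())
print("sum of TotalRows:", total)
print("matches Rows Loaded:", total == rows_loaded)
print("every file under 2^31 rows:",
      all(n <= INT32_MAX for n in total_rows.values()))
```

This only shows that the file metadata and the load agree with each other; it says nothing about whether the sum matches the source table.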
If I run the test on a single-node cluster, then I can see the negative row count in the Parquet file. Are you running it on a single-node cluster?
dbadmin=> select get_metadata('/home/dbadmin/parquet_export/3a7ffa6d-v_testparexport_node0001-139764795692800.parquet' );
get_metadata
schema:
message schema {
optional int64 i (Int(bitWidth=64, isSigned=true));
}
metadata:
{
"FileName": "/home/dbadmin/parquet_export/3a7ffa6d-v_testparexport_node0001-139764795692800.parquet",
"FileFormat": "Parquet",
"Version": "1.0",
"CreatedBy": "parquet-cpp version 1.0.0-v10.1.1 (build Vertica Analytic Database)",
"TotalRows": "-2147483648",
"NumberOfRowGroups": "1",
"NumberOfRealColumns": "1",
"NumberOfColumns": "1",
"Columns": [
{ "Id": "0", "Name": "i", "PhysicalType": "INT64", "ConvertedType": "INT_64", "LogicalType": {"Type": "Int", "bitWidth": 64, "isSigned": true} }
],
"RowGroups": [
{
"Id": "0", "TotalBytes": "26859100", "Rows": "-2147483648",
"ColumnChunks": [
{"Id": "0", "Values": "2147483648", "StatsSet": "False",
"Compression": "ZSTD", "Encodings": "PLAIN_DICTIONARY PLAIN RLE PLAIN ", "UncompressedSize": "17181377050", "CompressedSize": "26859100" }
]
}
]
}
(1 row)
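In this output the column chunk reports "Values": "2147483648" while the row group and file report "Rows"/"TotalRows": "-2147483648". That pair is what one would expect if a count of 2^31 were narrowed to a signed 32-bit integer somewhere on the write path. The thread does not state the root cause, so treat the following Python sketch as an illustration of that assumption only; as_int32 is a hypothetical helper, not a Vertica or Parquet function.

```python
def as_int32(n):
    """Reinterpret an integer as a signed 32-bit value (two's complement)."""
    return (n + 2**31) % 2**32 - 2**31


values = 2147483648           # "Values" from the column chunk above
print(as_int32(values))       # same count after narrowing to int32
print(as_int32(2**31 - 1))    # largest count that survives unchanged
print(as_int32(474794880))    # a per-file count from the multi-node export
```

Counts up to 2^31 - 1 pass through unchanged, which would fit the earlier multi-node export (about 475 million rows per file) showing positive row counts.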
@nicolaerosia : I have raised a bug with engineering; the bug number is VER-78969. If you need more information about it, please raise a support case so that I can provide more details.
Yes, I'm running a single node vertica, thank you.
@SruthiA I see VER-78969 is mentioned as fixed in 11.0.1 - I will give it a try!
@nicolaerosia : Thank you for reporting the bug. Yes, it has been fixed in 11.0.1 and is being backported to 10.1.1. I was planning to update this thread once the 10.1.1 hotfix is released, since you mentioned both 10.1.1 and 11.0 in the description of the issue.
@nicolaerosia : The fix has been backported to 10.1.1 and it has been released.
https://www.vertica.com/docs/ReleaseNotes/10.1.x/Vertica_10.1.x_Release_Notes.htm#10.1.1-10
@SruthiA thank you