
Corrupt PARQUET dump when dumping over 2^31 rows

Vertica v10.1.1-7 and v11.0.0-0 (the versions I have tried) have a bug when exporting a Parquet file with more than 2^31 or 2^32 rows: the file is written with an invalid num_rows.

When I try to load the parquet file, Vertica reports that 0 rows were loaded.
If I inspect the parquet with parquet-tools, I get a negative number of rows, see https://github.com/ktrueda/parquet-tools/issues/18

Output for a v10.1.1-7 dump:

~/.local/bin/parquet-tools inspect ./dump-1.parquet 

############ file meta data ############
created_by: parquet-cpp version 1.0.0-v10.1.1 (build Vertica Analytic Database)
num_columns: 3
num_rows: -1803805480
num_row_groups: 1
format_version: 1.0
serialized_size: 329

Output for a v11.0.0-0 dump:

~/.local/bin/parquet-tools inspect ./dump-2parquet 

############ file meta data ############
created_by: parquet-cpp version 1.0.0-v11.0.0 (build Vertica Analytic Database)
num_columns: 3
num_rows: -1766273362
num_row_groups: 1
format_version: 1.0
serialized_size: 328
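
A negative num_rows is what one would expect if the row count were truncated to a signed 32-bit integer somewhere in the writer. That is an assumption, not something the output above proves. Under that assumption, the real row count is the reported value plus some multiple of 2^32. A minimal Python sketch (the helper name `candidate_row_counts` is hypothetical) lists the smallest candidates for the two dumps:

```python
# Assumption: the row count was truncated to a signed 32-bit integer.
# If so, the true count equals the reported value plus k * 2**32, k >= 1.
INT32_MODULUS = 2**32


def candidate_row_counts(reported, max_wraps=2):
    """Return row counts that would wrap to `reported` as a signed 32-bit int."""
    return [reported + k * INT32_MODULUS for k in range(1, max_wraps + 1)]


# num_rows values reported by parquet-tools for the two dumps
for reported in (-1803805480, -1766273362):
    print(reported, candidate_row_counts(reported))
```

Comparing these candidates with `SELECT COUNT(*)` on the source table would show whether the wraparound assumption holds.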

Seems like you're using a really old parquet-cpp, maybe bump to a newer version? https://github.com/apache/parquet-cpp/releases

Comments

  • SruthiASruthiA Employee

    @nicolaerosia : Thank you for the update. Did you run the export to parquet to HDFS or to a local directory? Could you please share your EXPORT TO PARQUET SQL statement?

  • Hello, sorry for the delay.
    I ran the export to local, e.g. EXPORT TO PARQUET (directory='/mnt/aaa', compression='ZSTD') AS SELECT * FROM mytable;
    Any feedback?

  • @nicolaerosia : I tried to run the test. It created 3 parquet files for me. Vertica provides a built-in function called GET_METADATA, and all 3 files show positive row counts. I was able to copy the data back to a new table. However, the copied row count is not the same as the source table's. I will check on that part.

    dbadmin=> select version();

    version

    Vertica Analytic Database v10.1.1-6
    (1 row)

    dbadmin=>

    dbadmin=> EXPORT TO PARQUET (directory='/home/dbadmin/dumpdata', compression='ZSTD') AS SELECT * FROM test_export;

    Rows Exported

    4294971392
    

    (1 row)

    cd dumpdata/
    [[email protected] dumpdata]$ ls
    73df61a8-v_perf_gooddata_node0001-140694953912064-0.parquet 8099ace5-v_perf_gooddata_node0001-140693523654400-0.parquet d48d831d-v_perf_gooddata_node0001-140688549205760-0.parquet

    dbadmin=> SELECT GET_METADATA('/home/dbadmin/dumpdata/73df61a8-v_perf_gooddata_node0001-140694953912064-0.parquet');

    GET_METADATA

    schema:
    message schema {
    optional int64 i (Int(bitWidth=64, isSigned=true));
    }

    metadata:
    {
    "FileName": "/home/dbadmin/dumpdata/73df61a8-v_perf_gooddata_node0001-140694953912064-0.parquet",
    "FileFormat": "Parquet",
    "Version": "1.0",
    "CreatedBy": "parquet-cpp version 1.0.0-v10.1.1 (build Vertica Analytic Database)",
    "TotalRows": "474794880",
    "NumberOfRowGroups": "1",
    "NumberOfRealColumns": "1",
    "NumberOfColumns": "1",
    "Columns": [
    { "Id": "0", "Name": "i", "PhysicalType": "INT64", "ConvertedType": "INT_64", "LogicalType": {"Type": "Int", "bitWidth": 64, "isSigned": true} }
    ],
    "RowGroups": [
    {
    "Id": "0", "TotalBytes": "4633135", "Rows": "474794880",
    "ColumnChunks": [
    {"Id": "0", "Values": "474794880", "StatsSet": "True", "Stats": {"NumNulls": "0", "DistinctValues": "0", "Max": "1048576", "Min": "24" },
    "Compression": "ZSTD", "Encodings": "PLAIN_DICTIONARY PLAIN RLE PLAIN ", "UncompressedSize": "3797901894", "CompressedSize": "4633135" }
    ]
    }
    ]
    }

    (1 row)

    dbadmin=> SELECT GET_METADATA('/home/dbadmin/dumpdata/8099ace5-v_perf_gooddata_node0001-140693523654400-0.parquet');

    GET_METADATA

    schema:
    message schema {
    optional int64 i (Int(bitWidth=64, isSigned=true));
    }

    metadata:
    {
    "FileName": "/home/dbadmin/dumpdata/8099ace5-v_perf_gooddata_node0001-140693523654400-0.parquet",
    "FileFormat": "Parquet",
    "Version": "1.0",
    "CreatedBy": "parquet-cpp version 1.0.0-v10.1.1 (build Vertica Analytic Database)",
    "TotalRows": "480386817",
    "NumberOfRowGroups": "1",
    "NumberOfRealColumns": "1",
    "NumberOfColumns": "1",
    "Columns": [
    { "Id": "0", "Name": "i", "PhysicalType": "INT64", "ConvertedType": "INT_64", "LogicalType": {"Type": "Int", "bitWidth": 64, "isSigned": true} }
    ],
    "RowGroups": [
    {
    "Id": "0", "TotalBytes": "4667222", "Rows": "480386817",
    "ColumnChunks": [
    {"Id": "0", "Values": "480386817", "StatsSet": "True", "Stats": {"NumNulls": "0", "DistinctValues": "0", "Max": "1048576", "Min": "0" },
    "Compression": "ZSTD", "Encodings": "PLAIN_DICTIONARY PLAIN RLE PLAIN ", "UncompressedSize": "3842589855", "CompressedSize": "4667222" }
    ]
    }
    ]
    }

    (1 row)

    dbadmin=> SELECT GET_METADATA('/home/dbadmin/dumpdata/d48d831d-v_perf_gooddata_node0001-140688549205760-0.parquet');

    GET_METADATA

    schema:
    message schema {
    optional int64 i (Int(bitWidth=64, isSigned=true));
    }

    metadata:
    {
    "FileName": "/home/dbadmin/dumpdata/d48d831d-v_perf_gooddata_node0001-140688549205760-0.parquet",
    "FileFormat": "Parquet",
    "Version": "1.0",
    "CreatedBy": "parquet-cpp version 1.0.0-v10.1.1 (build Vertica Analytic Database)",
    "TotalRows": "474813823",
    "NumberOfRowGroups": "1",
    "NumberOfRealColumns": "1",
    "NumberOfColumns": "1",
    "Columns": [
    { "Id": "0", "Name": "i", "PhysicalType": "INT64", "ConvertedType": "INT_64", "LogicalType": {"Type": "Int", "bitWidth": 64, "isSigned": true} }
    ],
    "RowGroups": [
    {
    "Id": "0", "TotalBytes": "4554914", "Rows": "474813823",
    "ColumnChunks": [
    {"Id": "0", "Values": "474813823", "StatsSet": "True", "Stats": {"NumNulls": "0", "DistinctValues": "0", "Max": "1048576", "Min": "18" },
    "Compression": "ZSTD", "Encodings": "PLAIN_DICTIONARY PLAIN RLE PLAIN ", "UncompressedSize": "3798335893", "CompressedSize": "4554914" }
    ]
    }
    ]
    }

    dbadmin=> create table test_export_load(i int);
    CREATE TABLE

    dbadmin=> copy test_export_load from '/home/dbadmin/dumpdata/*' parquet;

    Rows Loaded

    1429995520
    (1 row)

    dbadmin=>
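
The numbers in this test can be cross-checked with a little arithmetic. The Python sketch below uses only values copied from the output above. It shows that the three TotalRows values add up exactly to Rows Loaded, so the COPY read everything the file metadata declares, while the metadata itself accounts for far fewer rows than Rows Exported.

```python
# Values copied from the vsql output above.
rows_exported = 4294971392
rows_loaded = 1429995520

# "TotalRows" reported by GET_METADATA for each exported file,
# keyed by the first component of the file name.
total_rows_per_file = {
    "73df61a8": 474794880,
    "8099ace5": 480386817,
    "d48d831d": 474813823,
}

total_in_files = sum(total_rows_per_file.values())

print(total_in_files == rows_loaded)    # True: metadata total matches Rows Loaded
print(rows_exported - total_in_files)   # 2864975872 rows unaccounted for
print(rows_exported - 2**32)            # 4096: the export is just over 2^32 rows
```

This does not say where the missing rows went. It only narrows the discrepancy down to the export side rather than the load side.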

  • If I run the test on a single node cluster, then I can see the negative row count in the parquet file. Are you running it on a single node cluster?

    dbadmin=> select get_metadata('/home/dbadmin/parquet_export/3a7ffa6d-v_testparexport_node0001-139764795692800.parquet' );

    get_metadata

    schema:
    message schema {
    optional int64 i (Int(bitWidth=64, isSigned=true));
    }

    metadata:
    {
    "FileName": "/home/dbadmin/parquet_export/3a7ffa6d-v_testparexport_node0001-139764795692800.parquet",
    "FileFormat": "Parquet",
    "Version": "1.0",
    "CreatedBy": "parquet-cpp version 1.0.0-v10.1.1 (build Vertica Analytic Database)",
    "TotalRows": "-2147483648",
    "NumberOfRowGroups": "1",
    "NumberOfRealColumns": "1",
    "NumberOfColumns": "1",
    "Columns": [
    { "Id": "0", "Name": "i", "PhysicalType": "INT64", "ConvertedType": "INT_64", "LogicalType": {"Type": "Int", "bitWidth": 64, "isSigned": true} }
    ],
    "RowGroups": [
    {
    "Id": "0", "TotalBytes": "26859100", "Rows": "-2147483648",
    "ColumnChunks": [
    {"Id": "0", "Values": "2147483648", "StatsSet": "False",
    "Compression": "ZSTD", "Encodings": "PLAIN_DICTIONARY PLAIN RLE PLAIN ", "UncompressedSize": "17181377050", "CompressedSize": "26859100" }
    ]
    }
    ]
    }

    (1 row)
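
In this single-node output the column chunk reports "Values": "2147483648" while the row group reports "Rows": "-2147483648". That pair is consistent with the row count passing through a signed 32-bit integer, though the output alone does not confirm where that happens. A short Python sketch of the conversion (the helper `to_int32` is hypothetical and only mimics two's-complement truncation):

```python
def to_int32(n):
    """Truncate an integer to a signed 32-bit value (two's complement)."""
    return (n + 2**31) % 2**32 - 2**31


values_in_column_chunk = 2147483648   # "Values" from the metadata above
rows_in_row_group = -2147483648       # "Rows" from the metadata above

print(to_int32(values_in_column_chunk) == rows_in_row_group)  # True
print(to_int32(2**31 - 1))  # 2147483647, the largest count that survives
```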

  • @nicolaerosia : I have raised a bug with engineering; the bug number is VER-78969. If you need more information about it, please raise a support case so that I can provide more details.
