Corrupt PARQUET dump when dumping over 2^31 rows — Vertica Forum

Vertica v10.1.1-7 and v11.0.0-0 (the versions I have tried) have a bug when exporting a parquet file with more than 2^31/2^32 rows: the export writes a num_rows value that is invalid.

When I try to load the parquet file, Vertica reports that 0 rows were loaded.
If I inspect the parquet with parquet-tools, I get a negative number of rows, see https://github.com/ktrueda/parquet-tools/issues/18

Output for a v10.1.1-7 dump:

~/.local/bin/parquet-tools inspect ./dump-1.parquet 

############ file meta data ############
created_by: parquet-cpp version 1.0.0-v10.1.1 (build Vertica Analytic Database)
num_columns: 3
num_rows: -1803805480
num_row_groups: 1
format_version: 1.0
serialized_size: 329

Output for a v11.0.0-0 dump:

~/.local/bin/parquet-tools inspect ./dump-2parquet 

############ file meta data ############
created_by: parquet-cpp version 1.0.0-v11.0.0 (build Vertica Analytic Database)
num_columns: 3
num_rows: -1766273362
num_row_groups: 1
format_version: 1.0
serialized_size: 328

Seems like you're using a really old parquet-cpp, maybe bump to a newer version? https://github.com/apache/parquet-cpp/releases
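
The negative num_rows values above look like what a signed 32-bit counter produces once it passes 2^31. The following is a minimal sketch of that arithmetic, assuming the row count is truncated to 32 bits somewhere in the export path (the thread does not confirm where). The helper names `to_int32` and `smallest_true_count` are hypothetical, and the recovered counts are only the smallest candidates: any multiple of 2^32 could be added to them.

```python
# Assumption: num_rows was truncated to a signed 32-bit integer on export.
INT32_MODULUS = 2**32

def to_int32(n: int) -> int:
    """Reinterpret the low 32 bits of n as a signed 32-bit integer."""
    n &= 0xFFFFFFFF
    return n - INT32_MODULUS if n >= 2**31 else n

def smallest_true_count(reported: int) -> int:
    """Smallest non-negative row count that would wrap to `reported`."""
    return reported % INT32_MODULUS

# num_rows values reported by parquet-tools for the two dumps.
for reported in (-1803805480, -1766273362):
    candidate = smallest_true_count(reported)
    # Wrapping the candidate must reproduce the reported value.
    assert to_int32(candidate) == reported
    print(reported, "->", candidate)
```

Under that assumption the two dumps would hold at least 2491161816 and 2528693934 rows, both above 2^31 = 2147483648.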

Comments

  • SruthiA Administrator

    @nicolaerosia : Thank you for the update. Did you try to run export to parquet to HDFS or local? Could you please share your export to parquet SQL statement?

  • nicolaerosia Vertica Customer

    Hello, sorry for the delay.
    I ran the export to local, e.g. EXPORT TO PARQUET (directory='/mnt/aaa', compression='ZSTD') AS SELECT * FROM mytable;
    Any feedback?

  • SruthiA Administrator

    @nicolaerosia : I ran the test, and it created 3 parquet files for me. Vertica provides a built-in function called GET_METADATA, and all 3 files show positive row counts. I was able to copy the data back into a new table. However, the copied row count is not the same as the source table's. I will check on that part.

    dbadmin=> select version();

    version

    Vertica Analytic Database v10.1.1-6
    (1 row)

    dbadmin=>

    dbadmin=> EXPORT TO PARQUET (directory='/home/dbadmin/dumpdata', compression='ZSTD') AS SELECT * FROM test_export;

    Rows Exported

    4294971392
    

    (1 row)

    cd dumpdata/
    [dbadmin@anamula1 dumpdata]$ ls
    73df61a8-v_perf_gooddata_node0001-140694953912064-0.parquet 8099ace5-v_perf_gooddata_node0001-140693523654400-0.parquet d48d831d-v_perf_gooddata_node0001-140688549205760-0.parquet

    dbadmin=> SELECT GET_METADATA('/home/dbadmin/dumpdata/73df61a8-v_perf_gooddata_node0001-140694953912064-0.parquet');

    GET_METADATA

    schema:
    message schema {
    optional int64 i (Int(bitWidth=64, isSigned=true));
    }

    metadata:
    {
    "FileName": "/home/dbadmin/dumpdata/73df61a8-v_perf_gooddata_node0001-140694953912064-0.parquet",
    "FileFormat": "Parquet",
    "Version": "1.0",
    "CreatedBy": "parquet-cpp version 1.0.0-v10.1.1 (build Vertica Analytic Database)",
    "TotalRows": "474794880",
    "NumberOfRowGroups": "1",
    "NumberOfRealColumns": "1",
    "NumberOfColumns": "1",
    "Columns": [
    { "Id": "0", "Name": "i", "PhysicalType": "INT64", "ConvertedType": "INT_64", "LogicalType": {"Type": "Int", "bitWidth": 64, "isSigned": true} }
    ],
    "RowGroups": [
    {
    "Id": "0", "TotalBytes": "4633135", "Rows": "474794880",
    "ColumnChunks": [
    {"Id": "0", "Values": "474794880", "StatsSet": "True", "Stats": {"NumNulls": "0", "DistinctValues": "0", "Max": "1048576", "Min": "24" },
    "Compression": "ZSTD", "Encodings": "PLAIN_DICTIONARY PLAIN RLE PLAIN ", "UncompressedSize": "3797901894", "CompressedSize": "4633135" }
    ]
    }
    ]
    }

    (1 row)

    dbadmin=> SELECT GET_METADATA('/home/dbadmin/dumpdata/8099ace5-v_perf_gooddata_node0001-140693523654400-0.parquet');

    GET_METADATA

    schema:
    message schema {
    optional int64 i (Int(bitWidth=64, isSigned=true));
    }

    metadata:
    {
    "FileName": "/home/dbadmin/dumpdata/8099ace5-v_perf_gooddata_node0001-140693523654400-0.parquet",
    "FileFormat": "Parquet",
    "Version": "1.0",
    "CreatedBy": "parquet-cpp version 1.0.0-v10.1.1 (build Vertica Analytic Database)",
    "TotalRows": "480386817",
    "NumberOfRowGroups": "1",
    "NumberOfRealColumns": "1",
    "NumberOfColumns": "1",
    "Columns": [
    { "Id": "0", "Name": "i", "PhysicalType": "INT64", "ConvertedType": "INT_64", "LogicalType": {"Type": "Int", "bitWidth": 64, "isSigned": true} }
    ],
    "RowGroups": [
    {
    "Id": "0", "TotalBytes": "4667222", "Rows": "480386817",
    "ColumnChunks": [
    {"Id": "0", "Values": "480386817", "StatsSet": "True", "Stats": {"NumNulls": "0", "DistinctValues": "0", "Max": "1048576", "Min": "0" },
    "Compression": "ZSTD", "Encodings": "PLAIN_DICTIONARY PLAIN RLE PLAIN ", "UncompressedSize": "3842589855", "CompressedSize": "4667222" }
    ]
    }
    ]
    }

    (1 row)

    dbadmin=> SELECT GET_METADATA('/home/dbadmin/dumpdata/d48d831d-v_perf_gooddata_node0001-140688549205760-0.parquet');

    GET_METADATA

    schema:
    message schema {
    optional int64 i (Int(bitWidth=64, isSigned=true));
    }

    metadata:
    {
    "FileName": "/home/dbadmin/dumpdata/d48d831d-v_perf_gooddata_node0001-140688549205760-0.parquet",
    "FileFormat": "Parquet",
    "Version": "1.0",
    "CreatedBy": "parquet-cpp version 1.0.0-v10.1.1 (build Vertica Analytic Database)",
    "TotalRows": "474813823",
    "NumberOfRowGroups": "1",
    "NumberOfRealColumns": "1",
    "NumberOfColumns": "1",
    "Columns": [
    { "Id": "0", "Name": "i", "PhysicalType": "INT64", "ConvertedType": "INT_64", "LogicalType": {"Type": "Int", "bitWidth": 64, "isSigned": true} }
    ],
    "RowGroups": [
    {
    "Id": "0", "TotalBytes": "4554914", "Rows": "474813823",
    "ColumnChunks": [
    {"Id": "0", "Values": "474813823", "StatsSet": "True", "Stats": {"NumNulls": "0", "DistinctValues": "0", "Max": "1048576", "Min": "18" },
    "Compression": "ZSTD", "Encodings": "PLAIN_DICTIONARY PLAIN RLE PLAIN ", "UncompressedSize": "3798335893", "CompressedSize": "4554914" }
    ]
    }
    ]
    }

    dbadmin=> create table test_export_load(i int);
    CREATE TABLE

    dbadmin=> copy test_export_load from '/home/dbadmin/dumpdata/*' parquet;

    Rows Loaded

    1429995520
    (1 row)

    dbadmin=>
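
The row-count mismatch mentioned in this comment can be checked with plain arithmetic. The sketch below only uses numbers copied from the output above; the variable names are illustrative, and it does not explain why the counts differ.

```python
# Row counts copied from the GET_METADATA, EXPORT and COPY output above.
file_rows = {
    "73df61a8": 474794880,
    "8099ace5": 480386817,
    "d48d831d": 474813823,
}
rows_exported = 4294971392  # reported by EXPORT TO PARQUET
rows_loaded = 1429995520    # reported by COPY ... PARQUET

metadata_total = sum(file_rows.values())
missing = rows_exported - metadata_total

print(metadata_total)         # sum of the three TotalRows values
print(missing)                # rows not accounted for in the metadata
print(rows_exported - 2**32)  # how far the export is past 2^32
```

The three TotalRows values add up to 1429995520, which matches Rows Loaded exactly, so COPY loaded what the file metadata declares. That is 2864975872 rows fewer than Rows Exported, which itself is 2^32 + 4096.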

  • SruthiA Administrator

    If I run the test on a single node cluster, then I can see the negative row count in the parquet file. Are you running it on a single node cluster?

    dbadmin=> select get_metadata('/home/dbadmin/parquet_export/3a7ffa6d-v_testparexport_node0001-139764795692800.parquet' );

    get_metadata

    schema:
    message schema {
    optional int64 i (Int(bitWidth=64, isSigned=true));
    }

    metadata:
    {
    "FileName": "/home/dbadmin/parquet_export/3a7ffa6d-v_testparexport_node0001-139764795692800.parquet",
    "FileFormat": "Parquet",
    "Version": "1.0",
    "CreatedBy": "parquet-cpp version 1.0.0-v10.1.1 (build Vertica Analytic Database)",
    "TotalRows": "-2147483648",
    "NumberOfRowGroups": "1",
    "NumberOfRealColumns": "1",
    "NumberOfColumns": "1",
    "Columns": [
    { "Id": "0", "Name": "i", "PhysicalType": "INT64", "ConvertedType": "INT_64", "LogicalType": {"Type": "Int", "bitWidth": 64, "isSigned": true} }
    ],
    "RowGroups": [
    {
    "Id": "0", "TotalBytes": "26859100", "Rows": "-2147483648",
    "ColumnChunks": [
    {"Id": "0", "Values": "2147483648", "StatsSet": "False",
    "Compression": "ZSTD", "Encodings": "PLAIN_DICTIONARY PLAIN RLE PLAIN ", "UncompressedSize": "17181377050", "CompressedSize": "26859100" }
    ]
    }
    ]
    }

    (1 row)
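
In the single-node output above, "Rows" is -2147483648 while the column chunk's "Values" is 2147483648. A minimal sketch of a consistency check over the JSON part of GET_METADATA output follows. The helper `find_row_count_problems` is hypothetical, the JSON is abridged from the output above, and comparing "Values" to "Rows" assumes flat (non-nested) columns like the single int64 column here.

```python
import json

# Abridged copy of the "metadata:" JSON from the single-node output above.
metadata_json = """
{
  "TotalRows": "-2147483648",
  "RowGroups": [
    {"Id": "0", "Rows": "-2147483648",
     "ColumnChunks": [{"Id": "0", "Values": "2147483648"}]}
  ]
}
"""

def find_row_count_problems(metadata: dict) -> list[str]:
    """Hypothetical helper: list row counts that look inconsistent."""
    problems = []
    if int(metadata["TotalRows"]) < 0:
        problems.append("TotalRows is negative")
    for group in metadata["RowGroups"]:
        rows = int(group["Rows"])
        if rows < 0:
            problems.append(f"row group {group['Id']}: Rows is negative")
        # Assumption: flat columns, so each chunk should hold one value per row.
        for chunk in group["ColumnChunks"]:
            if int(chunk["Values"]) != rows:
                problems.append(
                    f"row group {group['Id']}, column {chunk['Id']}: "
                    f"Values={chunk['Values']} but Rows={rows}"
                )
    return problems

problems = find_row_count_problems(json.loads(metadata_json))
for problem in problems:
    print(problem)
```

For this file the check reports three problems: a negative TotalRows, a negative Rows in row group 0, and a Values/Rows mismatch in its only column chunk. The Parquet format itself stores num_rows as a 64-bit integer, so the negative value is not a limit of the file format.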

  • SruthiA Administrator

    @nicolaerosia : I have raised a bug with engineering; the bug number is VER-78969. If you need more information about it, please raise a support case so that I can provide more details.

  • nicolaerosia Vertica Customer

    Yes, I'm running a single-node Vertica cluster, thank you.

  • nicolaerosia Vertica Customer

    @SruthiA I see VER-78969 is mentioned as fixed in 11.0.1 - I will give it a try!

  • SruthiA Administrator

    @nicolaerosia : Thank you for reporting the bug. Yes, it has been fixed in 11.0.1 and is being backported to 10.1.1. I planned to update this thread once the 10.1.1 hotfix is released, since you mentioned 10.1.1 and 11.0 in the description of the issue.

  • SruthiA Administrator
  • nicolaerosia Vertica Customer

    @SruthiA thank you
