Newbie: Loading Parquet file from S3 with correct column count match to table, fails:Too few Columns

unimagine · December 2018

Hi all,
I'm new to Vertica, so apologies if this is a standard question. I've done due diligence and trawled the net, and this forum (and docs) and still don't have a solution.

I have a parquet file which i have loaded into s3. It has 42 columns. I have a table which I have carefully created with... 42 columns. I am loading this test parquet file of 42 columns into my table of 42 columns with the following command:

COPY . FROM 's3:///VERTICA-ELT/data_so_2016-01-01_2018-12-09_text.parquet' PARQUET

When I run, I have 'success', but the table is not loaded (no records)

When I check rejections with the following query:

SELECT * FROM .s3_load_rejections_data_so_raw

I get a 'rejected_reason' of 'Too few columns found'.

If it was a csv I would expect this to be a pro

Jim_Knicely · December 2018

Hi,

I use a MS Windows tool called ParquetViewer to view & query Apache Parquet files when I am having issues loading them into Vertica. It will show you the data along with the columns names so that you can verify that the file indeed has 42 columns.

See:
https://github.com/mukunku/ParquetViewer

Is that the full COPY command that you executed? I don't see a REJECTED DATA TABLE clause so not sure how you were able to get the s3_load_rejections_data_so_raw table.

Typically,when loading a parquet file with less columns than the table, Vertica will just insert NULLs into the trailing columns...

Example:

dbadmin=> CREATE TABLE test_parq (c1 INT, c2 INT);
CREATE TABLE

dbadmin=> INSERT INTO test_parq SELECT 1, 1;
 OUTPUT
--------
      1
(1 row)

dbadmin=> INSERT INTO test_parq SELECT 3, 2;
 OUTPUT
--------
      1
(1 row)

dbadmin=> EXPORT TO PARQUET (DIRECTORY='/home/dbadmin/test_parq_out') AS SELECT * FROM test_parq;
 Rows Exported
---------------
             2
(1 row)

dbadmin=> CREATE TABLE test_parq2 (c1 INT, c2 INT, c3 INT, c4 INT);
CREATE TABLE

dbadmin=> COPY test_parq2 FROM '/home/dbadmin/test_parq_out/*.parquet' PARQUET REJECTED DATA TABLE test_parq2_bad;
 Rows Loaded
-------------
           2
(1 row)

dbadmin=> SELECT * FROM test_parq2;
 c1 | c2 | c3 | c4
----+----+----+----      1 |  1 |    |
  3 |  2 |    |
(2 rows)

unimagine · December 2018

Hi Jim, yes I found the same Parquet visualiser. although there are some blank / null fields within the parquet file, the number of columns is correct. The parquet file is produced as a part of a data flow in another tool, but I validated the output using the parquet visualiser before loading into vertica. The column count was the expected 42 columns. Indicentally, if I produce an alternative csv format output from the external data flow (which has 42 columns), I can load the table without problems. My first instinct is that if this were a csv file, I would look for an issue with delimiters but with the parquet format I'm not sure how it works enough to make a modification

Jim_Knicely · December 2018

@unimagine - This might sound dumb, but I would try creating a table with with less columns and see if I could load the file.

I would use all VARCHAR columns to be sure I don't have to worry about data types. So, I'd create a table with 41 VARCHAR columns, than 42 VARCHAR columns, etc. until I can load the file.

Once loaded I'd look for what is missing from that CSV file you loaded.

Could be that the PARQUET parser is not reading your file correctly, and we'll want to fix that!

If the data is not sensitive in your parquet file, maybe you could send it to me and I'll see what I can do to get it loaded.

Send me an email @ james.knicely@microfocus,com

You can attach the file, or I can provide an FTP upload site for you.

We're Moving!

Create My New Community Account Now

Newbie: Loading Parquet file from S3 with correct column count match to table, fails:Too few Columns

Comments

Leave a Comment