Load data from spark dataframe to vertica from AWS Glue

pg_operative · July 2019

We are using Vertica version 9.2.1. AWS Glue as ETL tool. Trying to load the data from pyspark data frame to Vertica. Getting below error

An error occurred while calling o1907.save.: java.lang.Exception: ERROR: S2V.save(): did not pass the Vertica requirements pre-check. The following problems were encountered: hdfs_url scheme should be 'hdfs', but user provided:null. hdfs_url path is not valid, user provided:. java.lang.IllegalArgumentException: Can not create a Path from an empty string
at com.vertica.spark.s2v.S2V.save(S2V.scala:491)

--data loading step
dataFrame4.write.save(format="com.vertica.spark.datasource.DefaultSource", mode="append", **opts)

opts={}
opts['dbschema'] = 'staging'
opts['table'] = 'fact_rating_aggregate_stage'
opts['db']='*'
opts['user']='etluser'
opts['password']='****'
opts['host']='***'

Glue service is serverless and brings servers on the fly and processes data. I have not given the hdfs_url as this is keep changing. Please help

Bryan_H · July 2019

Vertica uses the specified HDFS URL to write ORC files that are ingested into Vertica.
Here is a workaround: write the data frames in ORC or Parquet format on S3, then use the vertica-python client (https://github.com/vertica/vertica-python) to run SQL commands to COPY the temporary files from S3 into the target table.
I would recommend opening a support ticket so we can work on other options for a future patch (e.g. writing temp files to S3, or streaming directly to Vertica)

pg_operative · July 2019

I am getting below error in while connecting to Vertica. Though connecting through same code works fine in my local

File "/tmp/glue-python-scripts-51dayeq0/vertica_python_poc.py", line 1, in
File "/glue/lib/installation/vertica_python-0.9.3-py3.6.egg/vertica_python/init.py", line 38, in
from .vertica.connection import Connection, connect
File "/glue/lib/installation/vertica_python-0.9.3-py3.6.egg/vertica_python/vertica/connection.py", line 60, in
DEFAULT_USER = getpass.getuser()
File "/usr/local/lib/python3.6/getpass.py", line 169, in getuser
return pwd.getpwuid(os.getuid())[0]
KeyError: 'getpwuid(): uid not found: 10000'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/tmp/runscript.py", line 134, in
raise e_type(e_value).with_tracsback(new_stack)
AttributeError: 'KeyError' object has no attribute 'with_tracsback'

Bryan_H · July 2019

I wouldn't expect getpass.getuser() to work in Lambda or Glue because there's probably not a real user associated with the process. I'll file a GitHub issue (#296) to add error checking here and set "dbadmin" as the default user on error.
You could fork the code and change vertica/connection.py to set DEFAULT_USER to "dbadmin" in the meantime.

pg_operative · July 2019

Dont have that option as I can only provide egg file. it will be installed on the fly.

Bryan_H · July 2019

Clarifying the above from GitHub discussion, DEFAULT_USER should be set to blank or 'nobody' on error for security reasons.

Bryan_H · July 2019

Another workaround is to run a cron job to check the S3 bucket for waiting Parquet files and push them to Vertica. We expect this to be only temporary; the vertica-python package should be fixed soon, and I've put in an enhancement request to add S3 support to the Vertica Spark Connector package also.

pg_operative · July 2019

for now I have modified the library to not to refer getuser() fucnction.

Bryan_H · July 2019

This issue should be fixed in python-vertica 0.9.4 which has just been published. Let us know if there is still an issue.

pg_operative · August 2019

Checked against 0.9.4, error still exists.
(, KeyError('getpwuid(): uid not found: 10000',), )

Bryan_H · August 2019

Did you specify a 'user' in the connection options? Here is the code from the driver:

    if 'user' not in self.options:
        try:
            self.options['user'] = getpass.getuser()
        except Exception as e:
            self._logger.error(
                "Failed to set default value for connection 'user': {}".format(str(e)))
            raise KeyError('Connection option "user" is required')

Could you post the full stack trace so we can see whether there is another instance of "getuser", or whether there is still an unhandled exception? Thanks...

Bryan_H · August 2019

Dev has reviewed and fixed this. Please update to the just released v0.9.5, which fixes this issue with AWS Glue serverless environment.

We're Moving!

Create My New Community Account Now

Load data from spark dataframe to vertica from AWS Glue

Answers

Leave a Comment