Load data from spark dataframe to vertica from AWS Glue

We are using Vertica version 9.2.1. AWS Glue as ETL tool. Trying to load the data from pyspark data frame to Vertica. Getting below error

An error occurred while calling o1907.save.
java.lang.Exception: ERROR: S2V.save(): did not pass the Vertica requirements pre-check. The following problems were encountered: hdfs_url scheme should be 'hdfs', but user provided:null. hdfs_url path is not valid, user provided:. java.lang.IllegalArgumentException: Can not create a Path from an empty string
at com.vertica.spark.s2v.S2V.save(S2V.scala:491)

--data loading step
dataFrame4.write.save(format="com.vertica.spark.datasource.DefaultSource", mode="append", **opts)

opts={}
opts['dbschema'] = 'staging'
opts['table'] = 'fact_rating_aggregate_stage'
opts['db']='*'
opts['user']='etluser'
opts['password']='****'
opts['host']='***'

Glue service is serverless and brings servers on the fly and processes data. I have not given the hdfs_url as this is keep changing. Please help

Tagged:

Answers

  • Bryan_HBryan_H Vertica Employee Administrator

    Vertica uses the specified HDFS URL to write ORC files that are ingested into Vertica.
    Here is a workaround: write the data frames in ORC or Parquet format on S3, then use the vertica-python client (https://github.com/vertica/vertica-python) to run SQL commands to COPY the temporary files from S3 into the target table.
    I would recommend opening a support ticket so we can work on other options for a future patch (e.g. writing temp files to S3, or streaming directly to Vertica)

  • I am getting below error in while connecting to Vertica. Though connecting through same code works fine in my local

    File "/tmp/glue-python-scripts-51dayeq0/vertica_python_poc.py", line 1, in
    File "/glue/lib/installation/vertica_python-0.9.3-py3.6.egg/vertica_python/init.py", line 38, in
    from .vertica.connection import Connection, connect
    File "/glue/lib/installation/vertica_python-0.9.3-py3.6.egg/vertica_python/vertica/connection.py", line 60, in
    DEFAULT_USER = getpass.getuser()
    File "/usr/local/lib/python3.6/getpass.py", line 169, in getuser
    return pwd.getpwuid(os.getuid())[0]
    KeyError: 'getpwuid(): uid not found: 10000'

    During handling of the above exception, another exception occurred:

    Traceback (most recent call last):
    File "/tmp/runscript.py", line 134, in
    raise e_type(e_value).with_tracsback(new_stack)
    AttributeError: 'KeyError' object has no attribute 'with_tracsback'

  • Bryan_HBryan_H Vertica Employee Administrator

    I wouldn't expect getpass.getuser() to work in Lambda or Glue because there's probably not a real user associated with the process. I'll file a GitHub issue (#296) to add error checking here and set "dbadmin" as the default user on error.
    You could fork the code and change vertica/connection.py to set DEFAULT_USER to "dbadmin" in the meantime.

  • Dont have that option as I can only provide egg file. it will be installed on the fly.

  • Bryan_HBryan_H Vertica Employee Administrator

    Clarifying the above from GitHub discussion, DEFAULT_USER should be set to blank or 'nobody' on error for security reasons.

  • Bryan_HBryan_H Vertica Employee Administrator

    Another workaround is to run a cron job to check the S3 bucket for waiting Parquet files and push them to Vertica. We expect this to be only temporary; the vertica-python package should be fixed soon, and I've put in an enhancement request to add S3 support to the Vertica Spark Connector package also.

  • for now I have modified the library to not to refer getuser() fucnction.

  • Bryan_HBryan_H Vertica Employee Administrator

    This issue should be fixed in python-vertica 0.9.4 which has just been published. Let us know if there is still an issue.

  • Checked against 0.9.4, error still exists.
    (, KeyError('getpwuid(): uid not found: 10000',), )

  • Bryan_HBryan_H Vertica Employee Administrator

    Did you specify a 'user' in the connection options? Here is the code from the driver:

        if 'user' not in self.options:
            try:
                self.options['user'] = getpass.getuser()
            except Exception as e:
                self._logger.error(
                    "Failed to set default value for connection 'user': {}".format(str(e)))
                raise KeyError('Connection option "user" is required')
    

    Could you post the full stack trace so we can see whether there is another instance of "getuser", or whether there is still an unhandled exception? Thanks...

  • Bryan_HBryan_H Vertica Employee Administrator

    Dev has reviewed and fixed this. Please update to the just released v0.9.5, which fixes this issue with AWS Glue serverless environment.

Leave a Comment

BoldItalicStrikethroughOrdered listUnordered list
Emoji
Image
Align leftAlign centerAlign rightToggle HTML viewToggle full pageToggle lights
Drop image/file