Load data from spark dataframe to vertica from AWS Glue
We are using Vertica version 9.2.1. AWS Glue as ETL tool. Trying to load the data from pyspark data frame to Vertica. Getting below error
- An error occurred while calling o1907.save.
- java.lang.Exception: ERROR: S2V.save(): did not pass the Vertica requirements pre-check. The following problems were encountered: hdfs_url scheme should be 'hdfs', but user provided:null. hdfs_url path is not valid, user provided:. java.lang.IllegalArgumentException: Can not create a Path from an empty string
at com.vertica.spark.s2v.S2V.save(S2V.scala:491)
--data loading step
dataFrame4.write.save(format="com.vertica.spark.datasource.DefaultSource", mode="append", **opts)
opts={}
opts['dbschema'] = 'staging'
opts['table'] = 'fact_rating_aggregate_stage'
opts['db']='*'
opts['user']='etluser'
opts['password']='****'
opts['host']='***'
Glue service is serverless and brings servers on the fly and processes data. I have not given the hdfs_url as this is keep changing. Please help
Tagged:
0
Answers
Vertica uses the specified HDFS URL to write ORC files that are ingested into Vertica.
Here is a workaround: write the data frames in ORC or Parquet format on S3, then use the vertica-python client (https://github.com/vertica/vertica-python) to run SQL commands to COPY the temporary files from S3 into the target table.
I would recommend opening a support ticket so we can work on other options for a future patch (e.g. writing temp files to S3, or streaming directly to Vertica)
I am getting below error in while connecting to Vertica. Though connecting through same code works fine in my local
File "/tmp/glue-python-scripts-51dayeq0/vertica_python_poc.py", line 1, in
File "/glue/lib/installation/vertica_python-0.9.3-py3.6.egg/vertica_python/init.py", line 38, in
from .vertica.connection import Connection, connect
File "/glue/lib/installation/vertica_python-0.9.3-py3.6.egg/vertica_python/vertica/connection.py", line 60, in
DEFAULT_USER = getpass.getuser()
File "/usr/local/lib/python3.6/getpass.py", line 169, in getuser
return pwd.getpwuid(os.getuid())[0]
KeyError: 'getpwuid(): uid not found: 10000'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/tmp/runscript.py", line 134, in
raise e_type(e_value).with_tracsback(new_stack)
AttributeError: 'KeyError' object has no attribute 'with_tracsback'
I wouldn't expect getpass.getuser() to work in Lambda or Glue because there's probably not a real user associated with the process. I'll file a GitHub issue (#296) to add error checking here and set "dbadmin" as the default user on error.
You could fork the code and change vertica/connection.py to set DEFAULT_USER to "dbadmin" in the meantime.
Dont have that option as I can only provide egg file. it will be installed on the fly.
Clarifying the above from GitHub discussion, DEFAULT_USER should be set to blank or 'nobody' on error for security reasons.
Another workaround is to run a cron job to check the S3 bucket for waiting Parquet files and push them to Vertica. We expect this to be only temporary; the vertica-python package should be fixed soon, and I've put in an enhancement request to add S3 support to the Vertica Spark Connector package also.
for now I have modified the library to not to refer getuser() fucnction.
This issue should be fixed in python-vertica 0.9.4 which has just been published. Let us know if there is still an issue.
Checked against 0.9.4, error still exists.
(, KeyError('getpwuid(): uid not found: 10000',), )
Did you specify a 'user' in the connection options? Here is the code from the driver:
Could you post the full stack trace so we can see whether there is another instance of "getuser", or whether there is still an unhandled exception? Thanks...
Dev has reviewed and fixed this. Please update to the just released v0.9.5, which fixes this issue with AWS Glue serverless environment.