VerticaPy and Vdataframe question

kfruchtman · May 2021

Hi Experts!
After finding out about VerticaPy I have imported it to my python code.and I want to create Vdataframes and insert them back into the vertica db.
My workflow:
--connect to ftp to download csv files.
--insert csv files into dataframes - prefered Vdataframe ***** that is my questions****
--manipulate the data as I wish
--insert into the tables in vertica db. with Vdataframe.to_db

I have been looking for a way to insert csv file to Vdataframe .It can only be inserted with read_csv of pandas and not of the verticapy so after that I have a pandas dataframe and not a Vdataframe. After it is inserted to a pandas regular dataframe I cannot work with it with the Vdataframe functions..Using it as a Vdataframe would make it much easier and logical since working with Verticapy is much easier when the db is Vertica and this api has amazing capabilities to work together and make things work easy and fast.
E.g -
In my python code
new_auto_connection({"host": "x.x.x.x",
"port": "5433",
"database": "mydb",
"password": "",
"user": "dbadmin"},
name = "VerticaDSN")
change_auto_connection("VerticaDSN")
set_option("sql_on", True)
-- after downloading my CSV files from ftp to a local directory:
csv_df = pd.read_csv('C:my_path\mycsv.csv')

Now the csv_df is a pandas dataframe and not a Vdataframe and I cannot use the Vdataframe functions on it.
The only workaround I can think of is inserting this csv file into a vertica table in the db and then reading it back as a Vdataframe.- but it is such a crooked way of solving it with so much unnecessary overhead.

Any ideas?

P.S - I found out that the opposite option exists : Vdataframe.to_pandas , but obviously I need the other way around.
Many thanks
Keren

kfruchtman · May 2021

anybody?

badrouali · May 2021

Hi Keren,
vDataFrames are pointing on some in-DB relation. To solve your problem you can use the VerticaPy read_csv function. It will ingest the file and create a new relation: https://www.vertica.com/python/documentation_last/utilities/read_csv/index

Badr

kfruchtman · May 2021

Hi Badr
Thanks for your answer.So if It is only a pointer to relational tables in the db , every change I will make on the vDataframe (e.g - adding columns, changing column names, aggregations etc...) would be changed immediately in the db? I don't have to input it to the db? and if the vDataframe is only a pointer to the db so what is the point of having a vdataframe.to_db function?
Thanks for helping me understand!
Keren

badrouali · May 2021

Hi Keren,
Not really, it is working as a Relation. Take it as a constantly updated view. By doing modifications, you'll increase the SQL pipeline code generation. The vDataFrame.to_db is just used to store it inside the DB: Creating a new view, table or temporary table. It can be used to make the process faster.
For example:

from verticapy.datasets import *
titanic = load_titanic()
print(titanic.current_relation())

# Output
"public"."titanic"

If I do some modifications:

titanic["boat"].drop()
titanic.fillna()
print(titanic.current_relation())

# Output
(
   SELECT
     "pclass",
     "survived",
     "name",
     "gender",
     COALESCE("age", 30.1524573721163) AS "age",
     "sibsp",
     "parch",
     "ticket",
     COALESCE("fare", 33.9637936739659) AS "fare",
     COALESCE("cabin", 'C23 C25 C2') AS "cabin",
     COALESCE("embarked", 'S') AS "embarked",
     COALESCE("body", 164.14406779661) AS "body",
     COALESCE("homedest", 'New York, NY') AS "homedest",
     "split" 
   FROM
 (
   SELECT
     "pclass",
     "survived",
     "name",
     "gender",
     "age",
     "sibsp",
     "parch",
     "ticket",
     "fare",
     "cabin",
     "embarked",
     "body",
     "homedest",
     "split" 
   FROM
 "public"."titanic") 
VERTICAPY_SUBTABLE) 
VERTICAPY_SUBTABLE

Of course, if I do thousand of modifications then the SQL pipeline becomes huge and then in order to work more properly, I can store it as a table. I used the "drop" method but it will just not select the column in the SQL statement. Nothing affects your database relations except if you implicitly do it.

drop("public.titanic")

will drop the table.

The biggest advantage is to be able to use sophisticated methods without having to do in-memory computations.

kfruchtman · May 2021

Thanks for your reply!
I think I do understand it better now.
Like everything in life inorder to understand it perfect I I need to practice it and do hands on.
I am struggling to install the internal datasets for that now.VerticaPy seems amazing!
Many thanks
Keren

VerticaPy and Vdataframe question

Answers

Leave a Comment