Throughput concern when inserting data into the DB
First of all, this is the second time I have posted this question. The first one was "under examination" by the forum admin and then it just disappeared... Is it because I used the "code" format for some of the code? Anyway, if you see a duplicate, please ignore it.
We have a Vertica DB cluster (3 nodes) that came in as part of an integrated solution from a vendor. We got permission to also use the DB to store some other traffic stats we have. I created the table shown below. The issue is that it takes over 40 seconds to write 1k entries. I write the data via the Python library I downloaded as part of the client driver.
packetlogic=> \d pason.dscp2
                                       List of Fields by Tables
 Schema | Table |         Column          |     Type     | Size | Default | Not Null | Primary Key | Foreign Key
--------+-------+-------------------------+--------------+------+---------+----------+-------------+-------------
 pason  | dscp2 | time_stamp              | timestamptz  |    8 |         | t        | f           |
 pason  | dscp2 | subscriber              | varchar(256) |  256 |         | f        | f           |
 pason  | dscp2 | ip_address              | varchar(256) |  256 |         | t        | f           |
 pason  | dscp2 | remote_address          | varchar(256) |  256 |         | t        | f           |
 pason  | dscp2 | remote_vhost            | varchar(256) |  256 |         | f        | f           |
 pason  | dscp2 | signature_service_name  | varchar(256) |  256 |         | f        | f           |
 pason  | dscp2 | dscp_out                | int          |    8 |         | t        | f           |
 pason  | dscp2 | bytes_in                | int          |    8 |         | t        | f           |
 pason  | dscp2 | bytes_out               | int          |    8 |         | t        | f           |
This is how I write the data:
cur.executemany("""INSERT INTO pason.dscp2 (time_stamp, subscriber, ip_address, remote_address, signature_service_name, dscp_out, bytes_in, bytes_out) VALUES (?, ?, ?, ?, ?, ?, ?, ?)""", data1)
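For context, the full write path is roughly this (the DSN is a placeholder, and the ODBC-based driver is my assumption based on the ? parameter markers):

import pyodbc

# Placeholder DSN pointing at the Vertica cluster
conn = pyodbc.connect('DSN=VerticaDSN')
cur = conn.cursor()

cur.executemany(
    """INSERT INTO pason.dscp2 (time_stamp, subscriber, ip_address,
       remote_address, signature_service_name, dscp_out, bytes_in, bytes_out)
       VALUES (?, ?, ?, ?, ?, ?, ?, ?)""",
    data1,
)
conn.commit()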
I write the data from a remote host via a Python script. The host is on the same LAN as the DB cluster, with 10 Gb of bandwidth and close to zero latency.
I am not from a DB background, so it could be that my table was not created properly, causing the slowness. The vendor uses the DB constantly with a lot more records and no performance issues, so I don't think it has anything to do with the hardware.
Please help! Thanks!
How many rows do you insert with one call before you COMMIT?
Is it possible to write the output to a CSV file, and then start vsql with a COPY command?
The COPY command (if the columns in the CSV file have the same order as the ones in the table) would be something like this, with the file path as a placeholder:
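COPY pason.dscp2 FROM LOCAL '/tmp/dscp2.csv' DELIMITER ',';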
Then, compare the difference in performance.
Thanks @marcothesane for your quick response. That might be a little hard. The cluster is locked down by the vendor, and I don't know if I can get permission, or get them to save a CSV file onto the host. All I have is ODBC access to the DB... I will ask them.
The "Data1" has 1k entries. It is like this:
I did try copying and pasting 1k INSERT queries directly into the vsql console, and the performance was about the same.
Hi @marcothesane, I just tried it and it completes instantly...
Is there something wrong with the library I am using? Thanks!
Hi @marcothesane, I just tried to use COPY from my script, but the Vertica Python library I have doesn't seem to support it...
Are there any other options for running this COPY command? Thanks!
Check out this example:
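Something along these lines, using vertica_python's cursor.copy() to load a CSV file through COPY FROM STDIN (the host, credentials, and file path below are placeholders):

import csv
import vertica_python

# Placeholder connection details -- point these at one of your nodes
conn_info = {
    'host': 'vertica-node-1',
    'port': 5433,
    'user': 'dbadmin',
    'password': 'secret',
    'database': 'packetlogic',
}

# Step 1: dump the batch to a CSV file
with open('/tmp/dscp2.csv', 'w', newline='') as f:
    csv.writer(f).writerows(data1)

# Step 2: stream the file through a single COPY instead of 1k INSERTs
conn = vertica_python.connect(**conn_info)
cur = conn.cursor()
with open('/tmp/dscp2.csv', 'rb') as f:
    cur.copy(
        "COPY pason.dscp2 (time_stamp, subscriber, ip_address, "
        "remote_address, signature_service_name, dscp_out, "
        "bytes_in, bytes_out) FROM STDIN DELIMITER ','",
        f,
    )
conn.commit()
conn.close()

Since copy() takes a file-like object, an in-memory buffer (e.g. io.StringIO) should also work if you want to skip the temporary file.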
Thanks Jim. I tried it and it works great now. It seems a little wasteful that I have to first save the data to a CSV file and then load it back in with COPY, but performance-wise it is good, so I am happy with the solution. Thanks for all your help.