Throughput concern when inserting data into the DB
First of all, this is the second time I have posted this question. The first one was "under examination" by the forum admin and then it just disappeared... Is it because I used the "code" format for some of the code? Anyway, if you see a duplicate, please ignore it.
We have a Vertica DB cluster (3 nodes) that came in as part of an integrated solution from a vendor. We got permission to also use the DB to store some other traffic stats we have. I created the table shown below. The issue is that it takes over 40 seconds to write 1k entries. I write the data via the Python library I downloaded as part of the client driver.
packetlogic=> \d pason.dscp2
                                       List of Fields by Tables
 Schema | Table |         Column          |     Type     | Size | Default | Not Null | Primary Key | Foreign Key
--------+-------+-------------------------+--------------+------+---------+----------+-------------+-------------
 pason  | dscp2 | time_stamp              | timestamptz  |    8 |         | t        | f           |
 pason  | dscp2 | subscriber              | varchar(256) |  256 |         | f        | f           |
 pason  | dscp2 | ip_address              | varchar(256) |  256 |         | t        | f           |
 pason  | dscp2 | remote_address          | varchar(256) |  256 |         | t        | f           |
 pason  | dscp2 | remote_vhost            | varchar(256) |  256 |         | f        | f           |
 pason  | dscp2 | signature_service_name  | varchar(256) |  256 |         | f        | f           |
 pason  | dscp2 | dscp_out                | int          |    8 |         | t        | f           |
 pason  | dscp2 | bytes_in                | int          |    8 |         | t        | f           |
 pason  | dscp2 | bytes_out               | int          |    8 |         | t        | f           |
This is how I write the data:
cur.executemany("""INSERT INTO pason.dscp2 (time_stamp, subscriber, ip_address, remote_address, signature_service_name, dscp_out, bytes_in, bytes_out) VALUES (?, ?, ?, ?, ?, ?, ?, ?)""", data1)
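For context, the full write path is roughly this (the DSN is a placeholder, and the ODBC-based driver is my assumption based on the ? parameter markers):

import pyodbc

# Placeholder DSN pointing at the Vertica cluster
conn = pyodbc.connect('DSN=VerticaDSN')
cur = conn.cursor()

cur.executemany(
    """INSERT INTO pason.dscp2 (time_stamp, subscriber, ip_address,
       remote_address, signature_service_name, dscp_out, bytes_in, bytes_out)
       VALUES (?, ?, ?, ?, ?, ?, ?, ?)""",
    data1,
)
conn.commit()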
I write the data from a remote host via a Python script. The host is on the same LAN as the DB cluster, with 10 Gb of bandwidth and close to zero latency.
I am not from a DB background, so it could be that my table was not created properly, causing the slowness. The vendor uses the DB constantly with a lot more records and no performance issues, so I don't think it has anything to do with the hardware.
Please help! Thanks!
How many rows do you insert with one call before you COMMIT?
Is it possible to write the output to a CSV file, and then start vsql with a COPY command?
The COPY command (if the columns in the CSV file have the same order as the ones in the table) would be something like this, with the file path as a placeholder:
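COPY pason.dscp2 FROM LOCAL '/tmp/dscp2.csv' DELIMITER ',';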
Then, compare the difference in performance.
Thanks @marcothesane for your quick response. That might be a little hard. The cluster is locked down by the vendor, and I don't know if I can get permission, or get them to save a CSV file onto the host. All I have is ODBC access to the DB... I will ask them.
The "Data1" has 1k entries. It is like this:
I did try copying and pasting 1k INSERT queries directly into the vsql console, and the performance was about the same.
Hi @marcothesane, I just tried it and it completes instantly...
Is there something wrong with the library I am using? Thanks!
Hi @marcothesane, I just tried to use COPY from my script, but the Vertica Python library I have doesn't seem to support it...
Are there any other options for running this COPY command? Thanks!
Check out this example:
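Something along these lines, using vertica_python's cursor.copy() to load a CSV file through COPY FROM STDIN (the host, credentials, and file path below are placeholders):

import csv
import vertica_python

# Placeholder connection details -- point these at one of your nodes
conn_info = {
    'host': 'vertica-node-1',
    'port': 5433,
    'user': 'dbadmin',
    'password': 'secret',
    'database': 'packetlogic',
}

# Step 1: dump the batch to a CSV file
with open('/tmp/dscp2.csv', 'w', newline='') as f:
    csv.writer(f).writerows(data1)

# Step 2: stream the file through a single COPY instead of 1k INSERTs
conn = vertica_python.connect(**conn_info)
cur = conn.cursor()
with open('/tmp/dscp2.csv', 'rb') as f:
    cur.copy(
        "COPY pason.dscp2 (time_stamp, subscriber, ip_address, "
        "remote_address, signature_service_name, dscp_out, "
        "bytes_in, bytes_out) FROM STDIN DELIMITER ','",
        f,
    )
conn.commit()
conn.close()

Since copy() takes a file-like object, an in-memory buffer (e.g. io.StringIO) should also work if you want to skip the temporary file.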
Thanks Jim. I tried it and it works great now. It seems a little wasteful that I have to first save the data to a CSV file and then load it back in with COPY, but performance-wise it is good, so I am happy with the solution. Thanks for all your help.