We're Moving!

The Vertica Forum is moving to a new OpenText Analytics Database (Vertica) Community.

Join us there to post discussion topics, learn about

product releases, share tips, access the blog, and much more.

Create My New Community Account Now


Throughput concern when inserting data into the DB — Vertica Forum

Throughput concern when inserting data into the DB

First of all, this is the second time I post this question. The first one was "under examination" by the forum admin and it just disappeared... Is it because I use the "code" format for some of the code? Anyway, if you see a duplicate please ignore.

We have a Vertica DB cluster (3 nodes) that came in as part of an integrated solution by a vendor. We got permission to also use the DB to store some other traffic stats we have. I created a table that looks like below. The issue is that it takes over 40s to write 1k entries. I write the data via the python library I downloaded as part of the client driver.

packetlogic=> \d pason.dscp2
List of Fields by Tables
Schema | Table | Column | Type | Size | Default | Not Null | Primary Key | Foreign Key
--------+-------+------------------------+--------------+------+---------+----------+-------------+-------------
pason | dscp2 | time_stamp | timestamptz | 8 | | t | f |
pason | dscp2 | subscriber | varchar(256) | 256 | | f | f |
pason | dscp2 | ip_address | varchar(256) | 256 | | t | f |
pason | dscp2 | remote_address | varchar(256) | 256 | | t | f |
pason | dscp2 | remote_vhost | varchar(256) | 256 | | f | f |
pason | dscp2 | signature_service_name | varchar(256) | 256 | | f | f |
pason | dscp2 | dscp_out | int | 8 | | t | f |
pason | dscp2 | bytes_in | int | 8 | | t | f |
pason | dscp2 | bytes_out | int | 8 | | t | f |
(9 rows)

This is how I write the data:
cur.executemany("""INSERT INTO pason.dscp2 (time_stamp, subscriber, ip_address, remote_address, signature_service_name, dscp_out, bytes_in, bytes_out) VALUES (?, ?, ?, ?, ?, ?, ?, ?)""", data1)
cur.execute("COMMIT")

I write the data from a remote host via a python script. The host is in the same LAN as the DB cluster with 10GB bandwidth and close to 0 latency.

I am not from the DB background. It could be that my table was not correctly properly causing the slowness. The vendor uses the DB constantly with a lot more records without a performance issue so I don't think it is anything to do with hardware.

Please help! Thanks!

Comments

  • marcothesanemarcothesane - Select Field - Administrator

    How many rows do you insert with one call , and then COMMIT?

    Is it possible to write the output to a CSV file, and then to start vsql with a COPY command?

    The copy command (if the columns in the CSV file have the same order as the ones in the table) would be:

    COPY pason.dcsp2 FROM LOCAL 'dcsp2.csv' DELIMITER ',' ENCLOSED BY '"' DIRECT REJECTED DATA 'dcsp2.bad' EXCEPTIONS 'dcsp2.log';
    

    Then, compare the difference in performance.

  • edited February 2019

    Thanks @marcothesane for your quick response. It might be a little hard. The cluster is locked down by the vendor. I don't know if I can get permission, or get them to save a CSV file onto the host. All I have is ODBC access to the DB... I will ask them.

    The "Data1" has 1k entries. It is like this:

    data1[0]
    ('2019-02-13 10:00:00', '09d5e206-daba-11e7-b122-00c03aaf89d2', '10.128.67.132', '10.135.3.11', 'SIP', 26, 2911, 4452)
    

    I did try to copy and paste in 1k "insert" SQL queries in the vsql console directly and the performance is about the same.

  • Hi @marcothesane , I just tried it and it works instantly...

    packetlogic=> select getdate();
              getdate
    ----------------------------
     2019-02-15 11:28:09.001883
    (1 row)
    
    packetlogic=> COPY pason.dscp2 FROM LOCAL './dscp2.csv' DELIMITER ',' ENCLOSED BY '"' DIRECT REJECTED DATA 'dscp2.bad' EXCEPTIONS 'dscp2.log';
     Rows Loaded
    -------------
            1000
    (1 row)
    
    packetlogic=> select getdate();
              getdate
    ----------------------------
     2019-02-15 11:28:09.056093
    (1 row)
    
    packetlogic=>
    packetlogic=> commit;
    

    Is there something with the library I am using? Thanks!

  • Hi @marcothesane , I just tried to use the COPY and I used it in my script but the vertica python library doesn't support it...


    Is there any other options to run this COPY command? Thanks!

  • Jim_KnicelyJim_Knicely - Select Field - Administrator
    edited February 2019
  • Thanks Jim. I tried it and it works great now. It only looks to me that the process is a little wasteful because I have to first save the data to a csv then read it by copying it to a DB however performance wise it is good so I am happy with the solution. Thanks for all your help.

Leave a Comment

BoldItalicStrikethroughOrdered listUnordered list
Emoji
Image
Align leftAlign centerAlign rightToggle HTML viewToggle full pageToggle lights
Drop image/file