Throughput concern when inserting data into the DB

dzhaodzhao Registered User

First of all, this is the second time I post this question. The first one was "under examination" by the forum admin and it just disappeared... Is it because I use the "code" format for some of the code? Anyway, if you see a duplicate please ignore.

We have a Vertica DB cluster (3 nodes) that came in as part of an integrated solution by a vendor. We got permission to also use the DB to store some other traffic stats we have. I created a table that looks like below. The issue is that it takes over 40s to write 1k entries. I write the data via the python library I downloaded as part of the client driver.

packetlogic=> \d pason.dscp2
List of Fields by Tables
Schema | Table | Column | Type | Size | Default | Not Null | Primary Key | Foreign Key
--------+-------+------------------------+--------------+------+---------+----------+-------------+-------------
pason | dscp2 | time_stamp | timestamptz | 8 | | t | f |
pason | dscp2 | subscriber | varchar(256) | 256 | | f | f |
pason | dscp2 | ip_address | varchar(256) | 256 | | t | f |
pason | dscp2 | remote_address | varchar(256) | 256 | | t | f |
pason | dscp2 | remote_vhost | varchar(256) | 256 | | f | f |
pason | dscp2 | signature_service_name | varchar(256) | 256 | | f | f |
pason | dscp2 | dscp_out | int | 8 | | t | f |
pason | dscp2 | bytes_in | int | 8 | | t | f |
pason | dscp2 | bytes_out | int | 8 | | t | f |
(9 rows)

This is how I write the data:
cur.executemany("""INSERT INTO pason.dscp2 (time_stamp, subscriber, ip_address, remote_address, signature_service_name, dscp_out, bytes_in, bytes_out) VALUES (?, ?, ?, ?, ?, ?, ?, ?)""", data1)
cur.execute("COMMIT")

I write the data from a remote host via a python script. The host is in the same LAN as the DB cluster with 10GB bandwidth and close to 0 latency.

I am not from the DB background. It could be that my table was not correctly properly causing the slowness. The vendor uses the DB constantly with a lot more records without a performance issue so I don't think it is anything to do with hardware.

Please help! Thanks!

Comments

  • marcothesanemarcothesane Employee, Registered User

    How many rows do you insert with one call , and then COMMIT?

    Is it possible to write the output to a CSV file, and then to start vsql with a COPY command?

    The copy command (if the columns in the CSV file have the same order as the ones in the table) would be:

    COPY pason.dcsp2 FROM LOCAL 'dcsp2.csv' DELIMITER ',' ENCLOSED BY '"' DIRECT REJECTED DATA 'dcsp2.bad' EXCEPTIONS 'dcsp2.log';
    

    Then, compare the difference in performance.

  • dzhaodzhao Registered User
    edited February 15

    Thanks @marcothesane for your quick response. It might be a little hard. The cluster is locked down by the vendor. I don't know if I can get permission, or get them to save a CSV file onto the host. All I have is ODBC access to the DB... I will ask them.

    The "Data1" has 1k entries. It is like this:

    data1[0]
    ('2019-02-13 10:00:00', '09d5e206-daba-11e7-b122-00c03aaf89d2', '10.128.67.132', '10.135.3.11', 'SIP', 26, 2911, 4452)
    

    I did try to copy and paste in 1k "insert" SQL queries in the vsql console directly and the performance is about the same.

  • dzhaodzhao Registered User

    Hi @marcothesane , I just tried it and it works instantly...

    packetlogic=> select getdate();
              getdate
    ----------------------------
     2019-02-15 11:28:09.001883
    (1 row)
    
    packetlogic=> COPY pason.dscp2 FROM LOCAL './dscp2.csv' DELIMITER ',' ENCLOSED BY '"' DIRECT REJECTED DATA 'dscp2.bad' EXCEPTIONS 'dscp2.log';
     Rows Loaded
    -------------
            1000
    (1 row)
    
    packetlogic=> select getdate();
              getdate
    ----------------------------
     2019-02-15 11:28:09.056093
    (1 row)
    
    packetlogic=>
    packetlogic=> commit;
    

    Is there something with the library I am using? Thanks!

  • dzhaodzhao Registered User

    Hi @marcothesane , I just tried to use the COPY and I used it in my script but the vertica python library doesn't support it...

    >>> import vertica_db_client
    >>> cur.execute("""COPY pason.dscp FROM LOCAL '/data/procera/dscp.csv' DELIMITER ',' DIRECT REJECTED DATA 'dscp.bad' EXCEPTIONS 'dscp.log'""")
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    vertica_db_client.NotSupportedError: COPY LOCAL is not supported
    

    Is there any other options to run this COPY command? Thanks!

  • Jim_KnicelyJim_Knicely Administrator, Moderator, Employee, Registered User, VerticaExpert
    edited February 20
  • dzhaodzhao Registered User

    Thanks Jim. I tried it and it works great now. It only looks to me that the process is a little wasteful because I have to first save the data to a csv then read it by copying it to a DB however performance wise it is good so I am happy with the solution. Thanks for all your help.

Leave a Comment

BoldItalicStrikethroughOrdered listUnordered list
Emoji
Image
Align leftAlign centerAlign rightToggle HTML viewToggle full pageToggle lights
Drop image/file