How can I import csv with duplicate data because I was set primary key while table creation?

naresh_way2naresh_way2 Registered User

I have set primary key in vertica table and I tried to copy data from csv into my table but it was not copying because the primary key field has duplicate ids how can skip that row and insert all other data into my table.I was doing like this..
testing=> create table vertical(id int primary key enabled,name varchar);
CREATE TABLE
testing=> copy vertical from '/home/naresh/Desktop/result.csv' parser fcsvparser();
ERROR 6745: Duplicate key values: 'id=3' -- violates constraint 'public.vertical.C_PRIMARY'
teststing=>

My csv is like this
1 abc
2 abcd
3 cba
4 adbc
5 bcd
6 rgukt
7 adbcc
3 erthgf

How can i insert this csv without duplicates and Is there any possible to store that duplicated data in another file?

Answers

  • Jim_KnicelyJim_Knicely Employee, Registered User, VerticaExpert

    One option is to disabled the PK. COPY the data, DELETE the duplicates, and then re-enable the PK.

    Example:

    dbadmin=> \! cat /home/dbadmin/result.csv
    1|abc
    2|abcd
    3|cba
    4|adbc
    5|bcd
    6|rgukt
    7|adbcc
    3|erthgf
    
    dbadmin=> copy vertical from '/home/dbadmin/result.csv';
    ERROR 6745:  Duplicate key values: 'id=3' -- violates constraint 'public.vertical.vertical_pk'
    
    dbadmin=> alter table vertical alter constraint vertical_pk disabled;
    ALTER TABLE
    
    dbadmin=> copy vertical from '/home/dbadmin/result.csv';
     Rows Loaded
    -------------
               8
    (1 row)
    
    dbadmin=> select * from vertical;
     id | stuff
    ----+--------
      1 | abc
      2 | abcd
      3 | cba
      3 | erthgf
      4 | adbc
      5 | bcd
      6 | rgukt
      7 | adbcc
    (8 rows)
    
    dbadmin=> delete from vertical where id || stuff not in (select id || stuff from vertical limit 1 over (partition by id order by id));
     OUTPUT
    --------
          1
    (1 row)
    
    dbadmin=> select * from vertical;
     id | stuff
    ----+-------
      1 | abc
      2 | abcd
      3 | cba
      4 | adbc
      5 | bcd
      6 | rgukt
      7 | adbcc
    (7 rows)
    
    dbadmin=> alter table vertical alter constraint vertical_pk enabled;
    ALTER TABLE
    
  • Jim_KnicelyJim_Knicely Employee, Registered User, VerticaExpert

    Or, you could create a temporary table, load the data there, then copy only the distinct records to your real table...

    Example:

    dbadmin=> create table vertical_temp like vertical;
    CREATE TABLE
    
    dbadmin=> copy vertical_temp from '/home/dbadmin/result.csv';
     Rows Loaded
    -------------
               8
    (1 row)
    
    dbadmin=> select * from vertical_temp;
     id | stuff
    ----+--------
      1 | abc
      2 | abcd
      3 | cba
      3 | erthgf
      4 | adbc
      5 | bcd
      6 | rgukt
      7 | adbcc
    (8 rows)
    
    dbadmin=> insert into vertical (select * from vertical_temp limit 1 over (partition by id order by id));
     OUTPUT
    --------
          7
    (1 row)
    
    dbadmin=> select * from vertical;
     id | stuff
    ----+-------
      1 | abc
      2 | abcd
      3 | cba
      4 | adbc
      5 | bcd
      6 | rgukt
      7 | adbcc
    (7 rows)
    
    dbadmin=> drop table vertical_temp;
    DROP TABLE
    
  • naresh_way2naresh_way2 Registered User

    Hi ,
    Yesterday I got this error I'm unable to solve please resolve this issue thanks in advance,

    events.js:163
    throw er; // Unhandled 'error' event
    ^
    Error: Unavailable: initiator locks for query - Locking failure: Timed out X locking Table:public.test2. X held by [user user_testing (select count(*) from test2;)]. Your current transaction isolation level is READ COMMITTED

  • naresh_way2naresh_way2 Registered User

    Hi,
    Jim_Knicely I have a user with 600 data points, and 5 crores users.How can I store that data in vertica? which is the best procedure to store that data to get counts with in smaller time?Can u suggest me?

  • naresh_way2naresh_way2 Registered User

    Sorry jim I have requirement like this.I have to store arrays in vertica table(not in flex table).If it unable to store what is the best way to store that type of data.My array length lies between 200 to 1000 and cardinality length lies between 10000 to 100000.It exists for each user and no.of users are 10crore plz tell me the solution for this and updates also should be fast.Is this database support for all these requirements?

  • Jim_KnicelyJim_Knicely Employee, Registered User, VerticaExpert

    You could create a wide table where you'd have a column for each data point. Then so that you can get super fast counts, you can create a Live Aggregate Projection.

    Example (Your table would have all 600 data points):

    dbadmin=> select * from wide_table;
     some_unique_key | data_point1 | data_point2 | data_point3 | data_point600
    -----------------+-------------+-------------+-------------+---------------
                   1 |           1 |             |           1 |             3
                   2 |           1 |           1 |             |             3
                   2 |           1 |           1 |           1 |             3
    (3 rows)
    
    dbadmin=> create projection wide_table_lap as select some_unique_key, count(data_point1) data_point1_cnt, count(data_point2) data_point2_cnt, count(data_point3) data_point3_cnt, count(data_point600) data_point600_cnt from wide_table group by some_unique_key;
    WARNING 6852:  Live Aggregate Projection "wide_table_lap" will be created for "wide_table". Data in "wide_table" will be neither updated nor deleted
    WARNING 4468:  Projection <public.wide_table_lap> is not available for query processing. Execute the select start_refresh() function to copy data into this projection.
              The projection must have a sufficient number of buddy projections and all nodes must be up before starting a refresh
    CREATE PROJECTION
    
    dbadmin=> select refresh('wide_table');
                                                                                                                                                     refresh               
    ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
     Refresh completed with the following outcomes:
    Projection Name: [Anchor Table] [Status] [Refresh Method] [Error Count] [Duration (sec)]
    ----------------------------------------------------------------------------------------
    "public"."wide_table_lap": [wide_table] [refreshed] [scratch] [0] [0]
    
    (1 row)
    
    dbadmin=> select some_unique_key, count(data_point1) data_point1_cnt from wide_table group by some_unique_key;
     some_unique_key | data_point1_cnt
    -----------------+-----------------
                   1 |               1
                   2 |               2
    (2 rows)
    
    dbadmin=> explain select some_unique_key, count(data_point1) data_point1_cnt from wide_table group by some_unique_key;
                                                                                                                                      QUERY PLAN                           
    ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
     ------------------------------
     QUERY PLAN DESCRIPTION:
     ------------------------------
    
     explain select some_unique_key, count(data_point1) data_point1_cnt from wide_table group by some_unique_key;
    
     Access Path:
     +-GROUPBY PIPELINED [Cost: 2, Rows: 2 (NO STATISTICS)] (PATH ID: 1)
     |  Aggregates: count(wide_table_lap.data_point1_cnt)
     |  Group By: wide_table_lap.some_unique_key
     | +---> STORAGE ACCESS for public.wide_table_lap (Rewritten LAP) [Cost: 1, Rows: 2 (NO STATISTICS)] (PATH ID: 2)
     | |      Projection: public.wide_table_lap
     | |      Materialize: wide_table_lap.some_unique_key, wide_table_lap.data_point1_cnt
    
    
     ------------------------------
     -----------------------------------------------
     PLAN: BASE QUERY PLAN (GraphViz Format)
     -----------------------------------------------
     digraph G {
     graph [rankdir=BT, label = "BASE QUERY PLAN\nQuery: explain select some_unique_key, count(data_point1) data_point1_cnt from wide_table group by some_unique_key;\n\nAll Nodes Vector: \n\n  node[0]=v_sfdc_node0001 (initiator) Up\n", labelloc=t, labeljust=l ordering=out]
     0[label = "Root \nOutBlk=[UncTuple(2)]", color = "green", shape = "house"];
     1[label = "NewEENode \nOutBlk=[UncTuple(2)]", color = "green", shape = "box"];
     2[label = "GroupByPipe: 1 keys\nAggs:\n  count(wide_table_lap.data_point1_cnt)\nUnc: Integer(8)\nUnc: Integer(8)", color = "green", shape = "box"];
     3[label = "StorageMergeStep: wide_table_lap; 1 sorted\nUnc: Integer(8)\nUnc: Integer(8)", color = "purple", shape = "box"];
     4[label = "ScanStep: wide_table_lap\nsome_unique_key\ndata_point1_cnt\nUnc: Integer(8)\nUnc: Integer(8)", color = "brown", shape = "box"];
     1->0 [label = "V[0] C=2",color = "black",style="bold", arrowtail="inv"];
     2->1 [label = "0",color = "blue"];
     3->2 [label = "0",color = "blue"];
     4->3 [label = "0",color = "blue"];
     }
    (31 rows)
    

    See:
    https://my.vertica.com/docs/9.0.x/HTML/index.htm#Authoring/AnalyzingData/AggregatedData/LiveAggregateProjections.htm

  • naresh_way2naresh_way2 Registered User

    Hi Jim. Thank you for your reply. I have a question regarding the performance of updates in real time. My current performance on 4core server(4GB RAM,Single node) with 5Cr records(23 columns- 21 are int and other are varchar) is 20records updates/s.

    How can I scale it upto 10K updates/s.
    Please suggest me the configurations or the architecture from which I can attain it. I followed this url: https://my.vertica.com/docs/8.1.x/HTML/index.htm#Authoring/AnalyzingData/Optimizations /OptimizingDELETEsAndUPDATEsForPerformance.htm but It is not giving me the result I want. Can you also suggest me the scale I can get from this configuration that I described before.

  • naresh_way2naresh_way2 Registered User

    Hi,
    Jim,thank you for your reply.I have a question.Is Vertica support for UPDATES in real time?If it is at max how many records it can update per second?To make faster updates in real time what architecture we need to follow? My case is like this
    I have 5 crore records(23 data points 21 columns are integers and 2 columns are varchars) and I made id as primary key.
    My current performance is 30 records updates/s
    I want to scale it upto 10k records/s.Can u suggest me in what ways configuration changes can give me that performance?
    I followed this url : https://my.vertica.com/docs/8.1.x/HTML/index.htm#Authoring/AnalyzingData/Optimizations/OptimizingDELETEsAndUPDATEsForPerformance.htm but it was not working for me.Now Iam using it for analytics and i also wanted to serve the data.
    My requirement to serve data is if we serve one record in one query it should not repeat in any query that means we need to change the status of the record (for this i need update). Here it is taking more time how can i face this challenge with vertica(instead of updates or with updates). That means data should not repeat(data should rotate).
    And it also shows affect on counts also.

Leave a Comment

BoldItalicStrikethroughOrdered listUnordered list
Emoji
Image
Align leftAlign centerAlign rightToggle HTML viewToggle full pageToggle lights
Drop image/file