How can I import a CSV with duplicate data when I have set a primary key during table creation?

I have set a primary key on a Vertica table and tried to copy data from a CSV into it, but the copy fails because the primary key field has duplicate ids. How can I skip those rows and insert all the other data into my table? I was doing it like this:
testing=> create table vertical(id int primary key enabled,name varchar);
CREATE TABLE
testing=> copy vertical from '/home/naresh/Desktop/result.csv' parser fcsvparser();
ERROR 6745: Duplicate key values: 'id=3' -- violates constraint 'public.vertical.C_PRIMARY'
testing=>

My CSV looks like this:
1 abc
2 abcd
3 cba
4 adbc
5 bcd
6 rgukt
7 adbcc
3 erthgf

How can I insert this CSV without the duplicates, and is it possible to store the duplicated data in another file?

Answers

  • Jim_Knicely - Administrator

    One option is to disable the PK, COPY the data, DELETE the duplicates, and then re-enable the PK.

    Example:

    dbadmin=> \! cat /home/dbadmin/result.csv
    1|abc
    2|abcd
    3|cba
    4|adbc
    5|bcd
    6|rgukt
    7|adbcc
    3|erthgf
    
    dbadmin=> copy vertical from '/home/dbadmin/result.csv';
    ERROR 6745:  Duplicate key values: 'id=3' -- violates constraint 'public.vertical.vertical_pk'
    
    dbadmin=> alter table vertical alter constraint vertical_pk disabled;
    ALTER TABLE
    
    dbadmin=> copy vertical from '/home/dbadmin/result.csv';
     Rows Loaded
    -------------
               8
    (1 row)
    
    dbadmin=> select * from vertical;
     id | stuff
    ----+--------
      1 | abc
      2 | abcd
      3 | cba
      3 | erthgf
      4 | adbc
      5 | bcd
      6 | rgukt
      7 | adbcc
    (8 rows)
    
    dbadmin=> delete from vertical where id || stuff not in (select id || stuff from vertical limit 1 over (partition by id order by id));
     OUTPUT
    --------
          1
    (1 row)
    
    dbadmin=> select * from vertical;
     id | stuff
    ----+-------
      1 | abc
      2 | abcd
      3 | cba
      4 | adbc
      5 | bcd
      6 | rgukt
      7 | adbcc
    (7 rows)
    
    dbadmin=> alter table vertical alter constraint vertical_pk enabled;
    ALTER TABLE
    
  • Jim_Knicely - Administrator

    Or, you could create a temporary table, load the data there, then insert only the distinct records into your real table (and, as sketched at the end of this example, spool the duplicate rows out to a separate file)...

    Example:

    dbadmin=> create table vertical_temp like vertical;
    CREATE TABLE
    
    dbadmin=> copy vertical_temp from '/home/dbadmin/result.csv';
     Rows Loaded
    -------------
               8
    (1 row)
    
    dbadmin=> select * from vertical_temp;
     id | stuff
    ----+--------
      1 | abc
      2 | abcd
      3 | cba
      3 | erthgf
      4 | adbc
      5 | bcd
      6 | rgukt
      7 | adbcc
    (8 rows)
    
    dbadmin=> insert into vertical (select * from vertical_temp limit 1 over (partition by id order by id));
     OUTPUT
    --------
          7
    (1 row)
    
    dbadmin=> select * from vertical;
     id | stuff
    ----+-------
      1 | abc
      2 | abcd
      3 | cba
      4 | adbc
      5 | bcd
      6 | rgukt
      7 | adbcc
    (7 rows)
    
    dbadmin=> drop table vertical_temp;
    DROP TABLE
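
    If you also need the duplicated rows saved to a separate file (the second part of the original question), one option is to pull them out of the temporary table before dropping it and spool the result with vsql. This is just a sketch that reuses the LIMIT 1 OVER technique from the examples above; the output path /home/dbadmin/duplicates.txt and the use of the \o meta-command to redirect query output to a file are assumptions, not something shown elsewhere in this thread.

    dbadmin=> \o /home/dbadmin/duplicates.txt
    dbadmin=> select * from vertical_temp where id || stuff not in (select id || stuff from vertical_temp limit 1 over (partition by id order by id));
    dbadmin=> \o

    After the second \o, output goes back to the terminal, and the file contains only the rows that were skipped during the INSERT (here, the second row with id=3).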
    
  • Hi,
    Yesterday I got this error, which I'm unable to solve. Please help me resolve this issue. Thanks in advance.

    events.js:163
    throw er; // Unhandled 'error' event
    ^
    Error: Unavailable: initiator locks for query - Locking failure: Timed out X locking Table:public.test2. X held by [user user_testing (select count(*) from test2;)]. Your current transaction isolation level is READ COMMITTED

  • Hi,
    Jim_Knicely, I have users with 600 data points each, and 5 crore users. How can I store that data in Vertica? What is the best way to store it so I can get counts in a shorter time? Can you suggest something?

  • Sorry Jim, I have a requirement like this: I have to store arrays in a Vertica table (not in a flex table). If it cannot store them, what is the best way to store that type of data? My array length lies between 200 and 1000, and the cardinality lies between 10,000 and 100,000. There is one array per user, and the number of users is 10 crore. Please tell me a solution for this; updates should also be fast. Does this database support all these requirements?

  • Jim_Knicely - Administrator

    You could create a wide table where you'd have a column for each data point. Then, so that you can get super fast counts, you can create a Live Aggregate Projection. (A rough sketch of what the wide table DDL could look like follows the example.)

    Example (Your table would have all 600 data points):

    dbadmin=> select * from wide_table;
     some_unique_key | data_point1 | data_point2 | data_point3 | data_point600
    -----------------+-------------+-------------+-------------+---------------
                   1 |           1 |             |           1 |             3
                   2 |           1 |           1 |             |             3
                   2 |           1 |           1 |           1 |             3
    (3 rows)
    
    dbadmin=> create projection wide_table_lap as select some_unique_key, count(data_point1) data_point1_cnt, count(data_point2) data_point2_cnt, count(data_point3) data_point3_cnt, count(data_point600) data_point600_cnt from wide_table group by some_unique_key;
    WARNING 6852:  Live Aggregate Projection "wide_table_lap" will be created for "wide_table". Data in "wide_table" will be neither updated nor deleted
    WARNING 4468:  Projection <public.wide_table_lap> is not available for query processing. Execute the select start_refresh() function to copy data into this projection.
              The projection must have a sufficient number of buddy projections and all nodes must be up before starting a refresh
    CREATE PROJECTION
    
    dbadmin=> select refresh('wide_table');
                                                                                                                                                     refresh               
    ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
     Refresh completed with the following outcomes:
    Projection Name: [Anchor Table] [Status] [Refresh Method] [Error Count] [Duration (sec)]
    ----------------------------------------------------------------------------------------
    "public"."wide_table_lap": [wide_table] [refreshed] [scratch] [0] [0]
    
    (1 row)
    
    dbadmin=> select some_unique_key, count(data_point1) data_point1_cnt from wide_table group by some_unique_key;
     some_unique_key | data_point1_cnt
    -----------------+-----------------
                   1 |               1
                   2 |               2
    (2 rows)
    
    dbadmin=> explain select some_unique_key, count(data_point1) data_point1_cnt from wide_table group by some_unique_key;
                                                                                                                                      QUERY PLAN                           
    ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
     ------------------------------
     QUERY PLAN DESCRIPTION:
     ------------------------------
    
     explain select some_unique_key, count(data_point1) data_point1_cnt from wide_table group by some_unique_key;
    
     Access Path:
     +-GROUPBY PIPELINED [Cost: 2, Rows: 2 (NO STATISTICS)] (PATH ID: 1)
     |  Aggregates: count(wide_table_lap.data_point1_cnt)
     |  Group By: wide_table_lap.some_unique_key
     | +---> STORAGE ACCESS for public.wide_table_lap (Rewritten LAP) [Cost: 1, Rows: 2 (NO STATISTICS)] (PATH ID: 2)
     | |      Projection: public.wide_table_lap
     | |      Materialize: wide_table_lap.some_unique_key, wide_table_lap.data_point1_cnt
    
    
     ------------------------------
     -----------------------------------------------
     PLAN: BASE QUERY PLAN (GraphViz Format)
     -----------------------------------------------
     digraph G {
     graph [rankdir=BT, label = "BASE QUERY PLAN\nQuery: explain select some_unique_key, count(data_point1) data_point1_cnt from wide_table group by some_unique_key;\n\nAll Nodes Vector: \n\n  node[0]=v_sfdc_node0001 (initiator) Up\n", labelloc=t, labeljust=l ordering=out]
     0[label = "Root \nOutBlk=[UncTuple(2)]", color = "green", shape = "house"];
     1[label = "NewEENode \nOutBlk=[UncTuple(2)]", color = "green", shape = "box"];
     2[label = "GroupByPipe: 1 keys\nAggs:\n  count(wide_table_lap.data_point1_cnt)\nUnc: Integer(8)\nUnc: Integer(8)", color = "green", shape = "box"];
     3[label = "StorageMergeStep: wide_table_lap; 1 sorted\nUnc: Integer(8)\nUnc: Integer(8)", color = "purple", shape = "box"];
     4[label = "ScanStep: wide_table_lap\nsome_unique_key\ndata_point1_cnt\nUnc: Integer(8)\nUnc: Integer(8)", color = "brown", shape = "box"];
     1->0 [label = "V[0] C=2",color = "black",style="bold", arrowtail="inv"];
     2->1 [label = "0",color = "blue"];
     3->2 [label = "0",color = "blue"];
     4->3 [label = "0",color = "blue"];
     }
    (31 rows)
    

    See:
    https://my.vertica.com/docs/9.0.x/HTML/index.htm#Authoring/AnalyzingData/AggregatedData/LiveAggregateProjections.htm
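
    For reference, a rough sketch of what the wide_table DDL could look like (an assumption based only on the columns shown in the example; a real table would list all 600 data point columns and pick an ORDER BY / segmentation that fits the workload):

    create table wide_table (
        some_unique_key int,
        data_point1     int,
        data_point2     int,
        data_point3     int,
        -- ... data_point4 through data_point599 omitted ...
        data_point600   int
    )
    order by some_unique_key
    segmented by hash(some_unique_key) all nodes;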

  • Hi Jim. Thank you for your reply. I have a question regarding the performance of updates in real time. My current performance on a 4-core server (4 GB RAM, single node) with 5 crore records (23 columns: 21 int and 2 varchar) is 20 record updates/s.

    How can I scale it up to 10K updates/s?
    Please suggest the configuration or architecture that would let me reach that. I followed this URL: https://my.vertica.com/docs/8.1.x/HTML/index.htm#Authoring/AnalyzingData/Optimizations/OptimizingDELETEsAndUPDATEsForPerformance.htm but it is not giving me the result I want. Can you also tell me what scale I can expect from the configuration I described above?

  • Hi,
    Jim, thank you for your reply. I have a question: does Vertica support UPDATEs in real time? If so, at most how many records can it update per second? What architecture do we need to follow to make updates faster in real time? My case is like this:
    I have 5 crore records (23 data points: 21 columns are integers and 2 columns are varchars) and I made id the primary key.
    My current performance is 30 record updates/s.
    I want to scale it up to 10k records/s. Can you suggest what configuration changes could give me that performance?
    I followed this URL: https://my.vertica.com/docs/8.1.x/HTML/index.htm#Authoring/AnalyzingData/Optimizations/OptimizingDELETEsAndUPDATEsForPerformance.htm but it did not work for me. Right now I use Vertica for analytics and I also want to serve the data.
    My requirement for serving data is that once a record has been served by one query it should not be returned by any other query, which means I need to change the status of the record (and for that I need UPDATE). This is taking too long; how can I handle this challenge in Vertica (with or without updates)? In other words, data should not repeat (it should rotate).
    This also affects the counts.
