Local temporary table vs. physical table for huge data loads
Hi,
I have a scenario where I need a temporary staging area for my data. Data size is somewhere from 20 GB to 50 GB. The source of the data is another physical table joined with one or more other tables.
The options I see are (each sketched below):
1. Create a local temporary table (I don't need the data after the session ends)
2. Create a physical table and insert the records into it after truncating it every time
3. Use a WITH clause
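For concreteness, here is roughly what each option looks like (table and column names below are just placeholders for my real objects):

-- Option 1: local temp table, dropped automatically at session end
create local temp table stage (id int, val varchar(50)) on commit preserve rows;
insert into stage select s.id, s.val from src s join dim d on s.id = d.id;

-- Option 2: permanent table, truncated and reloaded each run
truncate table stage_perm;
insert into stage_perm select s.id, s.val from src s join dim d on s.id = d.id;

-- Option 3: WITH clause, no intermediate object at all
with stage as (select s.id, s.val from src s join dim d on s.id = d.id)
select * from stage;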
I want to understand how Vertica manages memory in each of these cases, and whether there is a preferred approach from i) a performance (time) and ii) a memory perspective.
One important consideration is avoiding sudden peaks in temp space usage.
Thanks
NJ
Comments
Assuming you have the same storage devices for both your temp and data spaces, I would go with option 2. By reusing the same physical table, you don't have the catalog-locking overhead of dropping and recreating tables, and truncation clears out all storage containers, so you don't have purge overhead either. You can also tune better for inserts and selects, since the projections persist with the table. Considering the size of your data loads, you can load the data directly to ROS with COPY ... DIRECT, or with a /*+ direct */ hint if you are doing an INSERT INTO.
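Something like this (the staging table and the source query are placeholders):

truncate table staging;           -- auto-commits and clears all storage containers
insert /*+ direct */ into staging -- DIRECT writes straight to ROS, bypassing the WOS
select s.id, s.val
from src s
join dim d on s.id = d.id;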
/Sajan
The reason I am using a staging area is that I can do a flat load into Vertica and then perform the complex joins and calculations from this middle zone.
So, regarding Eli's response, can you please help me understand what you mean by "having a cursor opened"?
Also, are you suggesting that using WITH is a better option? I am using Vertica 6.1.2 and don't think we will be moving to 7.1 anytime soon.
"having a cursor opened" - You mention that your "Source of the data is another physical table joined with one or more different tables" , Instead of run this query and put its results into temp and then read from team to do the processing , you can read directly from source and do the processing . In some case is more fast that creating temp .
Regarding WITH, I am not saying that WITH is a good option here (you have not shared your SQL). I only mention it because people coming from Oracle/SQL Server often rely on the fact that WITH can create a temp table on the fly which the rest of the query uses transparently. In Vertica, up until 7.1, the WITH section is simply rewritten into a normal join without creating any temp table.
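For example (hypothetical tables), in 6.1.2 a query like this:

with joined as (
    select s1.id, s1.value
    from source1 s1
    join source2 s2 on s1.id = s2.id
)
select count(*) from joined;

is executed exactly as if the join were written inline in the main query; no temp table is materialized.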
There is a file payments.dat that contains millions of records amounting to nearly 10+ GB or so (it can be up to 100 GB at times). The data in this file has to be processed and loaded into a Vertica table, so I load this data into a staging table first (payment_dat).
To finally load the target table, I need something like a MERGE that uses the join described above as its source. Now, Vertica does not let me use that subquery in the USING part, and thus I have to create an intermediate object; finding the best option for that intermediate object is what I was asking about.
As I mentioned, I can create a local temp table or another physical table.
Please let me know if I should be more detailed.
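For context, the initial load into the staging table is just a bulk COPY, something like this (the path and delimiter are placeholders for my real ones):

copy payment_dat from '/data/payments.dat' delimiter '|' direct;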
The term "subquery" here is misleading; you are just trying to define a kind of inline view. Anyway, I have attached an example. As I mentioned, you should use a view as a workaround; see the example below:
drop table if exists target;
create table target (id int, value varchar(20));
insert into target values(1,'value of 1');
insert into target values(2,'value of 2');
drop table if exists source1;
create table source1 (id int, value varchar(20));
insert into source1 values(1,'source1 of 1');
insert into source1 values(2,'source1 of 2');
insert into source1 values(3,'source1 of 3');
insert into source1 values(4,'source1 of 4');
drop table if exists source2;
create table source2 (id int, value varchar(20));
insert into source2 values(1,'source2 of 1');
insert into source2 values(2,'source2 of 2');
insert into source2 values(3,'source2 of 3');
insert into source2 values(4,'source2 of 4');
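-- MERGE with the join written as an inline subquery in USING fails: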
MERGE INTO target
using (select source1.value,source1.id from source1,source2
where source1.id=source2.id
) src
on ( src.id=target.id)
WHEN MATCHED THEN
UPDATE SET
value=src.value
WHEN NOT MATCHED THEN
INSERT VALUES (
src.id,
src.value);
ERROR 5665: Subquery in MERGE is not supported
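-- Workaround: wrap the join in a view and use the view as the MERGE source: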
Create or replace view myview as select source1.value,source1.id from source1,source2
where source1.id=source2.id;
MERGE INTO target
using myview src
on ( src.id=target.id)
WHEN MATCHED THEN
UPDATE SET
value=src.value
WHEN NOT MATCHED THEN
INSERT VALUES (
src.id,
src.value);
CREATE VIEW
dbadmin=> MERGE INTO target
dbadmin-> using myview src
dbadmin-> on ( src.id=target.id)
dbadmin-> WHEN MATCHED THEN
dbadmin-> UPDATE SET
dbadmin-> value=src.value
dbadmin-> WHEN NOT MATCHED THEN
dbadmin-> INSERT VALUES (
dbadmin(> src.id,
dbadmin(> src.value);
OUTPUT
--------
4
(1 row)
I've never measured it myself, but I've heard good things about the performance of MERGE.
Also, historically, temp tables tend to be faster than permanent tables, and they don't count against your license size. So I'd lean towards the WITH clause or the temp table if you're looking for pure performance.
Another option would be a table with a pinned projection; that's the fastest option. It creates the projection on only a single node.
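A sketch of such a projection (names are hypothetical; UNSEGMENTED NODE stores all of the projection's data on a single node):

create projection stage_super (id, val)
as select id, val from stage
unsegmented node v_mydb_node0001;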