
I have to read a CSV file every 20 seconds. Each file contains between 500 and 60,000 lines. I have to insert the data into a Postgres table, but before that I need to check whether the items have already been inserted, because there is a high probability of duplicates. The field checked for uniqueness is also indexed.

So I read the file in chunks and use an IN clause to find the items already in the database.

Is there a better way of doing it?
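For illustration, a minimal sketch of this chunked check, assuming a hypothetical table tbl with an indexed unique column item_id:

-- hypothetical sketch of the current approach, one CSV chunk at a time
SELECT item_id
FROM   tbl
WHERE  item_id IN (101, 102, 103); -- the IDs read from the current chunk
-- rows returned already exist; the application inserts only the rest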

  • Add proper unique indexes and just perform the inserts, ignoring duplicate errors. Commented Oct 6, 2012 at 3:26
  • Will that perform better on large data sets? Commented Oct 6, 2012 at 3:29
  • @meagar: You can't just "ignore duplicate errors" with set-based operations. The transaction will be rolled back and nothing gets done. It would only make sense if you INSERT rows one by one, which would be a terrible idea performance-wise. Commented Oct 6, 2012 at 3:47
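(Side note for later readers: PostgreSQL 9.5 added ON CONFLICT, which does make a set-based "insert and skip duplicates" possible in a single statement; assuming the staging table tmp from the answers below and a unique index on tbl_id, it would look like this:)

INSERT INTO tbl
SELECT * FROM tmp
ON CONFLICT (tbl_id) DO NOTHING; -- requires PostgreSQL 9.5+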

2 Answers


This should perform well:

CREATE TEMP TABLE tmp AS SELECT * FROM tbl LIMIT 0; -- copy layout, but no data

COPY tmp FROM '/absolute/path/to/file' (FORMAT csv);

INSERT INTO tbl
SELECT tmp.*
FROM   tmp
LEFT   JOIN tbl USING (tbl_id)
WHERE  tbl.tbl_id IS NULL;

DROP TABLE tmp; -- otherwise dropped automatically at end of session

Closely related to this answer.
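Note that COPY reads the file on the database server, so the path must be accessible to the server process. From a client, psql's \copy meta-command (e.g. \copy tmp FROM '/absolute/path/to/file' CSV) does the same thing but reads the file on the client side.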


3 Comments

@TheRaaaZ: Join on enough columns to make it unique. All columns if you have to. The USING or the NATURAL clause would come in handy for that. Normally, every table should have a primary key, though. (See the sketch after these comments.)
@ervin I have an array of columns; can you tell me what the query would be if the array has 5 columns, col1-col5?
@TheRaaaZ: Start a new question, giving a complete picture. You can always link to this one for reference.
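For illustration, a hedged sketch of the multi-column variant discussed in these comments, assuming hypothetical columns col1 through col5 that together identify a row (and are NOT NULL):

INSERT INTO tbl
SELECT tmp.*
FROM   tmp
LEFT   JOIN tbl USING (col1, col2, col3, col4, col5)
WHERE  tbl.col1 IS NULL; -- no match on all five columns: the row is new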

First, just for completeness, I changed Erwin's code to use EXCEPT:

CREATE TEMP TABLE tmp AS SELECT * FROM tbl LIMIT 0; -- copy layout, but no data
COPY tmp FROM '/absolute/path/to/file' (FORMAT csv);

INSERT INTO tbl
SELECT tmp.*
FROM   tmp
EXCEPT
SELECT *
FROM   tbl;

DROP TABLE tmp;
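Note that the two queries are not strictly equivalent: EXCEPT compares whole rows and also folds away rows duplicated within tmp itself, while the anti-join compares only tbl_id and keeps any duplicates that appear in the CSV.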

Then I resolved to test it myself. I tested in 9.1 with a mostly untouched postgresql.conf. The target table contains 10 million rows and the source table 30 thousand, of which 15 thousand already exist in the target table.

create table tbl (id integer primary key)
;
insert into tbl
select generate_series(1, 10000000)
;
create temp table tmp as select * from tbl limit 0
;
insert into tmp
select generate_series(9985000, 10015000)
;

I got the EXPLAIN output for the SELECT part only. The EXCEPT version:

explain
select *
from tmp
except
select *
from tbl
;
                                       QUERY PLAN                                       
----------------------------------------------------------------------------------------
 HashSetOp Except  (cost=0.00..270098.68 rows=200 width=4)
   ->  Append  (cost=0.00..245018.94 rows=10031897 width=4)
         ->  Subquery Scan on "*SELECT* 1"  (cost=0.00..771.40 rows=31920 width=4)
               ->  Seq Scan on tmp  (cost=0.00..452.20 rows=31920 width=4)
         ->  Subquery Scan on "*SELECT* 2"  (cost=0.00..244247.54 rows=9999977 width=4)
               ->  Seq Scan on tbl  (cost=0.00..144247.77 rows=9999977 width=4)
(6 rows)

The outer join version:

explain
select *
from 
    tmp
    left join
    tbl using (id)
where tbl.id is null
;
                                QUERY PLAN                                
--------------------------------------------------------------------------
 Nested Loop Anti Join  (cost=0.00..208142.58 rows=15960 width=4)
   ->  Seq Scan on tmp  (cost=0.00..452.20 rows=31920 width=4)
   ->  Index Scan using tbl_pkey on tbl  (cost=0.00..7.80 rows=1 width=4)
         Index Cond: (tmp.id = id)
(4 rows)
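The planner turns the outer-join version into an index-driven anti join at a lower estimated cost (208142 vs. 270098); the EXCEPT version has to seq-scan and hash all 10 million rows of the target table.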

