5

How do I delete duplicate rows in a Postgres 9 table, where the rows are complete duplicates on every field AND there is no individual field that could be used as a unique key, so I can't just GROUP BY columns and use a NOT IN statement?

I'm looking for a single SQL statement, not a solution that requires me to create a temporary table and insert records into it. I know how to do that, but it requires more work to fit into my automated process.

Table definition:

jthinksearch=> \d releases_labels;
Unlogged table "discogs.releases_labels"
   Column   |  Type   | Modifiers
------------+---------+-----------
 label      | text    |
 release_id | integer |
 catno      | text    |
Indexes:
    "releases_labels_catno_idx" btree (catno)
    "releases_labels_name_idx" btree (label)
Foreign-key constraints:
    "foreign_did" FOREIGN KEY (release_id) REFERENCES release(id)

Sample data:

jthinksearch=> select * from releases_labels  where release_id=6155;
    label     | release_id |   catno
--------------+------------+------------
 Warp Records |       6155 | WAP 39 CDR
 Warp Records |       6155 | WAP 39 CDR
10
  • stackoverflow.com/a/15926983 or stackoverflow.com/questions/1746213/… Commented Apr 2, 2015 at 9:17
  • 1
    And I hope that you now have learned not to allow duplicated data. Commented Apr 2, 2015 at 9:21
  • 3
    @jarlh I'm working with data provided by someone else; I am well aware that having duplicate data is a bad idea. Commented Apr 2, 2015 at 9:24
  • 1
    Where does that data come from? Is it in another database, in an export file...? What prevents you from using a temporary table? Commented Apr 2, 2015 at 9:32
  • Currently the data is loaded from an XML file, but row by row. I plan to change it so it writes to a CSV file and then uses dbcopy, but writing that code is not a 5 minute task. I can use temp tables, but I was hoping there was a single SQL statement I could use. As it is, I tried the approach below and ended up messing everything up, because the INSERT failed but that didn't prevent the next statement from running and dropping the table. Commented Apr 2, 2015 at 9:53

5 Answers

11

If you can afford to rewrite the whole table, this is probably the simplest approach:

WITH Deleted AS (
  DELETE FROM discogs.releases_labels
  RETURNING *
)
INSERT INTO discogs.releases_labels
SELECT DISTINCT * FROM Deleted

If you need to specifically target the duplicated records, you can make use of the internal ctid field, which uniquely identifies a row:

DELETE FROM discogs.releases_labels
WHERE ctid NOT IN (
  SELECT MIN(ctid)
  FROM discogs.releases_labels
  GROUP BY label, release_id, catno
)

Be very careful with ctid; it changes over time. But you can rely on it staying the same within the scope of a single statement.
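Before either statement touches data, it is worth knowing how many duplicates exist. This check is my own sketch, not part of the answer; it compares the total row count against the count of distinct (label, release_id, catno) tuples, and the DELETE is only needed if the two numbers differ:

```sql
-- Sanity check before deleting: total rows vs. distinct natural-key tuples.
-- If the counts are equal, the table has no duplicates to remove.
SELECT count(*)                                    AS total_rows,
       count(DISTINCT (label, release_id, catno)) AS distinct_rows
FROM   discogs.releases_labels;
```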


4 Comments

@Nick I like the second query because I understand it, nice straightforward SQL; I just didn't know about the ctid field in Postgres. I don't really understand the first solution, in particular (DELETE FROM discogs.releases_labels RETURNING *)
@Paul: The RETURNING clause causes the DELETE to return the deleted rows (in this case, the entire contents of the table). The WITH construct defines a CTE - kind of like a temp table, but local to this statement. So the statement empties the table, copies the contents into the CTE called Deleted, strips out the duplicates with a SELECT DISTINCT, and inserts whatever's left back into release_labels.
@Nick Barnes So instead of DELETE FROM discogs.releases_labels RETURNING *, could that be replaced with SELECT * from discogs.release_labels to the same effect?
Okay, marked your answer correct, as one of two good answers that actually answer the question
6

Single SQL statement

Here is a solution that deletes duplicates in place:

DELETE FROM releases_labels r
WHERE  EXISTS (
   SELECT 1
   FROM   releases_labels r1
   WHERE  r1 = r
   AND    r1.ctid < r.ctid
   );

Since there is no unique key, I am (ab)using the tuple ID ctid for the purpose. The physically first row survives in each set of dupes.

ctid is a system column that is not part of the associated row type, so when referencing the whole row with table aliases in the expression r1 = r, only visible columns are compared (not the ctid or others). That's why the whole row can be equal and one ctid is still smaller than the other.

With only few duplicates, this is also the fastest of all solutions.
With lots of duplicates, other solutions are faster.

Then I suggest:

ALTER TABLE discogs.releases_labels
ADD COLUMN releases_labels_id serial PRIMARY KEY;
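A quick way to confirm the deduplication worked, before layering constraints on top, is a grouped leftover check. This is my own sketch, not part of the answer; it should return zero rows once the DELETE has run:

```sql
-- Any natural-key group with more than one remaining row is a leftover duplicate.
SELECT label, release_id, catno, count(*) AS copies
FROM   discogs.releases_labels
GROUP  BY label, release_id, catno
HAVING count(*) > 1;
```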

Why does it work with NULL values?

The reason is explained in the chapter Composite Type Comparison in the manual:

The SQL specification requires row-wise comparison to return NULL if the result depends on comparing two NULL values or a NULL and a non-NULL. PostgreSQL does this only when comparing the results of two row constructors (as in Section 9.24.5) or comparing a row constructor to the output of a subquery (as in Section 9.23). In other contexts where two composite-type values are compared, two NULL field values are considered equal, and a NULL is considered larger than a non-NULL. This is necessary in order to have consistent sorting and indexing behavior for composite types.

Bold emphasis mine.
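The quoted behavior is easy to reproduce in psql; this is an illustrative sketch, not from the answer:

```sql
-- Row constructor comparison follows the SQL spec: comparing NULLs yields NULL.
SELECT ROW(1, NULL) = ROW(1, NULL);   -- yields NULL, not true

-- Composite values in sorting/indexing contexts treat NULLs as equal,
-- which is why DISTINCT collapses these two records into one:
SELECT DISTINCT v FROM (VALUES (ROW(1, NULL)), (ROW(1, NULL))) AS t(v);
```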

3 Comments

Rather surprisingly, Postgres appears to consider NULLs to be equal in the context of a record comparison - replacing the subquery's WHERE condition with (r.*) = (r2.*) AND r.ctid < r2.ctid seems to give the desired result even with NULLs present
And in any case, I think you should be able to cover the NULL case with IS NOT DISTINCT FROM comparisons, though it's a fairly revolting piece of syntax if you ask me...
@NickBarnes: Surprising, indeed. I investigated and added an explanation.
0

You can try something like this:

CREATE TABLE discogs.temp AS
SELECT DISTINCT * FROM discogs.releases_labels;
DROP TABLE discogs.releases_labels;
ALTER TABLE discogs.temp RENAME TO releases_labels;
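If you take this route, wrapping the whole swap in a transaction means a failure at any step rolls everything back instead of leaving the table dropped. This is a hedged sketch of the same idea (the table name releases_labels_dedup is my own placeholder); note that CREATE TABLE ... AS copies neither indexes nor foreign keys, so those must be recreated afterwards:

```sql
BEGIN;

-- Build the de-duplicated copy first; if this fails, the original is untouched.
CREATE TABLE discogs.releases_labels_dedup AS
SELECT DISTINCT * FROM discogs.releases_labels;

-- Swap the tables. A failure here aborts the transaction and rolls back the DROP.
DROP TABLE discogs.releases_labels;
ALTER TABLE discogs.releases_labels_dedup RENAME TO releases_labels;

COMMIT;
```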

5 Comments

Hi, I was hoping for a single SQL solution, but I wasn't aware of the RENAME function; having that makes the INSERT INTO a new table a lot simpler. However, I think your syntax is wrong, it should be CREATE TABLE temp AS SELECT .... Unfortunately, because of this mistake I've now dropped my discogs.releases_labels table and it is now empty. Because of the way the data is currently loaded, one record at a time (which I want to change), it is going to take me 5 hours to get the data back. I think this sort of problem is why I was reluctant to use the temp tables option :(
@PaulTaylor: The above solution should be wrapped into a transaction with BEGIN; ... COMMIT; so your accident could not happen - at least the last two commands.
@ErwinBrandstetter yes, I realize that now, but I just took the solution as is, assuming it was correct. That is why I would have preferred a one line solution, then there would be no problems with transactions
@Erwin There is another problem with this approach that I don't understand. I wanted to remove duplicates so that I could put on an index, but I just noticed when I try to do ALTER TABLE ONLY releases_labels ADD CONSTRAINT foreign_did FOREIGN KEY (release_id) REFERENCES release(id); it gives 'constraints on permanent tables may reference only permanent tables' - why is this - what is non-permanent about this table?
@PaulTaylor: You get this message when one of your tables is a temporary table. Be aware that temp tables hide plain tables of the same name while they exist. This answer is using the key word temp as an identifier, which you shouldn't - too close to confusing syntax errors.
0

As you have no primary key, there is no easy way to distinguish one duplicated line from any other one. That's one of the reasons why it is highly recommended that any table have a primary key (*).

So you are left with only 2 solutions:

  • use a temporary table as suggested by Rahul (IMHO the simpler and cleaner way) (**)
  • use procedural SQL and a cursor, either from a procedural language such as Python or [put here your prefered language], or with PL/pgSQL. Something like (beware, untested):

    CREATE OR REPLACE FUNCTION deduplicate() RETURNS integer AS $$
    DECLARE
     curs CURSOR FOR SELECT * FROM releases_labels ORDER BY label, release_id, catno;
     r   releases_labels%ROWTYPE;
     old releases_labels%ROWTYPE;
     n   integer;
    BEGIN
     n := 0;
     old := NULL;
     FOR rec IN curs LOOP
      r := rec;
      IF r = old THEN
       -- identical to the previous row in sort order: a duplicate
       DELETE FROM releases_labels WHERE CURRENT OF curs;
       n := n + 1;
      END IF;
      old := r;
     END LOOP;
     RETURN n;
    END;
    $$ LANGUAGE plpgsql;
    
    SELECT deduplicate();
    
    

    should delete duplicate lines and return the number of lines actually deleted. It is not necessarily the most efficient way, but you only touch rows that need to be deleted, so you will not have to lock the whole table.

(*) Fortunately, PostgreSQL offers the ctid pseudo-column that you can use as a key. If your table contains an oid column, you can also use that, as it will never change.

(**) PostgreSQL WITH allows you to do that in a single SQL statement

These two points come from Nick Barnes' answer

2 Comments

"You will not be able to do this with a simple SQL statement" not true: see Nick's answer.
@a_horse_with_no_name : I looked at that question and erroneously thought that ctid was a primary key. I should have looked better at your answer :-)
0

Since you also need to avoid duplicates in the future, you could add a surrogate key and a unique constraint while dedupping:


-- add surrogate key
ALTER TABLE releases_labels
        ADD column id SERIAL NOT NULL PRIMARY KEY
        ;

-- verify
SELECT * FROM releases_labels;

DELETE FROM releases_labels dd
WHERE EXISTS (SELECT *
        FROM releases_labels x
        WHERE x.label = dd.label
        AND x.release_id = dd.release_id
        AND x.catno = dd.catno
        AND x.id < dd.id
        );

-- verify
SELECT * FROM releases_labels;

-- add unique constraint for the natural key
ALTER TABLE releases_labels
        ADD UNIQUE (label,release_id,catno)
        ;

-- verify
SELECT * FROM releases_labels;

3 Comments

Well, this is almost a duplicate (@!!) of @ErwinBrandstetter 's answer. Should I delete it, or leave it here for reference?
Creating the serial column first is a variant that makes sense. So I would leave it.
Ok, I'll leave it then. I had to check if the serial is populated at creation. It is. good...
