
I have a database table with two columns:

author_id, message

And entries like:

123, "message!"
123, "message!"
123, "different message"
124, "message!"

I want to do a query that allows me to select either:

123, "message!"

or

124, "message!"

Essentially, entries where the message is the same, but the author_id is different.

I then want to delete one of these entries (it doesn't matter which one, just that I can select only one of them).

This question gets me close, but it's for duplicates across two columns.

4 Comments

  • What if different authors have multiple common messages (for example, both author_id 123 and 124 have "message2")? What is the desired result then?
  • @OtoShavadze The same: just select one of them. If the same author has two duplicates and a second author has one, any of the three works.
  • Is there a primary key for this table? And if the solution happens to select a 123, 'message!' row to delete, should it delete all of those rows?
  • @pozs There is indeed a primary key. It should delete all of them except for one.

3 Answers

Here is one more alternative example:

-- Test table
CREATE TABLE dummy_data (
    author_id   int,
    message     text
);

-- Test data
INSERT INTO dummy_data ( author_id, message )
VALUES
( 123, '"message!"' ),
( 123, '"message!"' ),
( 123, '"different message"' ),
( 124, '"message!"' ),
( 124, '"message!"' ),
( 125, '"message!"' );

-- Delete query
DELETE FROM dummy_data
WHERE   ctid NOT IN (
            SELECT  max( ctid )
            FROM    dummy_data
            GROUP BY message     -- this is important to specify
        )
 -- RETURNING is just for testing, to show the deleted records;
 -- you may omit it if you don't need them
RETURNING *;

-- Confirming result:
SELECT * FROM dummy_data ;
 author_id |       message
-----------+---------------------
       123 | "different message"
       125 | "message!"
(2 rows)

See more about system columns: https://www.postgresql.org/docs/current/static/ddl-system-columns.html
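
If you are curious what the delete keys on, here is a quick sketch (not part of the original answer) that exposes the hidden ctid column on the test table above:

-- ctid is PostgreSQL's physical row locator (block number, tuple index);
-- every table has it implicitly, but it must be named to appear in output
SELECT ctid, author_id, message FROM dummy_data;

Keep in mind that ctid identifies a row's current physical location, so it can change after an UPDATE or VACUUM FULL; it is fine for a one-off cleanup like this, but not as a durable row identifier.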

EDIT:
An additional example, as requested in the comments, limiting the range by IDs (author_id).

Pure query:

DELETE FROM dummy_data
USING   ( SELECT ARRAY[ 123, 124] ) v(id)
WHERE   author_id = ANY ( v.id )
AND     ctid NOT IN (
            SELECT  max( ctid )
            FROM    dummy_data
            WHERE   author_id = ANY ( v.id )
            GROUP BY message
        );

Same query with comments:

DELETE FROM dummy_data
-- Add your 'author_id' values into the array here.
-- The reason we list them with a USING clause is
-- that we need to compare the values in two places,
-- and if the list were big it would be a nuisance to
-- write it twice :)
USING   ( SELECT ARRAY[ 123, 124] ) v(id)
-- First we get all the authors in the batch by ID
WHERE   author_id = ANY ( v.id )
-- Then we keep only the row with the max ctid per
-- message within this batch of authors, deleting the rest
AND     ctid NOT IN (
            SELECT  max( ctid )
            FROM    dummy_data
            WHERE   author_id = ANY ( v.id )
            GROUP BY message
        );

-- This will delete the following rows:
 author_id |  message
-----------+------------
       123 | "message!"
       123 | "message!"
       124 | "message!"
(3 rows)

-- Leaving the table in this state:
 author_id |       message
-----------+---------------------
       123 | "different message"
       124 | "message!"
       125 | "message!"
(3 rows)
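
As the comments below note, each batch only sees its own authors. For instance, a sketch continuing from the table state above: running the same delete with ARRAY[125, 126] removes nothing, because within that batch 125's "message!" is unique, even though 124 still holds a duplicate of it.

-- Same delete, next batch; 124's duplicate of "message!" survives
-- because 124 is outside this batch's scope
DELETE FROM dummy_data
USING   ( SELECT ARRAY[ 125, 126 ] ) v(id)
WHERE   author_id = ANY ( v.id )
AND     ctid NOT IN (
            SELECT  max( ctid )
            FROM    dummy_data
            WHERE   author_id = ANY ( v.id )
            GROUP BY message
        );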

4 Comments

This is great, but also a bit slow; I'm doing this on a table with about 100 million rows. It would be great to be able to scope it as well: delete the duplicates, but only within a specific subset, say authors in the array [123,124]. How could you modify this query to handle that?
But if you do it by author_ids, then in the case of [123,124] a 124 row will remain. And if you then feed [125,126], that's a new batch that doesn't know anything about the last one, meaning 124's "message!" will remain even though it duplicates 125's "message!". Is that OK for you? If yes, I can easily edit the example :)
Yes, that's fine. Essentially I have authors assigned together in groups, and I want to make sure there are no duplicate messages within those groups. Does that make sense?
Added additional example. See the EDIT section :)

You can use array_agg() for this, e.g.:

select author_id, message
from (
    select message, array_agg(distinct author_id) ids
    from my_table
    group by message
    ) s
cross join unnest(ids) author_id
where cardinality(ids) > 1
order by author_id;

 author_id | message  
-----------+----------
       123 | message!
       124 | message!
(2 rows)

If you want to get a single row per duplicated message, the query can be simpler:

select min(author_id) as author_id, message
from my_table
group by message
having count(distinct author_id) > 1;

 author_id | message  
-----------+----------
       123 | message!
(1 row)
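
If you then want to delete rather than just select, here is a minimal sketch building on the query above (assuming the table has a primary key column named id; the question's comments confirm a primary key exists but do not name it):

-- id is an assumed primary key. For each message posted by more
-- than one distinct author, delete exactly one of its rows
-- (the one with the smallest id), keeping the others
DELETE FROM my_table
WHERE id IN (
    SELECT min(id)
    FROM my_table
    GROUP BY message
    HAVING count(distinct author_id) > 1
);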

1 Comment

For your second option, I really like it, it's very simple. Would it be possible to select the id column as well? If I add it to select, I also have to group it, and then the query no longer works properly.
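
One way to include the id column without grouping by it (a sketch, not from the original answer, again assuming a primary key named id) is DISTINCT ON:

-- Keep one full row (including the assumed id primary key) per
-- duplicated message; DISTINCT ON returns the first row of each
-- message group in ORDER BY order
SELECT DISTINCT ON (message) id, author_id, message
FROM my_table
WHERE message IN (
    SELECT message
    FROM my_table
    GROUP BY message
    HAVING count(distinct author_id) > 1
)
ORDER BY message, author_id;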

If I correctly understand, you need something like this:

with the_table (author_id, message) as (
    select 123, '"message!"' union all
    select 123, '"message!"' union all
    select 123, '"aaa!"' union all
    select 123, '"different message"' union all
    select 124, '"aaa!"' union all
    select 124, '"message!"'  union all
    select 125, '"aaa!"' union all
    select 125, '"rrrr!"'  
)


select the_table.* from the_table
join ( 
    select message from the_table
    group by message
    having count(distinct author_id) = (select count(distinct author_id) from the_table)
) t
on the_table.message = t.message
order by random() limit 1

This randomly picks one row whose message is common to all author_ids.

