
I have a database table with two columns:

author_id, message

And entries like:

123, "message!"
123, "message!"
123, "different message"
124, "message!"

I want to do a query that allows me to select either:

123, "message!"

or

124, "message!"

Essentially, entries where the message is the same, but the author_id is different.

I then want to delete one of these entries (it doesn't matter which one, just that I can select only one of them).

This question gets me close, but it's for duplicates across two columns.

4 Comments

  • What if different authors have multiple common messages (for example, both author_id 123 and 124 have "message2")? What is the desired result then?
  • @OtoShavadze The same: just select one of them. If the same author has two duplicates and a second author has one, any of the three works.
  • Is there a primary key for this table? And if the solution happens to select a 123, 'message!' row to delete, should it delete all of those rows?
  • @pozs There is indeed a primary key. It should delete all of them except for one.

3 Answers

Here is one more alternative example:

-- Test table
CREATE TABLE dummy_data (
    author_id   int,
    message     text
);

-- Test data
INSERT INTO dummy_data ( author_id, message )
VALUES
( 123, '"message!"' ),
( 123, '"message!"' ),
( 123, '"different message"' ),
( 124, '"message!"' ),
( 124, '"message!"' ),
( 125, '"message!"' );

-- Delete query
DELETE FROM dummy_data
WHERE   ctid NOT IN (
            SELECT  max( ctid )
            FROM    dummy_data
            GROUP BY message     -- this is important to specify
        )
 -- RETURNING is just for testing, to show the deleted records;
 -- you may omit it if you don't need them
RETURNING *;

-- Confirming result:
SELECT * FROM dummy_data ;
 author_id |       message
-----------+---------------------
       123 | "different message"
       125 | "message!"
(2 rows)

See more about system columns: https://www.postgresql.org/docs/current/static/ddl-system-columns.html
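
If you are curious what the delete keys on, here is a quick sketch (not part of the original answer) that exposes the hidden ctid column on the test table above:

-- ctid is PostgreSQL's physical row locator (block number, tuple index);
-- every table has it implicitly, but it must be named to appear in output
SELECT ctid, author_id, message FROM dummy_data;

Keep in mind that ctid identifies a row's current physical location, so it can change after an UPDATE or VACUUM FULL; it is fine for a one-off cleanup like this, but not as a durable row identifier.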

EDIT:
An additional example, as requested in the comments, limiting the range by IDs (author_id).

Pure query:

DELETE FROM dummy_data
USING   ( SELECT ARRAY[ 123, 124] ) v(id)
WHERE   author_id = ANY ( v.id )
AND     ctid NOT IN (
            SELECT  max( ctid )
            FROM    dummy_data
            WHERE   author_id = ANY ( v.id )
            GROUP BY message
        );

Same query with comments:

DELETE FROM dummy_data
-- Add your 'author_id' values into the array here.
-- The reason we list them with a USING clause is
-- that we need to compare the values in two places,
-- and if the list were big it would be a nuisance to
-- write it twice :)
USING   ( SELECT ARRAY[ 123, 124] ) v(id)
-- First we get all the authors in the batch by ID
WHERE   author_id = ANY ( v.id )
-- Then we keep only the row with the max ctid per
-- message within this batch of authors, deleting the rest
AND     ctid NOT IN (
            SELECT  max( ctid )
            FROM    dummy_data
            WHERE   author_id = ANY ( v.id )
            GROUP BY message
        );

-- This will delete the following rows:
 author_id |  message
-----------+------------
       123 | "message!"
       123 | "message!"
       124 | "message!"
(3 rows)

-- Leaving the table in this state:
 author_id |       message
-----------+---------------------
       123 | "different message"
       124 | "message!"
       125 | "message!"
(3 rows)
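
As the comments below note, each batch only sees its own authors. For instance, a sketch continuing from the table state above: running the same delete with ARRAY[125, 126] removes nothing, because within that batch 125's "message!" is unique, even though 124 still holds a duplicate of it.

-- Same delete, next batch; 124's duplicate of "message!" survives
-- because 124 is outside this batch's scope
DELETE FROM dummy_data
USING   ( SELECT ARRAY[ 125, 126 ] ) v(id)
WHERE   author_id = ANY ( v.id )
AND     ctid NOT IN (
            SELECT  max( ctid )
            FROM    dummy_data
            WHERE   author_id = ANY ( v.id )
            GROUP BY message
        );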

4 Comments

This is great, but also a bit slow; I'm doing this on a table with about 100 million rows. It would be great to be able to scope it as well: delete the duplicates, but only within a specific subset, say authors in the array [123,124]. How could you modify this query to handle that?
But if you do it by author_ids, then in the case of [123,124] a 124 row will remain. And if you then feed [125,126], that's a new batch that doesn't know anything about the last one, meaning 124's "message!" will remain even though it duplicates 125's "message!". Is that OK for you? If yes, I can easily edit the example :)
Yes, that's fine. Essentially I have authors assigned together in groups, and I want to make sure there are no duplicate messages within those groups. Does that make sense?
Added additional example. See the EDIT section :)

You can use array_agg() for this, e.g.:

select author_id, message
from (
    select message, array_agg(distinct author_id) ids
    from my_table
    group by message
    ) s
cross join unnest(ids) author_id
where cardinality(ids) > 1
order by author_id;

 author_id | message  
-----------+----------
       123 | message!
       124 | message!
(2 rows)

If you want to get a single row per duplicated message, the query can be simpler:

select min(author_id) as author_id, message
from my_table
group by message
having count(distinct author_id) > 1;

 author_id | message  
-----------+----------
       123 | message!
(1 row)
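
If you then want to delete rather than just select, here is a minimal sketch building on the query above (assuming the table has a primary key column named id; the question's comments confirm a primary key exists but do not name it):

-- id is an assumed primary key. For each message posted by more
-- than one distinct author, delete exactly one of its rows
-- (the one with the smallest id), keeping the others
DELETE FROM my_table
WHERE id IN (
    SELECT min(id)
    FROM my_table
    GROUP BY message
    HAVING count(distinct author_id) > 1
);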

1 Comment

For your second option, I really like it, it's very simple. Would it be possible to select the id column as well? If I add it to select, I also have to group it, and then the query no longer works properly.
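
One way to include the id column without grouping by it (a sketch, not from the original answer, again assuming a primary key named id) is DISTINCT ON:

-- Keep one full row (including the assumed id primary key) per
-- duplicated message; DISTINCT ON returns the first row of each
-- message group in ORDER BY order
SELECT DISTINCT ON (message) id, author_id, message
FROM my_table
WHERE message IN (
    SELECT message
    FROM my_table
    GROUP BY message
    HAVING count(distinct author_id) > 1
)
ORDER BY message, author_id;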

If I correctly understand, you need something like this:

with the_table (author_id, message) as (
    select 123, '"message!"' union all
    select 123, '"message!"' union all
    select 123, '"aaa!"' union all
    select 123, '"different message"' union all
    select 124, '"aaa!"' union all
    select 124, '"message!"'  union all
    select 125, '"aaa!"' union all
    select 125, '"rrrr!"'  
)


select the_table.* from the_table
join ( 
    select message from the_table
    group by message
    having count(distinct author_id) = (select count(distinct author_id) from the_table)
) t
on the_table.message = t.message
order by random() limit 1

This randomly picks one row whose message is common to all author_ids.

