132

What is the best way to find records with duplicate values across multiple columns using PostgreSQL and ActiveRecord?

I found this solution here:

User.find(:all, :group => [:first, :email], :having => "count(*) > 1" )

But it doesn't seem to work with Postgres. I'm getting this error:

PG::GroupingError: ERROR: column "parts.id" must appear in the GROUP BY clause or be used in an aggregate function

    In regular SQL, I'd use a self-join, something like select a.id, b.id, name, email FROM user a INNER JOIN user b USING (name, email) WHERE a.id > b.id. No idea how to express that in ActiveRecord-speak. Commented Feb 10, 2014 at 4:48

8 Answers

297

Tested & Working Version

User.select(:first,:email).group(:first,:email).having("count(*) > 1")

Also, this is a little unrelated but handy. If you want to see how many times each combination was found, put .size at the end:

User.select(:first,:email).group(:first,:email).having("count(*) > 1").size

and you'll get a result set back that looks like this:

{[nil, nil]=>512,
 ["Joe", "[email protected]"]=>23,
 ["Jim", "[email protected]"]=>36,
 ["John", "[email protected]"]=>21}

Thought that was pretty cool and hadn't seen it before.

Credit to Taryn, this is just a tweaked version of her answer.
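
To make the shape of that .size result concrete, here is a plain-Ruby sketch of what the grouped count computes, run on an in-memory array instead of the database. The Person struct and the sample rows are made up for illustration; in a real app PostgreSQL does this work through GROUP BY and HAVING.

```ruby
# Hypothetical stand-in for a row of the users table; not an ActiveRecord model.
Person = Struct.new(:id, :first, :email)

# Made-up sample data.
people = [
  Person.new(1, "Joe", "joe@example.com"),
  Person.new(2, "Joe", "joe@example.com"),
  Person.new(3, "Jim", "jim@example.com"),
  Person.new(4, "Joe", "other@example.com"),
  Person.new(5, "Jim", "jim@example.com"),
  Person.new(6, "Jim", "jim@example.com")
]

# GROUP BY first, email  -> group_by on the [first, email] pair
# COUNT(*)               -> size of each group
# HAVING count(*) > 1    -> keep only groups with more than one row
duplicate_counts = people
  .group_by { |person| [person.first, person.email] }
  .transform_values(&:size)
  .select { |_pair, count| count > 1 }

p duplicate_counts
```

The keys are the [first, email] pairs and the values are the row counts, which is the same structure the grouped relation returns from .size.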


10 Comments

I had to pass an explicit array to select(), as in: User.select([:first,:email]).group(:first,:email).having("count(*) > 1").count, for this to work.
adding the .count gives PG::UndefinedFunction: ERROR: function count
You can try User.select([:first,:email]).group(:first,:email).having("count(*) > 1").map.count
I'm trying the same method but trying to get the User.id as well, adding it to the select and group returns an empty array. How can I return the whole User model, or at least include the :id?
Use .size instead of .count.
44

That error occurs because PostgreSQL requires every column in the SELECT list to appear in the GROUP BY clause or be used in an aggregate function.

Try:

User.select(:first,:email).group(:first,:email).having("count(*) > 1").all

(note: not tested, you may need to tweak it)

EDITED to remove id column

2 Comments

That's not going to work; the id column is not part of the group, so you cannot refer to it unless you aggregate it (e.g. array_agg(id) or json_agg(id)).
Just to add to the comment: the above would become User.select("array_agg(id) as ids").select(:first,:email).group(:first,:email).having("count(*) > 1").
20

If you need the full models, try the following (based on @newUserNameHere's answer).

User.where(email: User.select(:email).group(:email).having("count(*) > 1").select(:email))

This will return the rows where the email address of the row is not unique.

I'm not aware of a way to do this over multiple attributes.
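
One possible approach for several columns in PostgreSQL is a row-value IN subquery passed to where as a SQL string. The sketch below only builds that string; duplicate_condition_sql is a hypothetical helper, not an ActiveRecord API, and it interpolates the table and column names as-is, so it assumes they come from trusted code rather than user input.

```ruby
# Hypothetical helper (not part of ActiveRecord): builds a PostgreSQL
# row-value condition that matches rows whose column combination occurs
# more than once. Names are interpolated unquoted, so pass trusted names only.
def duplicate_condition_sql(table, columns)
  raise ArgumentError, "columns must not be empty" if columns.empty?

  cols = columns.map(&:to_s).join(", ")
  "(#{cols}) IN (SELECT #{cols} FROM #{table} " \
    "GROUP BY #{cols} HAVING COUNT(*) > 1)"
end

sql = duplicate_condition_sql(:users, [:first, :email])
puts sql

# Assumed usage inside a Rails app with a User model:
#   User.where(sql)
```

Note that rows with NULL in any of the compared columns will not match, because a row comparison involving NULL does not evaluate to true.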

3 Comments

User.where(email: User.select(:email).group(:email).having("count(*) > 1"))
Thank you, that works great :) Also, it seems like the last .select(:email) is redundant. I think this is a little cleaner, but I could be wrong: User.where(email: User.select(:email).group(:email).having("count(*) > 1"))
perfect! this solution finds ActiveRecord instances. just what I was looking for
8

Get all duplicates with a single query if you use PostgreSQL:

def duplicated_users
  duplicated_ids = User
    .group(:first, :email)
    .having("COUNT(*) > 1")
    .select('unnest((array_agg("id"))[2:])')

  User.where(id: duplicated_ids)
end

irb> duplicated_users
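
The unnest((array_agg("id"))[2:]) expression is the dense part: array_agg collects the ids of each group, the [2:] slice drops the first element (PostgreSQL arrays are 1-based), and unnest flattens what remains into rows. A plain-Ruby sketch of the same steps, using made-up id groups:

```ruby
# Made-up id groups, standing in for array_agg("id") per (first, email) group.
id_groups = [[3, 7, 9], [12, 15], [20]]

# HAVING COUNT(*) > 1 -> drop groups with a single row
# [2:]                -> drop the first id of each group
# unnest              -> flatten the remaining ids into one list
extra_ids = id_groups
  .select { |ids| ids.size > 1 }
  .flat_map { |ids| ids.drop(1) }

p extra_ids
```

So duplicated_users returns every duplicate except one row per group. Which row is spared depends on the order in which array_agg sees the ids, which is not specified unless the aggregate has its own ORDER BY.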


3

I struggled to get proper User models returned via the accepted answer. Here's how:

User
  .group(:first, :email)
  .having("COUNT(*) > 1")
  .select('array_agg("id") as ids')
  .map(&:ids)
  .map { |ids| User.find(ids) } # one query per group instead of one per id

This will return proper models you can interact with as:

[
  [User#1, User#2],
  [User#35, User#59],
]


0

Works well in raw SQL:

# select array_agg(id) from attendances group by event_id, user_id having count(*) > 1;
   array_agg   
---------------
 {3712,3711}
 {8762,8763}
 {7421,7420}
 {13478,13477}
 {15494,15493}
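
If that statement is run from Ruby through a raw connection call, each array may come back as a text literal such as {3712,3711} rather than a Ruby array, depending on the adapter and how the result is cast (an assumption worth checking in your setup). A minimal sketch of turning such a literal into integers, with a hypothetical helper name:

```ruby
# Hypothetical helper: parses a one-dimensional PostgreSQL integer array
# literal such as "{3712,3711}". It does not handle NULLs, quoted elements,
# or nested arrays.
def parse_int_array(literal)
  inner = literal.strip.delete_prefix("{").delete_suffix("}")
  inner.split(",").map { |element| Integer(element.strip) }
end

rows = ["{3712,3711}", "{8762,8763}"]
id_groups = rows.map { |row| parse_int_array(row) }
p id_groups
```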


0

Building on @itsnikolay's answer above, but as a method that you can pass any ActiveRecord scope to:

# Pass in a scope and a list of columns to group by.
# Call map(&:dupe_ids) on the result to see your list.
def duplicate_row_ids(ar_scope, attrs)
  ar_scope
    .group(attrs)
    .having("COUNT(*) > 1")
    .select('array_agg("id") as dupe_ids')      
end

# Initial scope to narrow where you want to look for dupes
ar_scope = ProductReviews.where(product_id: "194e676b-741e-4143-a0ce-10cf268290bb", status: "Rejected")
# Pass the scope and the list of columns to group by
results = duplicate_row_ids(ar_scope, [:nickname, :overall_rating, :source, :product_id, :headline, :status])
# Get your list of id groups
id_pairs = results.map(&:dupe_ids)
# Each entry is an array of ids;
# go through the groups and take action


-1

Based on the answer above by @newUserNameHere, I believe the right way to show the count for each combination is:

res = User.select('first, email, count(1)').group(:first,:email).having('count(1) > 1')

res.each {|r| puts r.attributes } ; nil

