Remove duplicates in Ruby Array

Question

I can very easily remove duplicates in an array by using .uniq, but how would I go about doing it without using the .uniq method?

Hey there! I'm currently learning Ruby/Web dev stuff, so I have some challenges I'm going through and this one is stumping me like crazy. — Dylan Bailey
– Dylan Bailey, Commented Dec 26, 2015 at 20:22

Cary Swoveland · Accepted Answer · 2015-12-27 08:46:41Z

5

a = [1, 1, 1, 2, 4, 3, 4, 3, 2, 5, 5, 6]

class Array
  def my_uniq
    self | []
  end
end

a.my_uniq
  #=> [1, 2, 4, 3, 5, 6]

This uses the method Array#|: "Set Union — Returns a new array by joining ary with other_ary, excluding any duplicates and preserving the order from the original array."

Here is a benchmark for the various answers, as well as Array#uniq.

require 'fruity'
require 'set'

def doit(n, m)
  arr = n.times.to_a
  arr = m.times.map { arr.sample }
  compare do
    uniq     { arr.uniq } 
    Schwern  { uniq = []; arr.sort.each { |e| uniq.push(e) if e != uniq[-1]; uniq } }
    Sharma   {b = []; arr.each{ |aa| b << aa unless b.include?(aa) }; b }
    Mihael   { arr.to_set.to_a }
    sawa     { arr.group_by(&:itself).keys }
    Cary     { arr | [] }
  end
end

doit(1_000, 500)
# Schwern is faster than uniq by 19.999999999999996% ± 10.0% (results differ)
# uniq is similar to Cary
# Cary is faster than Mihael by 10.000000000000009% ± 10.0%
# Mihael is similar to sawa
# sawa is faster than Sharma by 5x ± 0.1

doit(100_000, 50_000)
# Schwern is faster than uniq by 50.0% ± 10.0%               (results differ)
# uniq is similar to Cary
# Cary is similar to Mihael
# Mihael is faster than sawa by 10.000000000000009% ± 10.0%
# sawa is faster than Sharma by 310x ± 10.0

"Schwern" and "uniq" return arrays containing the same elements but not in the same order (hence "results differ").

Here is the additional benchmark requested by @Schern.

def doit1(n)
  arr = n.times.map { rand(n/10) }
  compare do
    uniq     { arr.uniq } 
    Schwern  { uniq = []; arr.sort.each { |e| uniq.push(e) if e != uniq[-1]; uniq } }
    Sharma   {b = []; arr.each{ |aa| b << aa unless b.include?(aa) }; b }
    Mihael   { arr.to_set.to_a }
    sawa     { arr.group_by(&:itself).keys }
    Cary     { arr | [] }
  end
end

doit1(1_000)
# Cary is similar to uniq
# uniq is faster than sawa by 3x ± 1.0
# sawa is similar to Schwern                     (results differ)
# Schwern is similar to Mihael                   (results differ)
# Mihael is faster than Sharma by 2x ± 0.1

doit1(50_000)
# Cary is similar to uniq
# uniq is faster than Schwern by 2x ± 1.0        (results differ)
# Schwern is similar to Mihael                   (results differ)
# Mihael is similar to sawa
# sawa is faster than Sharma by 62x ± 10.0

edited Dec 27, 2015 at 8:46

answered Dec 26, 2015 at 21:32

Cary Swoveland

111k6 gold badges69 silver badges105 bronze badges

Sign up to request clarification or add additional context in comments.

4 Comments

ndnenkov Over a year ago

Fancy as always. (: Still, I think OP's question was more in terms of how to implement it.

ndnenkov Over a year ago

I meant how to implement it in algorithmic terms, not how to replace with another Ruby sugar :) I might be wrong though.

Schwern Over a year ago

Very clever, and very obfuscated. Ultimately it's doing the same thing as Array#uniq. Array#| puts the first array in a hash, loops through the elements of the second array adding them if they're not already there, and then returns the keys of the hash. Since the second array is empty, this just puts self into a hash and returns the keys, just like Array#uniq.

Schwern Over a year ago

Turns out performance is very sensitive to the input. Your input has few duplicates. If there's a lot of duplicates it all changes. I don't know why. Try arr = m.times.map { rand(m/10) }.

Schwern · Accepted Answer · 2015-12-27 02:16:38Z

4

The code for most Ruby methods can be found in the ruby-doc.org API documentation. If you mouse over a method's documentation, a "click to toggle source" button appears. The code is in C, but it's very easy to understand.

if (RARRAY_LEN(ary) <= 1)
    return rb_ary_dup(ary);

if (rb_block_given_p()) {
    hash = ary_make_hash_by(ary);
    uniq = rb_hash_values(hash);
}
else {
    hash = ary_make_hash(ary);
    uniq = rb_hash_values(hash);
}

If there's one element, return it. Otherwise turn the elements into hash keys, turn the hash back into an array. By a documented quirk of Ruby hashes, "Hashes enumerate their values in the order that the corresponding keys were inserted", this technique preserves the original order of the elements in the Array. In other languages it may not.

Alternatively, use a Set. A set will never have duplicates. Loading set adds the method to_set to all Enumerable objects, which includes Arrays. However, a Set is usually implemented as a Hash so you're doing the same thing. If you want a unique array, and if you don't need the elements to be ordered, you should probably instead make a set and use that. unique = array.to_set

Alternatively, sort the Array and loop through it pushing each element onto a new Array. If the last element on the new Array matches the current element, discard it.

array = [2, 3, 4, 5, 1, 2, 4, 5];
uniq = []

# This copies the whole array and the duplicates, wasting
# memory.  And sort is O(nlogn).
array.sort.each { |e|
  uniq.push(e) if e != uniq[-1]
}

[1, 2, 3, 4, 5]
puts uniq.inspect

This method is to be avoided because it is slower and takes more memory than the other methods. The sort makes it slower. Sorting is O(nlogn) meaning as the array gets bigger sorting will get slower quicker than the array grows. It also requires you to copy the whole array, with duplicates, unless you want to alter the original data by sorting in place with sort!.

The other methods are O(n) speed and O(n) memory meaning they will scale linearly as the array gets bigger. And they don't have to copy the duplicates which can use substantially less memory.

edited Dec 27, 2015 at 2:16

answered Dec 26, 2015 at 20:28

Schwern

167k28 gold badges225 silver badges370 bronze badges

8 Comments

Dylan Bailey Over a year ago

Thank you for being so descriptive! Was incredibly helpful :)

Cary Swoveland Over a year ago

Array#uniq maintains the order of the elements that are not removed from the array. Your solution does not. That may or may not be a problem. The other solutions, incidentally, maintain the order of the remaining elements.

Schwern Over a year ago

@CarySwoveland Using a hash will preserve the original order, but only because of a documented quirk of Ruby hashes. Ruby hashes "enumerate their values in the order that the corresponding keys were inserted" and their Array#uniq implementation takes advantage of that. Using a Set will lose the order, but I noted that caveat. Sorting will also lose the original order, but that's a bad way to do it anyway.

Cary Swoveland Over a year ago

I meant to limit my remark to the method that first sorts the array (which is very fast, incidentally). I believe use of a set does preserve order, for the same reason that a hash does. For example, require 'set'; a = 0.upto(100_000).to_a; b = 50_000.times.map { a.sample }; b.to_set.to_a == b.uniq #=> true.

Schwern Over a year ago

@CarySwoveland I hadn't realized Ruby hashes preserved order until you commented. As for set order, it should not be relied on. Set#to_a documents "the order of elements is uncertain." As for the speed of the sort technique, it's O(nlogn). While it may edge out the hash implementation for small arrays, it rapidly gets worse as the array grows. At 10 elements Array#uniq and sort are the same speed. At 100 sort is already 3 times slower. At 1000 it's 6 times slower. Using a set is also slow. pastebin.com/2QaLc6Jn

|

Mihail Petkov · Accepted Answer · 2015-12-26 20:23:01Z

3

You can use #to_set Read more about it here

answered Dec 26, 2015 at 20:23

Mihail Petkov

1,54511 silver badges11 bronze badges

2 Comments

Cary Swoveland Over a year ago

Good suggestion, but why not post a complete answer?

Raman Over a year ago

Similar would be doing something like Set.new [1,2,3,3,3,4,4], This returns a Set of unique objects.

sawa · Accepted Answer · 2015-12-26 20:44:26Z

2

array.group_by(&:itself).keys

......................

answered Dec 26, 2015 at 20:44

sawa

169k51 gold badges287 silver badges398 bronze badges

2 Comments

Damiano Stoffie Over a year ago

How would you do it without using a hash?

Schwern Over a year ago

@DamianoStoffie I added a method that does not use a Hash to my answer, but it is slower and takes more memory.

Amit Sharma · Accepted Answer · 2015-12-28 07:49:28Z

1

You can also try this, check following example.

a = [1, 1, 1, 2, 4, 3, 4, 3, 2, 5, 5, 6]

b = []

a.each{ |aa| b << aa unless b.include?(aa) }

# when you check b you will get following result.

[1, 2, 4, 3, 5, 6]

alternatively you can also try following

a = [1, 1, 1, 2, 4, 3, 4, 3, 2, 5, 5, 6]

b = a & a

# OR

b = a | a

# both will return following result

[1, 2, 4, 3, 5, 6]

edited Dec 28, 2015 at 7:49

answered Dec 26, 2015 at 20:40

Amit Sharma

3,4772 gold badges17 silver badges21 bronze badges

3 Comments

Schwern Over a year ago

This is very slow, O(n^2), meaning if the array size doubles the run time goes up by 4. The problem is b.include?(aa). That has to potentially search all of b every time through the loop. A loop with an O(n) operation inside it is O(n^2). This is why it's better for b to be a Hash. Hash lookups are O(1) and they cannot have duplicates. Rule of thumb: any time you write array.include? ask yourself "can I use a Hash instead?"

Cary Swoveland Over a year ago

Yes, repeatedly invoking include? is indeed wasteful. You should ask yourself one of two questions. One is given by @Schwern. The other is: "Should I instead create a set and then convert it to an array?

Schwern Over a year ago

@CarySwoveland A Ruby set is an interface around a hash. It's probably faster to avoid the set interface overhead and just use the hash, but they'll both be O(n) solutions. If you're just going to throw away the set, I'd use a hash. If you're going to hold onto the unique array, I'd not bother with the array and instead make and use a set.

Collectives™ on Stack Overflow

Remove duplicates in Ruby Array

5 Answers 5

4 Comments

8 Comments

2 Comments

2 Comments

3 Comments

Your Answer

Hot Network Questions

Collectives™ on Stack Overflow

5 Answers 5

4 Comments

8 Comments

2 Comments

2 Comments

3 Comments

Your Answer

Sign up or log in

Post as a guest

Related