3

I can very easily remove duplicates in an array by using .uniq, but how would I go about doing it without using the .uniq method?

4
  • Could you add the reason you want to know? Just curious? Commented Dec 26, 2015 at 20:21
  • 1
    Hey there! I'm currently learning Ruby/Web dev stuff, so I have some challenges I'm going through and this one is stumping me like crazy. Commented Dec 26, 2015 at 20:22
  • It doesn't remove uniques. It retains the uniques. Commented Dec 26, 2015 at 20:45
  • Why the rush to select an answer? Commented Dec 26, 2015 at 21:34

5 Answers 5

5
a = [1, 1, 1, 2, 4, 3, 4, 3, 2, 5, 5, 6]

class Array
  def my_uniq
    self | []
  end
end

a.my_uniq
  #=> [1, 2, 4, 3, 5, 6]

This uses the method Array#|: "Set Union — Returns a new array by joining ary with other_ary, excluding any duplicates and preserving the order from the original array."

Here is a benchmark for the various answers, as well as Array#uniq.

require 'fruity'
require 'set'

def doit(n, m)
  arr = n.times.to_a
  arr = m.times.map { arr.sample }
  compare do
    uniq     { arr.uniq } 
    Schwern  { uniq = []; arr.sort.each { |e| uniq.push(e) if e != uniq[-1]; uniq } }
    Sharma   {b = []; arr.each{ |aa| b << aa unless b.include?(aa) }; b }
    Mihael   { arr.to_set.to_a }
    sawa     { arr.group_by(&:itself).keys }
    Cary     { arr | [] }
  end
end

doit(1_000, 500)
# Schwern is faster than uniq by 19.999999999999996% ± 10.0% (results differ)
# uniq is similar to Cary
# Cary is faster than Mihael by 10.000000000000009% ± 10.0%
# Mihael is similar to sawa
# sawa is faster than Sharma by 5x ± 0.1

doit(100_000, 50_000)
# Schwern is faster than uniq by 50.0% ± 10.0%               (results differ)
# uniq is similar to Cary
# Cary is similar to Mihael
# Mihael is faster than sawa by 10.000000000000009% ± 10.0%
# sawa is faster than Sharma by 310x ± 10.0

"Schwern" and "uniq" return arrays containing the same elements but not in the same order (hence "results differ").

Here is the additional benchmark requested by @Schern.

def doit1(n)
  arr = n.times.map { rand(n/10) }
  compare do
    uniq     { arr.uniq } 
    Schwern  { uniq = []; arr.sort.each { |e| uniq.push(e) if e != uniq[-1]; uniq } }
    Sharma   {b = []; arr.each{ |aa| b << aa unless b.include?(aa) }; b }
    Mihael   { arr.to_set.to_a }
    sawa     { arr.group_by(&:itself).keys }
    Cary     { arr | [] }
  end
end

doit1(1_000)
# Cary is similar to uniq
# uniq is faster than sawa by 3x ± 1.0
# sawa is similar to Schwern                     (results differ)
# Schwern is similar to Mihael                   (results differ)
# Mihael is faster than Sharma by 2x ± 0.1

doit1(50_000)
# Cary is similar to uniq
# uniq is faster than Schwern by 2x ± 1.0        (results differ)
# Schwern is similar to Mihael                   (results differ)
# Mihael is similar to sawa
# sawa is faster than Sharma by 62x ± 10.0
Sign up to request clarification or add additional context in comments.

4 Comments

Fancy as always. (: Still, I think OP's question was more in terms of how to implement it.
I meant how to implement it in algorithmic terms, not how to replace with another Ruby sugar :) I might be wrong though.
Very clever, and very obfuscated. Ultimately it's doing the same thing as Array#uniq. Array#| puts the first array in a hash, loops through the elements of the second array adding them if they're not already there, and then returns the keys of the hash. Since the second array is empty, this just puts self into a hash and returns the keys, just like Array#uniq.
Turns out performance is very sensitive to the input. Your input has few duplicates. If there's a lot of duplicates it all changes. I don't know why. Try arr = m.times.map { rand(m/10) }.
4

The code for most Ruby methods can be found in the ruby-doc.org API documentation. If you mouse over a method's documentation, a "click to toggle source" button appears. The code is in C, but it's very easy to understand.

if (RARRAY_LEN(ary) <= 1)
    return rb_ary_dup(ary);

if (rb_block_given_p()) {
    hash = ary_make_hash_by(ary);
    uniq = rb_hash_values(hash);
}
else {
    hash = ary_make_hash(ary);
    uniq = rb_hash_values(hash);
}

If there's one element, return it. Otherwise turn the elements into hash keys, turn the hash back into an array. By a documented quirk of Ruby hashes, "Hashes enumerate their values in the order that the corresponding keys were inserted", this technique preserves the original order of the elements in the Array. In other languages it may not.

Alternatively, use a Set. A set will never have duplicates. Loading set adds the method to_set to all Enumerable objects, which includes Arrays. However, a Set is usually implemented as a Hash so you're doing the same thing. If you want a unique array, and if you don't need the elements to be ordered, you should probably instead make a set and use that. unique = array.to_set

Alternatively, sort the Array and loop through it pushing each element onto a new Array. If the last element on the new Array matches the current element, discard it.

array = [2, 3, 4, 5, 1, 2, 4, 5];
uniq = []

# This copies the whole array and the duplicates, wasting
# memory.  And sort is O(nlogn).
array.sort.each { |e|
  uniq.push(e) if e != uniq[-1]
}

[1, 2, 3, 4, 5]
puts uniq.inspect

This method is to be avoided because it is slower and takes more memory than the other methods. The sort makes it slower. Sorting is O(nlogn) meaning as the array gets bigger sorting will get slower quicker than the array grows. It also requires you to copy the whole array, with duplicates, unless you want to alter the original data by sorting in place with sort!.

The other methods are O(n) speed and O(n) memory meaning they will scale linearly as the array gets bigger. And they don't have to copy the duplicates which can use substantially less memory.

8 Comments

Thank you for being so descriptive! Was incredibly helpful :)
Array#uniq maintains the order of the elements that are not removed from the array. Your solution does not. That may or may not be a problem. The other solutions, incidentally, maintain the order of the remaining elements.
@CarySwoveland Using a hash will preserve the original order, but only because of a documented quirk of Ruby hashes. Ruby hashes "enumerate their values in the order that the corresponding keys were inserted" and their Array#uniq implementation takes advantage of that. Using a Set will lose the order, but I noted that caveat. Sorting will also lose the original order, but that's a bad way to do it anyway.
I meant to limit my remark to the method that first sorts the array (which is very fast, incidentally). I believe use of a set does preserve order, for the same reason that a hash does. For example, require 'set'; a = 0.upto(100_000).to_a; b = 50_000.times.map { a.sample }; b.to_set.to_a == b.uniq #=> true.
@CarySwoveland I hadn't realized Ruby hashes preserved order until you commented. As for set order, it should not be relied on. Set#to_a documents "the order of elements is uncertain." As for the speed of the sort technique, it's O(nlogn). While it may edge out the hash implementation for small arrays, it rapidly gets worse as the array grows. At 10 elements Array#uniq and sort are the same speed. At 100 sort is already 3 times slower. At 1000 it's 6 times slower. Using a set is also slow. pastebin.com/2QaLc6Jn
|
3

You can use #to_set Read more about it here

2 Comments

Good suggestion, but why not post a complete answer?
Similar would be doing something like Set.new [1,2,3,3,3,4,4], This returns a Set of unique objects.
2
array.group_by(&:itself).keys

......................

2 Comments

How would you do it without using a hash?
@DamianoStoffie I added a method that does not use a Hash to my answer, but it is slower and takes more memory.
1

You can also try this, check following example.

a = [1, 1, 1, 2, 4, 3, 4, 3, 2, 5, 5, 6]

b = []

a.each{ |aa| b << aa unless b.include?(aa) }

# when you check b you will get following result.

[1, 2, 4, 3, 5, 6]

alternatively you can also try following

a = [1, 1, 1, 2, 4, 3, 4, 3, 2, 5, 5, 6]

b = a & a

# OR

b = a | a

# both will return following result

[1, 2, 4, 3, 5, 6]

3 Comments

This is very slow, O(n^2), meaning if the array size doubles the run time goes up by 4. The problem is b.include?(aa). That has to potentially search all of b every time through the loop. A loop with an O(n) operation inside it is O(n^2). This is why it's better for b to be a Hash. Hash lookups are O(1) and they cannot have duplicates. Rule of thumb: any time you write array.include? ask yourself "can I use a Hash instead?"
Yes, repeatedly invoking include? is indeed wasteful. You should ask yourself one of two questions. One is given by @Schwern. The other is: "Should I instead create a set and then convert it to an array?
@CarySwoveland A Ruby set is an interface around a hash. It's probably faster to avoid the set interface overhead and just use the hash, but they'll both be O(n) solutions. If you're just going to throw away the set, I'd use a hash. If you're going to hold onto the unique array, I'd not bother with the array and instead make and use a set.

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.