
I have some CSV data I need to process, and I'm having trouble figuring out a way to match the duplicates.

The data looks a bit like this:

line    id    name   item_1    item_2    item_3    item_4
1      251   john    foo       foo       foo       foo
2      251   john    foo       bar       bar       bar
3      251   john    foo       bar       baz       baz
4      251   john    foo       bar       baz       pat

Lines 1-3 are duplicates in this case.

line    id    name   item_1    item_2    item_3    item_4
5      347   bill    foo       foo       foo       foo
6      347   bill    foo       bar       bar       bar

In this case only line 5 is a duplicate.

line    id    name   item_1    item_2    item_3    item_4
7      251   mary    foo       foo       foo       foo
8      251   mary    foo       bar       bar       bar
9      251   mary    foo       bar       baz       baz

Here lines 7 and 8 are the duplicates.

So basically, whenever a line adds a new "item", the previous line is a duplicate. I want to end up with a single line for each person, regardless of how many items they have.

I am using Ruby 1.9.3 like this:

require 'csv'
puts "loading data"
people = CSV.read('input-file.csv')

CSV.open("output-file", "wb") do |csv|
  # write the first row (header) to the output file
  csv << people[0]
  people.each do |p|
    ... logic to test for dupe ...
    csv << p.unique
  end
end
  • Can you clarify what you mean by "duplicate"? I'm not sure that's the correct word here: a duplicate typically means an exact copy, so a single line can't be a duplicate on its own; rather, one line would be a duplicate of another. From your example you're not talking about duplicate lines, so it has to do with the data, but it's not obvious what you mean. Commented Mar 7, 2012 at 13:29
  • yeah, I suppose unique-ified or condensed or something :) sorry for the confusion Commented Mar 7, 2012 at 13:38
  • Are you just trying to find the unique list of people, or are you looking to have a list of items associated with them? What are the rules to determine whether the list of items causes a line to be a duplicate? Would your results depend on the order of the lines in the file? Commented Mar 7, 2012 at 14:07
  • I'm still confused. How are lines 6 and 9 unique when they each contain duplicate items? Commented Mar 7, 2012 at 22:40

3 Answers


First, there's a slight bug with your code. Instead of:

csv << people[0]

You would need to do the following if you don't want to change your loop code (otherwise the header row is still in people and gets written a second time by the loop):

csv << people.shift
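
A quick irb illustration of the difference (my own example, not from the original answer):

people = [["header"], ["row1"]]
people[0]    # => ["header"] -- the array is unchanged, so the loop writes the header again
people.shift # => ["header"] -- the header is removed, so the loop sees only data rows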

Now, the following solution will add only the first occurrence of a person, discarding any subsequent duplicates as determined by id (as I am assuming ids are unique).

require 'csv'
puts "loading data"
people = CSV.read('input-file.csv')
ids = [] # or you could use a Set

CSV.open("output-file", "wb") do |csv|
  #write the first row (header) to the output file
  csv << people.shift
  people.each do |p|
    # If the id of the current record is in the ids array, we've already seen
    # this person
    next if ids.include?(p[0])

    # Add the new id to the front of the ids array. Since in your example the
    # duplicate records directly follow the original, this is slightly faster
    # than appending to the end, but the include? check above still scans the
    # whole array to be safe
    ids.unshift p[0]
    csv << p
  end
end
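
As the comment in the code hints, you could also track the ids in a Set, which makes the membership test constant-time instead of a scan of the whole array. A minimal sketch of the same loop, assuming the same input file:

require 'csv'
require 'set'

puts "loading data"
people = CSV.read('input-file.csv')
ids = Set.new

CSV.open("output-file", "wb") do |csv|
  # write the first row (header) to the output file
  csv << people.shift
  people.each do |p|
    # Set#add? returns nil when the id is already present, so this records
    # the id and tests for it in a single call
    csv << p if ids.add?(p[0])
  end
end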

Note that there is a more performant solution if your duplicate records always directly follow the original: you would only need to keep the last original id and compare the current record's id against it, rather than checking inclusion in an entire array. The difference may be negligible if your input file doesn't contain many records.

That would look like this:

require 'csv'
puts "loading data"
people = CSV.read('input-file.csv')
previous_id = nil

CSV.open("output-file", "wb") do |csv|
  #write the first row (header) to the output file
  csv << people.shift
  people.each do |p|
    next if p[0] == previous_id
    previous_id = p[0]
    csv << p
  end
end

1 Comment

This is most like what I finally ended up implementing. Thanks to all for the suggestions.

It sounds like you're trying to get a list of unique items associated with each person, where a person is identified by an id and a name. If that's right, you can do something like this:

peoplehash = {}
maxitems = 0

# skip the header row; the first column is the line number, which we ignore
people.drop(1).each do |_line, id, name, *items|
  (peoplehash[[id, name]] ||= []).concat(items)
end

peoplehash.each_key do |k|
  peoplehash[k].uniq!
  peoplehash[k].sort!
  maxitems = [maxitems, peoplehash[k].size].max
end

This'll give you a structure like:

{
    [251, "john"] => ["bar", "baz", "foo", "pat"],
    [251, "mary"] => ["bar", "baz", "foo"],
    [347, "bill"] => ["bar", "foo"]
}

and a maxitems that tells you how long the longest items array is, which you can then use for whatever you need.
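
For example (my own sketch, not part of the original answer), you could use maxitems to write one fixed-width row per person:

require 'csv'

CSV.open("output-file.csv", "wb") do |csv|
  # header wide enough for the person with the most items
  csv << ['id', 'name'] + (1..maxitems).map { |i| "item_#{i}" }
  peoplehash.each do |(id, name), items|
    # pad shorter item lists with nil so every row has the same width
    csv << [id, name] + items + [nil] * (maxitems - items.size)
  end
end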



You can use 'uniq':

irb(main):009:0> row = ['ruby', 'rails', 'gem', 'ruby']
irb(main):010:0> row.uniq
=> ["ruby", "rails", "gem"]

or

row.uniq!
=> ["ruby", "rails", "gem"]

irb(main):017:0> row
=> ["ruby", "rails", "gem"]

irb(main):018:0> row = [1, 251, 'john', 'foo', 'foo', 'foo', 'foo']
=> [1, 251, "john", "foo", "foo", "foo", "foo"]
irb(main):019:0> row.uniq
=> [1, 251, "john", "foo"]

1 Comment

That applies to Array. I wish CSV had a method like that.
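
Since CSV.read returns a plain Array of row Arrays, Array#uniq can be applied across the whole file too (a sketch of my own; note it only drops rows that are exact copies, which isn't quite the rule in the question):

require 'csv'

rows = CSV.read('input-file.csv')
CSV.open('output-file.csv', 'wb') do |csv|
  # uniq compares whole rows, so only exact duplicate lines are removed
  rows.uniq.each { |row| csv << row }
end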
