I have some CSV data I need to process, and I'm having trouble figuring out how to match the duplicates.
The data looks a bit like this:
    line  id   name  item_1  item_2  item_3  item_4
    1     251  john  foo     foo     foo     foo
    2     251  john  foo     bar     bar     bar
    3     251  john  foo     bar     baz     baz
    4     251  john  foo     bar     baz     pat
Lines 1-3 are duplicates in this case.
    line  id   name  item_1  item_2  item_3  item_4
    5     347  bill  foo     foo     foo     foo
    6     347  bill  foo     bar     bar     bar
In this case only line 5 is a duplicate.
    line  id   name  item_1  item_2  item_3  item_4
    7     251  mary  foo     foo     foo     foo
    8     251  mary  foo     bar     bar     bar
    9     251  mary  foo     bar     baz     baz
Here lines 7 and 8 are the duplicates.
So basically, if a line adds a new "item" compared with the previous line for the same person, the previous line is a duplicate. I want to end up with a single line for each person, regardless of how many items they have.
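To make the rule concrete, here is a minimal sketch on in-memory rows copied from the samples above. It assumes rows for the same person are already in the order shown, so the last row per person is the one to keep. The helper name `keep_last_per_person` is made up for illustration, and the key is `[id, name]` rather than `id` alone because john and mary share id 251.

```ruby
# Sample rows copied from the examples above: line, id, name, item_1..item_4.
rows = [
  %w[1 251 john foo foo foo foo],
  %w[2 251 john foo bar bar bar],
  %w[3 251 john foo bar baz baz],
  %w[4 251 john foo bar baz pat],
  %w[5 347 bill foo foo foo foo],
  %w[6 347 bill foo bar bar bar],
  %w[7 251 mary foo foo foo foo],
  %w[8 251 mary foo bar bar bar],
  %w[9 251 mary foo bar baz baz]
]

# Hypothetical helper (not part of any library): group rows on [id, name]
# and keep only the last row of each group. Assumes the input order shown
# above, where each later row for a person supersedes the earlier ones.
def keep_last_per_person(rows)
  rows.group_by { |row| [row[1], row[2]] }.values.map(&:last)
end

survivors = keep_last_per_person(rows)
survivors.each { |row| puts row.join(' ') }
```

With the sample data this prints lines 4, 6, and 9, which matches the duplicates listed above.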
I am using Ruby 1.9.3 like this:
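    require 'csv'

    puts "loading data"
    people = CSV.read('input-file.csv')

    CSV.open("output-file", "wb") do |csv|
      # write the first row (header) to the output file
      csv << people[0]
      # skip the header so it is not written a second time
      people.drop(1).each do |p|
        # ... logic to test for dupe ...
        # intent: write p only when it is not a duplicate
        # (Array has no #unique method, so p.unique would raise NoMethodError)
        csv << p
      end
    end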
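One possible way to fill in the dupe test, as a rough sketch only: remember the most recent row seen for each `[id, name]` pair and write those rows out at the end. This assumes the file is comma-separated, has a header row with columns called `id` and `name`, and lists each person's rows in the order shown in the samples. `dedupe_people` is a hypothetical helper name, and it works on CSV text rather than files so it can be tried without any input file.

```ruby
require 'csv'

# Hypothetical helper: takes CSV text with a header row and returns CSV text
# containing the header plus one row per [id, name] pair.
def dedupe_people(csv_text)
  header, *rows = CSV.parse(csv_text)
  id_col   = header.index('id')
  name_col = header.index('name')

  # Later rows overwrite earlier ones for the same key, so each person's
  # last row wins. Hash keeps keys in first-seen order (Ruby 1.9+).
  latest = {}
  rows.each { |row| latest[[row[id_col], row[name_col]]] = row }

  CSV.generate do |csv|
    csv << header
    latest.values.each { |row| csv << row }
  end
end

# Assumed comma-separated form of two of the sample groups.
input = <<CSV_TEXT
line,id,name,item_1,item_2,item_3,item_4
5,347,bill,foo,foo,foo,foo
6,347,bill,foo,bar,bar,bar
7,251,mary,foo,foo,foo,foo
8,251,mary,foo,bar,bar,bar
9,251,mary,foo,bar,baz,baz
CSV_TEXT

puts dedupe_people(input)
```

For real files, the same idea should carry over by passing `File.read('input-file.csv')` and writing the returned string to the output file. If the rows for a person are not guaranteed to be in order, the "last row wins" assumption no longer holds and a different comparison would be needed.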