52

I am trying to parse a CSV file generated from an Excel spreadsheet.

Here is my code

require 'csv'
file = File.open("input_file")
csv = CSV.parse(file)

But I get this error

ArgumentError: invalid byte sequence in UTF-8

I think the error occurs because Excel encodes the file as ISO 8859-1 (Latin-1) rather than UTF-8.

Can someone help me with a workaround for this issue, please?

Thanks in advance.

  • The best solution is to have Excel encode the file in UTF-8. Commented Dec 5, 2011 at 3:06
  • In case you need to support various encodings and detect them at import time, Charlock Holmes worked great for me. See stackoverflow.com/a/12234195/1343535 Commented Feb 5, 2018 at 17:35

7 Answers

74

You need to tell Ruby that the file is in ISO-8859-1. Change your file open line to this:

file = File.open("input_file", "r:ISO-8859-1")

The second argument tells Ruby to open the file read-only and to interpret its contents as ISO-8859-1.
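As a minimal sketch of the same idea, the snippet below writes a small Latin-1 sample to a temporary file and reads it back. The sample data and the variable names (sample, rows) are made up for illustration. The mode string here also adds a second encoding, "r:ISO-8859-1:UTF-8", which asks Ruby to transcode to UTF-8 while reading; that part is an extension of the answer, not something it states.

```ruby
require 'csv'
require 'tempfile'

# Hypothetical sample data: "caf\xE9" is "café" in ISO-8859-1.
# The lone byte 0xE9 is not valid UTF-8, which is what triggers the error.
sample = Tempfile.new(['latin1_sample', '.csv'])
sample.binmode
sample.write("name,city\ncaf\xE9,Z\xFCrich\n".b)
sample.close

# "r:ISO-8859-1:UTF-8" = read-only, external encoding Latin-1,
# transcoded to UTF-8 as the data is read.
rows = File.open(sample.path, 'r:ISO-8859-1:UTF-8') do |file|
  CSV.parse(file.read)
end

puts rows.inspect
```

Using the block form of File.open also closes the file automatically once parsing is done.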

4 Comments

This was giving me fits too, and your solution is working so far for me! Thanks!
Worked like a champ. I was doing an iconv -f ISO-8859-1 -t utf-8 oldfilename > newfilename before I found this answer.
@jnunn: So glad I could help! Ruby encodings are hairy things, and not that easy to deal with.
If you're here and using the 'roo' gem, the docs at github.com/roo-rb/roo#csv-support say you can pass the encoding through csv_options: s = Roo::CSV.new("mycsv.csv", csv_options: {encoding: Encoding::ISO_8859_1})
36

Specify the encoding with the encoding option. The value 'iso-8859-1:utf-8' means the file is read as ISO-8859-1 and transcoded to UTF-8:

CSV.foreach(file.path, headers: true, encoding: 'iso-8859-1:utf-8') do |row|
  # ...
end
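A self-contained sketch of this approach might look like the following. The temporary file, its contents, and the names array are assumptions added so the example can run on its own; only the CSV.foreach call mirrors the answer.

```ruby
require 'csv'
require 'tempfile'

# Hypothetical Latin-1 input with a header row.
sample = Tempfile.new(['latin1_headers', '.csv'])
sample.binmode
sample.write("name,city\nJos\xE9,M\xE1laga\n".b)
sample.close

names = []
# External encoding ISO-8859-1, internal encoding UTF-8:
# each row arrives already transcoded.
CSV.foreach(sample.path, headers: true, encoding: 'iso-8859-1:utf-8') do |row|
  names << row['name']
end

puts names.inspect
```

Because headers: true is set, each row is a CSV::Row and fields can be looked up by header name.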

1 Comment

I prefer declaring keyword arguments explicitly. +1 for encoding:
11

You can supply the source encoding directly in the file mode parameter:

CSV.foreach("file.csv", "r:windows-1250") do |row|
  # your code
end

1 Comment

This worked in Ruby 2.1.5, but you have to use encoding: 'iso-8859-1' instead of "r:windows-1250".
1

If you have only one file (or a few), so there is no need to detect the encoding automatically for arbitrary input, and the file's contents are visible as plain text (txt, csv, etc.) with fields separated by, for example, a semicolon, you can manually create a new file with a .csv extension, paste the contents into it, and then parse it as usual.

Keep in mind that this is only a workaround. But when you just need to parse one big Excel file on Linux, converted to some flavour of CSV, it saves the time you would otherwise spend experimenting with all those fancy encodings.


0

Save the file in UTF-8. If for some reason you need to save it in a different encoding, specify that encoding when reading the file.
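If re-saving from Excel is not an option, a one-time conversion can also be done in Ruby. This is only a sketch under assumptions: the source is taken to be ISO-8859-1, and the sample file and the target_path name are invented for the example.

```ruby
require 'tempfile'

# Hypothetical Latin-1 source file.
source = Tempfile.new(['latin1_in', '.csv'])
source.binmode
source.write("name\nFran\xE7ois\n".b)
source.close

target_path = "#{source.path}.utf8.csv"

# Read the bytes as Latin-1, transcode in memory, write the UTF-8 result.
text = File.read(source.path, encoding: 'ISO-8859-1').encode('UTF-8')
File.write(target_path, text)

# Reading the new file as UTF-8 now yields a valid string.
converted = File.read(target_path, encoding: 'UTF-8')
puts converted.valid_encoding?

File.delete(target_path)
```

After the conversion, the new file can be parsed with the default UTF-8 settings.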


0

Add "r:ISO-8859-1" as the second argument: File.open("input_file", "r:ISO-8859-1")


0

I had this same problem and was working around it by opening the file in Google Spreadsheets and then downloading it as a CSV. That was the easiest solution.

Then I came across this gem

https://github.com/singlebrook/utf8-cleaner

Now I don't need to worry about this issue at all. Hope this helps!
