8

I mainly deal with English letters and punctuation marks, so I don't have to worry about European accents. My only concern is when a user pastes text copied from the web that includes, for instance, an apostrophe. When I do a puts in the console (on Win7), it outputs:

"ItΓÇÖs" # whereas it actually is "It's"

So my main question is: is there an end-all conversion method in Ruby that properly replaces all of the ,.;?!"'~` _- characters with their ASCII counterparts?

I understand very little about encodings. If you think this is the wrong question to ask, which may well be the case, please advise me on what I should look for instead.

Thank you

3 Answers

6

I work in publishing, where we deal with this a lot. We have had success with stringex (https://github.com/rsl/stringex). It provides a to_ascii method that normalizes Unicode dashes and similar characters.

2

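In Ruby 2.0 you can also drop every character that has no ASCII equivalent (note that this removes the apostrophe rather than converting it):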

"ItΓÇÖs".encode("ASCII", invalid: :replace, undef: :replace, replace: '')
 => "Its" 
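If the goal is to keep the punctuation instead of dropping it, one possible approach is to map the common "smart" characters to ASCII first and only then encode. The following is a minimal sketch, not a complete solution: the SMART_PUNCTUATION table and the to_plain_ascii helper are hypothetical names introduced for illustration, the table covers only a handful of characters, and the input is assumed to already be valid UTF-8.

```ruby
# Hypothetical mapping of common "smart" punctuation to ASCII.
# Extend it with whatever characters actually show up in your input.
SMART_PUNCTUATION = {
  "\u2018" => "'",   # left single quotation mark
  "\u2019" => "'",   # right single quotation mark (curly apostrophe)
  "\u201C" => '"',   # left double quotation mark
  "\u201D" => '"',   # right double quotation mark
  "\u2013" => "-",   # en dash
  "\u2014" => "-",   # em dash
  "\u2026" => "...", # horizontal ellipsis
  "\u00A0" => " "    # no-break space
}.freeze

SMART_PUNCTUATION_PATTERN = Regexp.union(SMART_PUNCTUATION.keys)

# Hypothetical helper: assumes +text+ is valid UTF-8.
# Known characters are mapped through the table; anything else that
# has no ASCII equivalent is replaced with "?".
def to_plain_ascii(text)
  text
    .gsub(SMART_PUNCTUATION_PATTERN, SMART_PUNCTUATION)
    .encode("ASCII", invalid: :replace, undef: :replace, replace: "?")
end

puts to_plain_ascii("It\u2019s")
```

Replacing unknown characters with "?" rather than an empty string is a design choice made here so that unmapped characters stay visible and the table can be extended as they turn up.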

1

For programmatically handling multibyte encodings, iconv is your friend. James Gray also wrote a series of blog articles on how to take the problem apart and convert between encodings.

The problem gets more complicated with pasted text, because some characters could be in one multibyte encoding and other characters in another. You might have to walk the string checking for multibyte characters, ask Ruby what the encoding is, convert it to the expected or desired encoding if it's not what you expect, and then move on to the next character. Gray's articles cover it all nicely and are good reading.
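As a rough illustration of the "check, then convert" idea, here is a minimal sketch that works on a whole string rather than character by character. It uses String#encode and String#valid_encoding? from core Ruby rather than iconv. The normalize_to_utf8 helper is a hypothetical name, and the choice of Windows-1252 as the fallback is an assumption about where the pasted text came from, not something Ruby can detect for you.

```ruby
# Hypothetical helper: returns a UTF-8 copy of +raw+.
# If the bytes are already valid UTF-8 they are kept as they are.
# Otherwise they are assumed (not detected) to be in +fallback_encoding+
# and transcoded, with unconvertible bytes replaced.
def normalize_to_utf8(raw, fallback_encoding = "Windows-1252")
  as_utf8 = raw.dup.force_encoding("UTF-8")
  return as_utf8 if as_utf8.valid_encoding?

  raw.dup
     .force_encoding(fallback_encoding)
     .encode("UTF-8", invalid: :replace, undef: :replace)
end

# The byte 0x92 is the curly apostrophe in Windows-1252.
legacy_bytes = "It\x92s".b
puts normalize_to_utf8(legacy_bytes).valid_encoding?
```

Text that genuinely mixes encodings inside one string would need a finer-grained pass than this, as described above.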
