string encoding in C# - strange characters

Question

I have a file that I need to import. The problem is that a lot of characters in that file come out wrong.

For example these names are wrong:

BjÃ¶rn (in file) - Should be Björn

Ã…ke (in file) - Should be Åke

Unfortunately I can't recreate the file with the correct encoding. Many other characters are wrong as well (these were just examples), so I can't do a search and replace on all of them unless there is a dictionary with all the conversions.

Can I decode the strings in some way?

Thanks, Patrik

Edit: Just some more info that I should have added before (I blame my tiredness). The file is an .xlsx file.

UTF-8? I'm not sure I understand your question: 1) do you know which encoding is used but not how to use it in .NET, or 2) are you looking for a way to determine the encoding?
– Ondrej Tucny, Commented Oct 13, 2011 at 21:06
You can try saving the file as Unicode: in Notepad, choose File, Save As, and pick Unicode. If the file was previously saved with the wrong encoding, the sender will have to resend it with the correct encoding. Unicode is preferred because it can represent all the characters. The same goes for opening: the file must be opened and read with the right encoding, otherwise some characters may not be read correctly.
– Jon Raynor, Commented Oct 13, 2011 at 21:07

David Heffernan · Accepted Answer · 2011-10-13 21:17:15Z

4

I debugged this with Notepad++. I copied the correct strings into Notepad++ and used Encoding | Convert to UTF-8. Then I selected Encoding | Encode as ANSI. This has the effect of interpreting the UTF-8 bytes as if they were ANSI. When I did this I ended up with the same erroneous values as you. So clearly, when you read the file, you are interpreting it as ANSI rather than UTF-8.
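The same experiment can be sketched in C#. This is a minimal illustration, not the code used in the answer: the class and method names are hypothetical, and it assumes that "ANSI" means Windows-1252 (the "…" in "Ã…ke" is consistent with that code page). On .NET Core and .NET 5+ the code page has to be registered first; on .NET Framework code page 1252 is built in, and `CodePagesEncodingProvider` is only available there through the System.Text.Encoding.CodePages package.

```csharp
using System;
using System.Text;

// Hypothetical helper, for illustration only.
public static class MojibakeDemo
{
    // Assumption: the "ANSI" code page involved is Windows-1252.
    private static Encoding GetAnsi()
    {
        // Required on .NET Core / .NET 5+ before code page 1252 can be looked up.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
        return Encoding.GetEncoding(1252);
    }

    // Reproduces the damage: encode as UTF-8, then decode the bytes as Windows-1252.
    public static string Garble(string correct)
    {
        byte[] utf8Bytes = Encoding.UTF8.GetBytes(correct);
        return GetAnsi().GetString(utf8Bytes);
    }

    // Reverses the damage: recover the original bytes, then decode them as UTF-8.
    public static string Repair(string garbled)
    {
        byte[] originalBytes = GetAnsi().GetBytes(garbled);
        return Encoding.UTF8.GetString(originalBytes);
    }
}
```

`Repair` only works if no bytes were lost along the way, for example by an earlier lossy conversion. Reading the file with the right encoding in the first place is the more robust fix.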

The conclusion, then, is that your file is encoded as UTF-8. Make sure that the file is interpreted as UTF-8 when you read it. I can't tell you exactly how to do that since you didn't show how you were reading the file in the first place.

It's possible that your file does not contain a byte-order-mark (BOM). If so then specify the encoding when you read the file by passing Encoding.UTF8.
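As a rough sketch of what passing Encoding.UTF8 could look like for a plain text file (the class and method names are hypothetical, and the question does not show how the file is actually read):

```csharp
using System.IO;
using System.Text;

// Hypothetical helper, for illustration only.
public static class Utf8Reader
{
    // Reads the whole file, decoding its bytes as UTF-8 whether or not a BOM is present.
    public static string ReadAllUtf8(string path)
    {
        using (var reader = new StreamReader(path, Encoding.UTF8))
        {
            return reader.ReadToEnd();
        }
    }
}
```

Naming the encoding explicitly documents the intent and avoids depending on whatever default the reading API happens to use.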

edited Oct 13, 2011 at 21:17

answered Oct 13, 2011 at 21:11

David Heffernan


1 Comment

PKK Over a year ago

Thanks a lot. You solved my problem!!! In Excel the characters in the file looked wrong (as I described earlier) and also when I imported the content with Linq to Excel. I saved the file (in Excel) to an ordinary text file and now the characters are correct.

Jon Skeet · Answer · 2011-10-13 21:11:39Z

0

I've just tried your first example, and it definitely looks like that's UTF-8.

It's unclear what you're using to look at the file in the first place, but if you load it with a text editor which understands UTF-8 and tell it that it's a UTF-8 file, it should be fine.

When you load it with .NET, you should just be able to use File.OpenText, File.ReadAllText etc - most IO dealing with encodings in .NET defaults to UTF-8 anyway.
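A small round trip illustrates that default. This is only a sketch with hypothetical names: it writes a temporary UTF-8 file without a BOM and reads it back with File.ReadAllText, passing no encoding at all.

```csharp
using System.IO;
using System.Text;

// Hypothetical helper, for illustration only.
public static class DefaultEncodingDemo
{
    // Writes text as UTF-8 without a BOM, then reads it back without naming an encoding.
    public static string RoundTrip(string text)
    {
        string path = Path.GetTempFileName();
        try
        {
            File.WriteAllText(path, text, new UTF8Encoding(encoderShouldEmitUTF8Identifier: false));
            return File.ReadAllText(path); // no encoding argument: falls back to UTF-8
        }
        finally
        {
            File.Delete(path);
        }
    }
}
```

This applies to plain text files. An .xlsx file is a zipped package rather than plain text, so it would have to be exported to text first before being read this way.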

answered Oct 13, 2011 at 21:11

Jon Skeet


1 Comment

David Heffernan Over a year ago

It's probably a UTF-8 file with no BOM
