  • I created a console application in C#. The program refreshes host names in my database while reading data from an Excel sheet using ExcelDataReader.

    The Excel file is saved in .CSV format. One of the rows has a foreign name, for example:

    AZULCÃNEPA
    
    

    However, when I open it in my program or even in Notepad, the name appears incorrectly as:

    AZULCÁNEPA
    
    

    If I save the file as CSV UTF-8 or as .XLSX, then I get the correct name (AZULCÃNEPA) in my program.

    Why does this happen when saving as normal .CSV, and how can I make ExcelDataReader read the correct characters?

What I tried:

Opened the .CSV file in Notepad and confirmed the issue is the same (foreign characters appear corrupted).

Tried reading the file using ExcelDataReader in C#, but still got the corrupted text (AZULCÁNEPA).

Tried reading the file with FallbackEncoding = Encoding.UTF8.

Tried to convert the file from UTF-8 to UTF-8 with BOM in code.

Tried to re-encode the file's contents: File.WriteAllText(outputPath1, File.ReadAllText(inputPath, Encoding.Default), new UTF8Encoding(true));

Saved the file again in UTF-8 CSV and also as .XLSX — in both cases, the characters display correctly in my program.

What I expected: The characters should appear as AZULCÃNEPA even when the Excel file is saved as a normal .CSV (not just as CSV UTF-8 or .XLSX). If that is not possible, what can I do to read the name correctly?

  • What do you mean by "saving as normal .csv"? I suspect that's using the system default encoding - which is basically a bad idea. Why do you object to using UTF-8 here? Commented Sep 2 at 7:24
  • Your program must open the file with the same encoding it was saved in. Unfortunately, text files don't store their encoding. When you open a text file, the opening program must guess the encoding, so to save headaches, read and write all text in UTF-8 explicitly. Commented Sep 2 at 7:31
  • It'll be using a Windows Code Page. Try specifying Encoding.Default when you open it. But if you have control over the CSV encoding, use UTF-8 to fix it. Commented Sep 2 at 7:55
  • Then tell them to fix it ;) Commented Sep 2 at 7:57
  • "im getting the excel from my team" - then I suggest you ask your team to save it as UTF-8. It's either that, or find out exactly what encoding is being used however they save it. Commented Sep 2 at 8:09

1 Answer

Text files carry no information about the encoding they use. When a program opens a text file for reading or writing, it has to choose the encoding somewhat blindly. More complex file formats specify how the text in them is to be interpreted, which is why you see the problem with CSV and not with XLSX.

Decoding with the wrong encoding may fail outright, but it can also silently produce wrong text, so you cannot rely on a failure to catch the problem in your case. There are also heuristics that 'guess' the encoding of a file by checking whether the decoded text makes sense, but this is an imperfect process. The only 'safe' way is to specify the correct encoding. This is why your attempt to solve this with "FallbackEncoding" does not work: the fallback encoding is only used when decoding fails, as per the source-code comment on ExcelReaderConfiguration.
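To illustrate the point about silent corruption, here is a minimal sketch using only System.Text. The sample bytes and variable names are made up for the demonstration, and it assumes C# top-level statements (.NET 6 or later). It decodes the same character with the wrong encoding in both directions and shows that neither call throws unless a strict decoder is requested.

```csharp
using System;
using System.Text;

// "Á" (U+00C1) is one byte (0xC1) in ISO-8859-1 / Windows-1252,
// but two bytes (0xC3 0x81) in UTF-8.
byte[] legacyBytes = { 0x41, 0xC1, 0x42 };                 // "AÁB" in a single-byte code page
byte[] utf8Bytes = Encoding.UTF8.GetBytes("A\u00C1B");     // 0x41 0xC3 0x81 0x42

Encoding latin1 = Encoding.GetEncoding("iso-8859-1");

// Legacy bytes read as UTF-8: 0xC1 is not valid UTF-8, so the default
// decoder quietly substitutes U+FFFD (the replacement character).
string legacyAsUtf8 = Encoding.UTF8.GetString(legacyBytes);

// UTF-8 bytes read as Latin-1: every byte maps to some character,
// so the two bytes of "Á" become two unrelated characters.
string utf8AsLatin1 = latin1.GetString(utf8Bytes);

// A strict UTF-8 decoder turns the silent corruption into an exception.
var strictUtf8 = new UTF8Encoding(encoderShouldEmitUTF8Identifier: false, throwOnInvalidBytes: true);
bool strictFailed = false;
try
{
    strictUtf8.GetString(legacyBytes);
}
catch (DecoderFallbackException)
{
    strictFailed = true;
}

Console.WriteLine(legacyAsUtf8 == "A\uFFFDB");  // true: replaced, no exception
Console.WriteLine(utf8AsLatin1.Length);         // 4: one character became two
Console.WriteLine(strictFailed);                // true: only the strict decoder fails
```

A fallback that only kicks in on a decoding failure therefore never triggers in the second case, because single-byte code pages accept any byte sequence.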

Most of the computing world has adopted UTF-8 as the standard encoding. Windows has a system default encoding that is usually not UTF-8, so every program on Windows (and sometimes the user) must choose between the Windows default and the world's standard. Excel and Notepad seem to have made different choices here, and both choices make sense in isolation: Excel puts a high priority on backwards compatibility, so the output of 'save to csv' remains non-UTF-8 even in newer versions, whereas Notepad wants to open as many files as possible correctly.

The easiest solution to this problem is to use only UTF-8 for all text files you handle. So make sure the Excel files are saved as CSV UTF-8.
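Files that already exist in a legacy encoding can also be converted to UTF-8 once, before anything else reads them. The following is only a sketch under stated assumptions: the source encoding is assumed to be Windows-1252, ConvertToUtf8 is a hypothetical helper name, and the temp files and sample text stand in for the real CSV. It targets .NET Core 3.0 or later, where CodePagesEncodingProvider ships with the framework.

```csharp
using System;
using System.IO;
using System.Text;

// Legacy code pages such as 1252 are not available on .NET Core until registered.
Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
Encoding sourceEncoding = Encoding.GetEncoding(1252);   // assumption: the file's real encoding

// Hypothetical helper: decode with the known source encoding, re-encode as UTF-8 with a BOM.
static void ConvertToUtf8(string inputPath, string outputPath, Encoding source)
{
    string text = File.ReadAllText(inputPath, source);
    File.WriteAllText(outputPath, text, new UTF8Encoding(encoderShouldEmitUTF8Identifier: true));
}

// Stand-in for the legacy CSV: "CAFÉ" with É stored as the single byte 0xC9.
string input = Path.GetTempFileName();
string output = Path.GetTempFileName();
File.WriteAllBytes(input, sourceEncoding.GetBytes("1,CAF\u00C9\n"));

ConvertToUtf8(input, output, sourceEncoding);

byte[] converted = File.ReadAllBytes(output);
bool hasBom = converted.Length >= 3
    && converted[0] == 0xEF && converted[1] == 0xBB && converted[2] == 0xBF;
string roundTripped = File.ReadAllText(output, Encoding.UTF8);

Console.WriteLine(hasBom);                          // true
Console.WriteLine(roundTripped == "1,CAF\u00C9\n"); // true

File.Delete(input);
File.Delete(output);
```

The conversion only helps if the source encoding passed in is the one the file was really written with. Passing Encoding.Default on .NET Core means decoding as UTF-8, which would leave a code-page file corrupted.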

If saving as UTF-8 is not an option, you need to specify the encoding when reading the file. FallbackEncoding is the wrong way to do this, for the reasons stated above. The encoding needs to be specified when you open the file, not when you call CreateCsvReader on it.

So try:

StreamReader reader = new StreamReader(inputFilePath, Encoding.GetEncoding(0), true);

In .NET Framework, Encoding.Default is equivalent to Encoding.GetEncoding(0), but in .NET Core Encoding.Default is always UTF-8. https://learn.microsoft.com/en-us/dotnet/api/system.text.encoding.default?view=net-9.0&redirectedfrom=MSDN#remarks

This of course relies on the CSV being encoded in your computer's default encoding. If it uses yet another encoding, you have to be more explicit than this.
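Being more explicit could look like the sketch below. It assumes the file was written in Windows-1252, which you would need to confirm with whoever produces it. The temp file and its contents are invented for the demonstration, and the code targets .NET Core 3.0 or later with top-level statements.

```csharp
using System;
using System.IO;
using System.Text;

// On .NET Core, code page encodings must be registered before GetEncoding(1252) works.
Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
Encoding windows1252 = Encoding.GetEncoding(1252);   // assumption: the CSV's actual encoding

// Hypothetical sample file standing in for the CSV received from the team.
string path = Path.GetTempFileName();
File.WriteAllBytes(path, windows1252.GetBytes("Id,Name\n1,CAF\u00C9\n"));

string[] lines;
// The explicit encoding is used unless the file starts with a byte order mark,
// in which case the BOM wins (third argument).
using (var reader = new StreamReader(path, windows1252, detectEncodingFromByteOrderMarks: true))
{
    lines = reader.ReadToEnd().Split('\n', StringSplitOptions.RemoveEmptyEntries);
}

Console.WriteLine(lines[1] == "1,CAF\u00C9");   // true: É decoded correctly

File.Delete(path);
```

Naming the code page also makes the program behave the same on every machine, whereas the system default varies with the Windows locale.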


3 Comments

Note that on .NET Core, Encoding.Default will be UTF8.
@MatthewWatson Good point! I assumed it meant System default, but this is only the case in .NET Framework. I have corrected the post.
For .NET Core you can get a specific code page encoding via (for example): var encoding = CodePagesEncodingProvider.Instance.GetEncoding(1252); - That gets the Windows-1252 code page (often called Latin 1), the one for Western Europe. However, I tested the OP's string with that, and converting "AZULCÃNEPA" to that code page and then converting the result back to UTF-8 doesn't produce the string the OP sees. So I have no idea what's going on there...
