
I have a problem reading a plain CSV file generated with Excel 2013. It seems the encoding is not handled correctly by the TStreamReader class. The strange thing is that one file works and the other does not. When reading the second file, TStreamReader returns an empty string from:

LString := FEncoding.GetString(LBuffer, StartIndex, ByteBufLen);

Both files use a single-byte ANSI encoding, but TStreamReader is using the UTF-8 encoding.

My code:

  fs := TFileStream.Create(aFileName, fmOpenRead or fmShareDenyNone);
  sr := TStreamReader.Create(fs);
  while (not sr.EndOfStream) do //sr.EndOfStream is always true!!!!
  begin
    //some code here
  end;

So far I have figured out that the following function returns zero, so no characters are decoded:

function TMBCSEncoding.GetCharCount(Bytes: PByte; ByteCount: Integer): Integer;
begin
  Result := UnicodeFromLocaleChars(FCodePage, FMBToWCharFlags,
    PAnsiChar(Bytes), ByteCount, nil, 0);
end;

When I compare both files, the calls receive the same inputs apart from the Bytes and ByteCount arguments. But Bytes starts with the same values in both cases (the same CSV header names).

So my question is: why does one file work and the other not? What can I do to read the files correctly?

1 Answer

The constructor for TStreamReader that you call is this one:

constructor TStreamReader.Create(Stream: TStream);
begin
  Create(Stream, TEncoding.UTF8, True);
end;

The True argument is DetectBOM. If a BOM is encountered, it determines the encoding; otherwise the file is treated as UTF-8. Your files don't have BOMs, so you are getting exactly what you asked for: the file is treated as UTF-8.

If you want the file treated as ANSI you must specify the encoding:

sr := TStreamReader.Create(fs, TEncoding.Default);

Or if you want to default to ANSI if no BOM is found, otherwise respect the BOM, you can do it like this:

sr := TStreamReader.Create(fs, TEncoding.Default, True);
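
Putting it together with the question's code, a minimal sketch (the procedure name LoadFile and the try/finally blocks are my additions, not from the original post):

```delphi
uses
  System.SysUtils, System.Classes;

procedure LoadFile(const aFileName: string);
var
  fs: TFileStream;
  sr: TStreamReader;
  Line: string;
begin
  fs := TFileStream.Create(aFileName, fmOpenRead or fmShareDenyNone);
  try
    // TEncoding.Default = the active ANSI code page;
    // True = still honour a BOM if one is present
    sr := TStreamReader.Create(fs, TEncoding.Default, True);
    try
      while not sr.EndOfStream do
      begin
        Line := sr.ReadLine;
        // process the CSV line here
      end;
    finally
      sr.Free;
    end;
  finally
    fs.Free;
  end;
end;
```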

Why does your code work with one file but not the other? Presumably one file is entirely in the ASCII range, and the other has characters outside that range. UTF-8 encodes characters in the ASCII range in a single byte which means that ASCII encoded files are correctly interpreted by the UTF-8 encoding. That was one of the primary design goals of UTF-8.
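
To illustrate (a sketch of my own, not from the original answer): the byte $E4 is 'ä' in Windows-1252 but is not a valid UTF-8 sequence on its own, so decoding it as UTF-8 fails, while pure-ASCII bytes decode identically under both encodings.

```delphi
uses
  System.SysUtils;

var
  AsciiBytes, AnsiBytes: TBytes;
begin
  AsciiBytes := TBytes.Create($48, $69);       // 'Hi' - valid ASCII, also valid UTF-8
  AnsiBytes  := TBytes.Create($E4, $62, $63);  // 'äbc' in Windows-1252, invalid UTF-8

  Writeln(TEncoding.UTF8.GetString(AsciiBytes));  // decodes as 'Hi'

  // For the invalid bytes, depending on the Delphi version, this either
  // returns an empty string or raises EEncodingError - mirroring the
  // empty result seen in the question:
  Writeln(TEncoding.UTF8.GetString(AnsiBytes));
end;
```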


4 Comments

Thanks. Is it possible to auto-detect the encoding? I don't know which encodings my customers will use in the future.
Well, CSV files from Excel are always going to be ANSI, I think. Of course, you might not know which ANSI code page. Auto-detecting an encoding is, in general, impossible to get exactly right. I don't really want to give you any more advice because I don't know your problem in detail: I don't know where the files come from, whether it is practical to ask the user to specify the encoding, and so on.
Some online banking portals allow exporting data as CSV files. I encountered different providers generating their CSV files with different encodings. I found a solution and will update the question. I hope it is more or less bullet-proof.
I rolled back your question edit. It seemed to ask a different question. If you want to detect a BOM and use the appropriate encoding, use the code in my updated answer. Remember though that my original answer was when you stated that the files were always ANSI.
