
I have a problem reading a plain CSV file generated with Excel 2013. It seems the encoding is not handled correctly by the TStreamReader class. The strange thing is that one file works and the other does not. When reading the second file, TStreamReader returns an empty string from:

LString := FEncoding.GetString(LBuffer, StartIndex, ByteBufLen);

Both files use a single-byte ANSI encoding, but TStreamReader is using the UTF-8 encoding.

My code:

  fs := TFileStream.Create(aFileName, fmOpenRead or fmShareDenyNone);
  sr := TStreamReader.Create(fs);
  while (not sr.EndOfStream) do //sr.EndOfStream is always true!!!!
  begin
    //some code here
  end;

So far I have figured out that the following function returns zero, so no characters are decoded:

function TMBCSEncoding.GetCharCount(Bytes: PByte; ByteCount: Integer): Integer;
begin
  Result := UnicodeFromLocaleChars(FCodePage, FMBToWCharFlags,
    PAnsiChar(Bytes), ByteCount, nil, 0);
end;

When I compare both files, the calls receive the same inputs apart from the Bytes and ByteCount arguments. But Bytes starts with the same values in both cases (the same CSV header names).

So my question is: why does one file work and the other not? What can I do to read the files correctly?

1 Answer

The constructor for TStreamReader that you call is this one:

constructor TStreamReader.Create(Stream: TStream);
begin
  Create(Stream, TEncoding.UTF8, True);
end;

The True argument is DetectBOM. If a BOM is encountered, it determines the encoding; otherwise the file is treated as UTF-8. Your files don't have BOMs, so you are getting exactly what you asked for: the file is treated as UTF-8.

If you want the file treated as ANSI you must specify the encoding:

sr := TStreamReader.Create(fs, TEncoding.Default);

Or if you want to default to ANSI if no BOM is found, otherwise respect the BOM, you can do it like this:

sr := TStreamReader.Create(fs, TEncoding.Default, True);
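
Putting it together with the question's code, a minimal sketch (the procedure name LoadFile and the try/finally blocks are my additions, not from the original post):

```delphi
uses
  System.SysUtils, System.Classes;

procedure LoadFile(const aFileName: string);
var
  fs: TFileStream;
  sr: TStreamReader;
  Line: string;
begin
  fs := TFileStream.Create(aFileName, fmOpenRead or fmShareDenyNone);
  try
    // TEncoding.Default = the active ANSI code page;
    // True = still honour a BOM if one is present
    sr := TStreamReader.Create(fs, TEncoding.Default, True);
    try
      while not sr.EndOfStream do
      begin
        Line := sr.ReadLine;
        // process the CSV line here
      end;
    finally
      sr.Free;
    end;
  finally
    fs.Free;
  end;
end;
```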

Why does your code work with one file but not the other? Presumably one file is entirely in the ASCII range, and the other has characters outside that range. UTF-8 encodes characters in the ASCII range in a single byte which means that ASCII encoded files are correctly interpreted by the UTF-8 encoding. That was one of the primary design goals of UTF-8.
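
To illustrate (a sketch of my own, not from the original answer): the byte $E4 is 'ä' in Windows-1252 but is not a valid UTF-8 sequence on its own, so decoding it as UTF-8 fails, while pure-ASCII bytes decode identically under both encodings.

```delphi
uses
  System.SysUtils;

var
  AsciiBytes, AnsiBytes: TBytes;
begin
  AsciiBytes := TBytes.Create($48, $69);       // 'Hi' - valid ASCII, also valid UTF-8
  AnsiBytes  := TBytes.Create($E4, $62, $63);  // 'äbc' in Windows-1252, invalid UTF-8

  Writeln(TEncoding.UTF8.GetString(AsciiBytes));  // decodes as 'Hi'

  // For the invalid bytes, depending on the Delphi version, this either
  // returns an empty string or raises EEncodingError - mirroring the
  // empty result seen in the question:
  Writeln(TEncoding.UTF8.GetString(AnsiBytes));
end;
```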


4 Comments

Thanks. Is it possible to auto-detect the encoding? I don't know which encodings my customers will use in the future.
Well, CSV files from Excel are always going to be ANSI, I think. Of course, you might not know which ANSI code page. Auto-detecting an encoding is, in general, impossible to get exactly right. I don't really want to give you any more advice because I don't know your problem in detail: I don't know where the files come from, whether it is practical to ask the user to specify the encoding, and so on.
Some online banking portals allow exporting data as CSV files. I encountered different providers generating their CSV files with different encodings. I found a solution and will update the question. I hope it is more or less bullet-proof.
I rolled back your question edit. It seemed to ask a different question. If you want to detect a BOM and use the appropriate encoding, use the code in my updated answer. Remember though that my original answer was when you stated that the files were always ANSI.
