I am opening a file and placing its contents into a string buffer to do some lexical analysis on a per-character basis. Doing it this way lets parsing finish faster than issuing a series of fread() calls, and since the source file will never be larger than a couple of MB, I can rest assured that the entire contents of the file will always be read.
However, there seems to be some trouble in detecting when there is no more data to be parsed, because ftell() often gives me an integer value higher than the actual number of characters within the file. This wouldn't be a problem with the use of the EOF (-1) macro, if the trailing characters were always -1... But this is not always the case...
Here's how I am opening the file, and reading it into the string buffer:
FILE *fp = NULL;
errno_t err = _wfopen_s(&fp, m_sourceFile, L"rb, ccs=UNICODE");
if(fp == NULL || err != 0) return FALSE;
if(fseek(fp, 0, SEEK_END) != 0) {
fclose(fp);
fp = NULL;
return FALSE;
}
LONG fileSize = ftell(fp);
if(fileSize == -1L) {
fclose(fp);
fp = NULL;
return FALSE;
}
rewind(fp);
LPSTR s = new char[fileSize];
RtlZeroMemory(s, sizeof(char) * fileSize);
DWORD dwBytesRead = 0;
if(fread(s, sizeof(char), fileSize, fp) != fileSize) {
fclose(fp);
fp = NULL;
return FALSE;
}
This always appears to work perfectly fine. Following this is a simple loop, which checks the contents of the string buffer one character at a time, like so:
char c = 0;
LONG nPos = 0;
while(c != EOF && nPos <= fileSize)
{
c = s[nPos];
// do something with 'c' here...
nPos++;
}
The trailing bytes of the file are usually a series of ý (-3) and « (-85) characters, and therefore EOF is never detected. Instead, the loop simply continues onward until nPos ends up greater than fileSize, which is not desirable for proper lexical analysis, because you often end up skipping the final token in a stream that omits a newline character at the end.
In a Basic Latin character set, would it be safe to assume that an EOF char is any character with a negative value? Or perhaps there is just a better way to go about this?
#EDIT: I have just tried using the feof() function in my loop, and all the same, it doesn't seem to detect EOF either.
nPos == fileSize is one beyond the end of the memory you allocated. fread() won't report EOF; you asked to read what was in the file. If you tried getc(fp) after the fread(), you'd get EOF unless the file had grown since you measured how long it is. Since _wfopen_s() is a non-standard function, it might be affecting how ftell() behaves and the value it reports.
No; it is not safe to assume that any negative char value is EOF. The type plain char may be signed or unsigned.
new[fileSize]. It probably isn't idiomatic C++, but it is definitely not C.
fgetc() or fgetwc(), depending on how you're handling the file itself, and is not related to your modus. But you're opening the file in binary mode, which I honestly didn't even know was supported with a ccs encoding mode.
Your buffer should be properly sized in bytes if you used your calculated file length + 1 (the +1 for the terminator). If opening in binary mode and specifying an encoding hint to request a BOM analysis works, so much the better.
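To make the points above concrete, here is a minimal sketch of one way to avoid the problem. It rests on assumptions that differ from the question's code: the file is opened with plain fopen() in "rb" mode rather than _wfopen_s() with a ccs flag, and the names readWholeFile and countTokens are hypothetical helpers invented for illustration. The idea is to trust the byte count that fread() returns, keep the bytes as unsigned char, and end the loop on the buffer length with a strict less-than comparison instead of looking for an EOF value in the data.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdio>
#include <vector>

// Hypothetical helper: reads the whole file into `out`.
// Returns false if the file cannot be opened, measured, or read.
bool readWholeFile(const char* path, std::vector<unsigned char>& out)
{
    assert(path != nullptr);
    std::FILE* fp = std::fopen(path, "rb");   // assumption: plain binary mode
    if (fp == nullptr) return false;

    bool ok = false;
    if (std::fseek(fp, 0, SEEK_END) == 0) {
        const long size = std::ftell(fp);
        if (size >= 0 && std::fseek(fp, 0, SEEK_SET) == 0) {
            out.resize(static_cast<std::size_t>(size));
            const std::size_t got =
                out.empty() ? 0 : std::fread(out.data(), 1, out.size(), fp);
            out.resize(got);                  // keep only the bytes actually read
            ok = (std::ferror(fp) == 0);
        }
    }
    std::fclose(fp);
    return ok;
}

// Hypothetical scanner: counts whitespace-separated tokens.
// The loop ends on the buffer length; no byte value is treated as EOF.
std::size_t countTokens(const std::vector<unsigned char>& buf)
{
    std::size_t tokens = 0;
    bool inToken = false;
    for (std::size_t pos = 0; pos < buf.size(); ++pos) {   // '<', not '<='
        const unsigned char c = buf[pos];
        const bool isSpace = (c == ' ' || c == '\t' || c == '\r' || c == '\n');
        if (!isSpace && !inToken) ++tokens;   // a new token starts here
        inToken = !isSpace;
    }
    return tokens;
}
```

Because the loop stops at buf.size(), a final token with no trailing newline is still seen, and a data byte such as 0xFF is never mistaken for EOF. EOF is an int returned by functions like fgetc(); it is not a character stored in the file.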