
POSIX defines a text file as "a file that contains characters organized into zero or more lines". However, according to the POSIX definition of a line, there are two possible "kinds" of text files with zero lines:

  • an empty file
  • a non-empty file consisting solely of a single incomplete line (a line with at least one character and no \n).

Is the second case above really considered a text file? Intuitively it's not, as it seems to be nothing more than an edge case. Still, such a file certainly has no lines (as POSIX defines them) because it does not contain a newline character, so it seems to adhere to the current POSIX definition of a text file. The only issue open to interpretation here may be whether such files can be considered "organized into lines", but personally I see no reason against it - nothing in the POSIX definition forces me to treat them differently from empty files which, despite being empty (and thus trivially having zero lines), are considered "organized into lines". Perhaps I'm splitting hairs here, but I think the definition, as it currently stands, requires a slight augmentation in order to exclude this case (unless I'm wrong and such files are indeed supposed to match the POSIX definition of a text file).

In this thread:

What conditions must be met for a file to be a text file as defined by POSIX?

the OP had similar doubts. In one of the replies to his question, someone claimed that "a text file shall not have incomplete lines". While this extra statement would indeed exclude the disputed case, such a restriction does not explicitly appear in the POSIX definition of a text file.

26
  • 1
    If you think foo is a valid text file since it contains zero lines, why not also assume that foo\nbar is a valid text file since it contains more than zero lines? Commented Dec 17, 2024 at 14:24
  • 1
    @ilkkachu Careful, you went one step too far! "foo\nbar" is NOT a valid text file, as it contains one "line" and one "incomplete line", NOT two lines (POSIX is very specific, albeit sometimes counterintuitive about what a "line" is). It's a mix-up of a line and trailing part which is not considered a line, hence your example does not constitute a valid text file. Commented Dec 17, 2024 at 14:30
  • 2
    @Peter, why not? It contains one line, so obviously "zero or more lines". Why would you accept the additional non-line in one case (zero complete lines) but not in the other (more than zero complete lines)? I don't see the definition make a difference between the cases of zero and more than zero. (And if the step I'm taking too far is the step of taking your argument into its logical conclusion, then, well, if you don't like that...) Commented Dec 17, 2024 at 14:34
  • 1
    @ilkkachu Because "foo\nbar" would violate the condition that "a text file needs to have its data organized into lines". While the condition is underspecified (especially in the "zero-lines" case, hence my thread), I think we can agree that it has a well-defined meaning for any file containing at least one line - in this common case it means that every character in the file is contained within some line, empty lines are allowed as well, but no content is allowed after the last line. Commented Dec 17, 2024 at 14:42
  • 2
    gonna contradict @ilkkachu there. The wording of the standard is clear about what a text file is and what a line is, and that an empty file explicitly is a text file. There's really no reason to appease any misinterpretation of that. (Peter to answer your question: literally the rest of the sentence, "characters organized into…" says that if a character is not part of a string of characters ended by \n, it can't be a text file). I'll be equally blunt as ilkkachu there: this is all pretty straightforward, and you electing to have a problem with it is a bit of a "Peter" thing :) and doesn't… Commented Dec 17, 2024 at 14:58

2 Answers

3

The full definition of a text file is:

A file that contains characters organized into zero or more lines. The lines do not contain NUL characters and none can exceed {LINE_MAX} bytes in length, including the <newline> character. Although POSIX.1-2017 does not distinguish between text files and binary files (see the ISO C standard), many utilities only produce predictable or meaningful output when operating on text files. The standard utilities that have such restrictions always specify "text files" in their STDIN or INPUT FILES sections.

This is admittedly a particularly confusing choice of words in the spec. The way I have understood it, after the discussion in the Q&A you linked to, is that the part of the spec you quote should be read with the emphasis on "organized into": "a file that contains characters organized into zero or more lines". That is, a file is a text file only if its contents are organized into lines.

Next, POSIX defines "lines" as:

A sequence of zero or more non- <newline> characters plus a terminating <newline> character.

And empty lines as:

A line consisting of only a <newline>; see also Blank Line.

So, we know that:

  1. A text file needs to have its data organized into lines.
  2. A text file is allowed to have 0 lines.
  3. A line must end with a \n.

Combining these three points, this means that:

  • An empty file is a text file because it has its data organized into zero lines.
  • A non-empty file whose last character is not a newline is not a text file. This is because its data are not organized into lines.
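
As a rough illustration of the points above, here is a minimal Python sketch of such a check. The function name `is_posix_text` and the default `line_max` of 2048 are assumptions made for the example (the real limit is whatever {LINE_MAX} is on a given system); the sketch also applies the NUL and length restrictions from the full definition quoted earlier.

```python
def is_posix_text(data: bytes, line_max: int = 2048) -> bool:
    """Sketch: does `data` look like a POSIX text file?

    `line_max` is an assumed stand-in for the system's {LINE_MAX}.
    """
    # An empty file is organized into zero lines, so it qualifies.
    if not data:
        return True
    # Non-empty content must end with a newline; otherwise the tail
    # is an incomplete line, which is not a line.
    if not data.endswith(b"\n"):
        return False
    # Drop the empty piece after the final newline, then check each line.
    for line in data.split(b"\n")[:-1]:
        if b"\0" in line:
            return False  # lines must not contain NUL
        if len(line) + 1 > line_max:
            return False  # length in bytes, including the newline
    return True
```

With this sketch, both `b""` and `b"foo\n"` are accepted, while `b"foo"` and `b"foo\nbar"` are rejected because of the trailing incomplete line.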

I found the way Ilmari expressed it to me in a comment helpful, so I include it here:

[...] a POSIX text file is any file whose contents matches[sic] the regexp (.{0,M}\n)* (implicitly anchored at both ends), where \n matches a newline and . matches any character that is not a newline, and M is a placeholder for the numeric value LINE_MAX-1. In particular, this implies that an empty file is a valid text file consisting of zero lines, but that any non-empty text file must end in a newline (since otherwise it would contain an incomplete line, and an incomplete line is not a line).
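
The regexp from that comment can be tried out directly. The following is only a sketch: `LINE_MAX = 2048` is an assumed value, and the names `TEXT_FILE_RE` and `matches_text_regex` are made up for the example. The pattern is applied to bytes rather than decoded characters because the definition states the limit in bytes, and, like the quoted regexp, it does not check for NUL characters.

```python
import re

LINE_MAX = 2048      # assumed value; the real limit is system-dependent
M = LINE_MAX - 1     # at most M non-newline bytes before each newline

# Bytes pattern: zero or more "up to M non-newline bytes, then a newline".
TEXT_FILE_RE = re.compile(rb"(?:[^\n]{0,%d}\n)*" % M)


def matches_text_regex(data: bytes) -> bool:
    """Return True if the whole of `data` matches the regexp (sketch)."""
    # fullmatch() provides the implicit anchoring at both ends.
    return TEXT_FILE_RE.fullmatch(data) is not None


if __name__ == "__main__":
    for sample in (b"", b"foo\n", b"foo", b"foo\nbar"):
        print(sample, matches_text_regex(sample))
```

Because the match is done on bytes, a line of 1024 two-byte UTF-8 characters plus a newline (2049 bytes) is rejected under the assumed limit even though it has far fewer than 2048 characters.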

10
  • But by applying this logic, both an empty file and a non-empty file without '\n' characters are "organized into zero lines", because both of them have zero "POSIX lines". Weird and artificial as it looks, if we consider a file with zero lines to "have data organized into lines", both equally apply. Or perhaps we need a more strict definition of a "file with data organized into lines" to make the proper distinction. Or, best of all, just add the statement "A text file shall not have incomplete lines." at the end of the POSIX definition which resolves the ambiguity. Commented Dec 17, 2024 at 13:34
  • You're quoting and linking to the previous version of the standard but that hasn't changed in the new version. Commented Dec 17, 2024 at 13:54
  • The (.{0,M}\n) vs "and M is a placeholder for the numeric value LINE_MAX-1" is inaccurate because the LINE_MAX limit is on a number of bytes, not characters, while . matches a character, not a byte. For instance, in a UTF-8 locale, a line could go over the limit if it has as few as LINE_MAX/4 characters because characters in UTF-8 can be encoded in up to 4 bytes. Commented Dec 17, 2024 at 13:58
  • @Peter yeah, I know, but I think the idea there is that a text file can only have lines, so non-\n characters without a \n at the end are not a line, therefore this file has non-lines and therefore is not a text file. Commented Dec 17, 2024 at 14:38
  • Thanks for the edit, @StéphaneChazelas. And yes, I am quoting the older specs because I started from the other answer, but as you confirm they're unchanged, I guess that's OK. As for the LINE_MAX, I'm sure you're right. The part that helped me understand was the regex bit, specifically that everything in the file had to match (.{0,M}\n)* which won't be the case if there are any non-\n characters that aren't terminated by a \n one. Commented Dec 17, 2024 at 14:40
2

Here's another way to look at this:

A file that contains characters organized into zero or more lines.

Your example file contains zero lines and a non-line.

The standard could be more clear by stating something like

A file that contains characters exclusively organized into zero or more lines.

or

A file that exclusively contains characters organized into zero or more lines.

Or something to that effect, but I think it is pretty clear as it stands now. There is no indication that the standard intends to allow a text file to contain non-lines or characters which aren't organized into lines.
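
The "zero lines and a non-line" reading can be made concrete with a small Python sketch. The helper name `split_lines` is hypothetical; it separates a file's bytes into complete lines (each ending in a newline) and whatever is left over after the last newline.

```python
def split_lines(data: bytes) -> tuple[list[bytes], bytes]:
    """Split `data` into complete lines and a trailing non-line (sketch)."""
    parts = data.split(b"\n")
    # Whatever follows the last newline is not a line; it is empty
    # when the data is empty or ends with a newline.
    remainder = parts.pop()
    # Re-attach the terminating newline to each complete line.
    lines = [part + b"\n" for part in parts]
    return lines, remainder


if __name__ == "__main__":
    for sample in (b"", b"foo", b"foo\n", b"foo\nbar"):
        lines, remainder = split_lines(sample)
        print(sample, "lines:", lines, "non-line:", remainder)
```

Under this reading, a file is a text file only when the remainder is empty: `b"foo"` yields zero lines plus the non-line `b"foo"`, whereas an empty file yields zero lines and nothing else.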
