
I'm trying to find the "right" way to read files line-by-line.

I have been using for line in $(cat "$FILE"); do for a while, and I really enjoy its clarity.

I know that while IFS= read -r line; do ... done < "$FILE" should be more optimized (without a subshell), but I don't like that the file is specified at the end of the loop. Also, when testing it, I encountered some weird issues with variable scopes.
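(I suspect the scope issues were the classic pipe-into-while gotcha rather than the redirect form itself; a minimal sketch of what I mean, with a throwaway counter used purely for illustration:)

count=0

# Piping into the loop runs the loop body in a subshell,
# so the change to count is lost once the loop ends.
cat "$FILE" | while IFS= read -r line; do
  count=$((count + 1))
done
echo "after pipe:     $count"   # prints 0

count=0

# Redirecting keeps the loop in the current shell, so count survives.
while IFS= read -r line; do
  count=$((count + 1))
done < "$FILE"
echo "after redirect: $count"   # prints the actual line count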

Recently I found out about mapfile -t LINES < $FILE, which is supposed to be super-optimized and looks cleaner than while read, but my tests show that it is only faster on very small files.

So, my question is: does it make any sense to use other methods rather than for line in $(cat "$FILE"); do? The only scenario I can imagine where it would be slower is reading thousands of small files in a loop. In other cases, the difference seems negligible while sacrificing readability.

I took files of various sizes and a script (below) to compare them:

################ test-med.txt (140698 lines) ###################
for line in $(cat "$FILE"); do

real    0m0,924s
user    0m0,812s
sys     0m0,128s
while IFS= read -r line; do

real    0m1,328s
user    0m1,113s
sys     0m0,215s
mapfile -t LINES < $FILE

real    0m1,240s
user    0m1,129s
sys     0m0,111s
################ test-small.txt (180 lines) ###################
for line in $(cat "$FILE"); do

real    0m0,050s
user    0m0,001s
sys     0m0,049s
while IFS= read -r line; do

real    0m0,001s
user    0m0,001s
sys     0m0,000s
mapfile -t LINES < $FILE

real    0m0,011s
user    0m0,006s
sys     0m0,005s
################ test-tiny.txt (32 lines) ###################
for line in $(cat "$FILE"); do

real    0m0,050s
user    0m0,000s
sys     0m0,050s
while IFS= read -r line; do

real    0m0,000s
user    0m0,000s
sys     0m0,000s
mapfile -t LINES < $FILE

real    0m0,000s
user    0m0,000s
sys     0m0,000s

Comparison script used:

#!/bin/bash


_t1() {
  IFS=$'\n'
  for line in $(cat "$FILE"); do
    echo "$line"
  done
}

_t2() {
  while IFS= read -r line; do
    echo "$line"
  done < "$FILE"
}

_t3() {
  mapfile -t LINES < $FILE
  for line in "${LINES[@]}"; do
    echo $line
  done
}


for FILE in $(ls *.txt); do
  CNT=$(cat $FILE | wc -l)
  echo "################ $FILE ($CNT lines) ###################"

  echo 'for line in $(cat "$FILE"); do'
  time _t1 >/dev/null

  echo 'while IFS= read -r line; do'
  time _t2 >/dev/null

  echo 'mapfile -t LINES < $FILE'
  time _t3 >/dev/null
done

  • for line in $(cat "$FILE") and while IFS= read -r line do different things, so a performance comparison by itself doesn't make sense. In general, if you care about performance, you probably shouldn't use a shell, and definitely not Bash. Also, there's never any reason to use $(ls *.txt), it can only break things. Commented May 14 at 11:11
  • Don't use for line in $(cat "$FILE"); do as it breaks when the input contains spaces and/or globbing metachars, and would skip any blank lines. Get a robust solution first and then think about performance. Commented May 14 at 12:21
  • As long as you can guarantee that the file/path names do not contain spaces/tabs/newlines, then your cat should probably do just fine; also, your argument is a textbook classic example of why you DRLWF Commented May 14 at 12:23
  • Your test of mapfile is leading you to the wrong conclusion. If you want to compare how fast mapfile works to how fast an equivalent read-loop works you should be comparing only mapfile -t LINES < "$FILE" to while IFS= read -r line; do LINES+=( "$line" ); done, i.e. how fast can you populate an array from file contents, not how fast can you print the contents of a file as your current code is implementing. If you don't need an array then you wouldn't use mapfile (aka readarray) as it exists to populate an array. (See the sketch after these comments.) Commented May 14 at 12:33
  • Anyone using "bash" and "performance" in the same Question is barking up the wrong tree. Interpreted languages are inherently slower. Commented May 14 at 22:40
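A rough sketch of the array-population comparison suggested in the mapfile comment above; the file name and function names here are only illustrative:

#!/bin/bash
# Compare mapfile against an equivalent read loop at the same task:
# filling an array with the lines of a file.
FILE=test-med.txt   # illustrative; point this at a real test file

_by_mapfile() {
  mapfile -t LINES < "$FILE"
}

_by_read_loop() {
  LINES=()
  while IFS= read -r line; do
    LINES+=( "$line" )
  done < "$FILE"
}

time _by_mapfile
time _by_read_loop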

2 Answers


In almost every case, you should not read a file line-by-line using a shell loop, at least not as your first instinct.

Often when someone uses a line-by-line loop in shell, what they actually need is a text processing tool like awk, sed, cut, jq, or perl.
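For instance, a job that might tempt you into a line-by-line loop, such as printing every matching line with its line number, is a single pass in awk (illustrative example, not taken from the question):

# Print every line containing "ERROR", prefixed with its line number.
# One awk process reads the whole file; no shell loop, no per-line subshells.
awk '/ERROR/ { print NR ": " $0 }' "$file"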

That being said, if you really do need to read a file line-by-line in a shell script, the correct and safe way is:

while IFS= read -r line; do
  # process "$line"
done < "$file"

Why this is preferred:

  • Preserves whitespace: setting IFS= prevents word splitting.
  • Reads lines properly: using read -r ensures backslashes are not interpreted.
  • Handles all lines safely: even empty lines or lines with leading/trailing spaces.

Issues with the for loop

Word splitting

The for loop doesn't iterate over lines; it iterates over words, splitting on all whitespace (spaces, tabs, newlines). This means:

$ cat test.txt
foo bar   baz
$ for line in $(cat test.txt); do echo "line is $line"; done
line is foo
line is bar
line is baz
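
For comparison, the read loop hands you the same line intact:

$ while IFS= read -r line; do echo "line is $line"; done < test.txt
line is foo bar   baz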

Backslashes and quoting can get mangled

Any backslashes in the file may be misinterpreted by some versions of bash.

It spawns a subshell

Not always an issue, but it's unnecessary overhead: the command substitution spawns a subshell and reads the entire file into memory before the loop even starts. And once you start piping output into or out of loops, you can also get bitten by subshell scoping issues.

It fails on empty lines

Empty lines are lost entirely in the word splitting.
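
For example, with a hypothetical three-line file whose middle line is blank:

$ printf 'first\n\nthird\n' > blank.txt
$ for line in $(cat blank.txt); do echo "line is $line"; done
line is first
line is third

The blank line simply disappears, whereas the while IFS= read -r loop would produce an (empty) iteration for it.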


mapfile (or readarray) is great for reading the whole file into an array at once, but it only works in Bash, not POSIX sh.
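
If you do want the whole file in an array, a minimal sketch (note the quoted "$file", unlike the unquoted redirection in the question's script, and printf instead of echo so a line like -n prints literally):

# Read the whole file into an array, then iterate over it.
mapfile -t lines < "$file"
for line in "${lines[@]}"; do
  printf '%s\n' "$line"
done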

  • Backslashes in $(cat test.txt) are only a "problem" in some versions of bash (namely 5.0, before that misfeature was reverted) where backslash is treated as a glob operator, or before other glob operators on all versions. One of the main problems of $(cat test.txt) (other than it splits on characters of $IFS, not newline, so you need IFS=$'\n') is that it undergoes globbing. Commented May 14 at 11:17
  • For reference, see Why is using a shell loop to process text considered bad practice? and Busy box Read file line by line. Commented May 14 at 11:20
  • Also note that the question seems to be alternating between echo "$line" and echo $line, which may do different things, including interpreting the contents of the variable as an option to echo and expanding backslash-encoded tabs and newlines, depending on shell settings. Commented May 15 at 8:20
  • also iirc bash variables cannot contain null bytes, while UTF-8 can legally contain null bytes. bash variables are not UTF-8 compatible! (see Java's MUTF-8) Commented May 16 at 4:05

It's a somewhat interesting question, but my advice is not to spend a lot of time worrying about it. The mechanism by which the shell reads and executes the commands in your script will, in most cases, swamp any attempt at optimizing its file I/O. Interpreting and executing the script itself costs more time than you'll save by changing your I/O method.

This is an example of the programming "sin" that Knuth wrote about:

The real problem is that programmers have spent far too much time worrying about efficiency in the wrong places and at the wrong times; premature optimization is the root of all evil (or at least most of it) in programming.

(see the article here for the cite, quote, and explanation)

So my advice is to pick the approach that is the most understandable in the context of the rest of the script's code. Updating and maintaining the script will be easier, and that saves more time and effort than you'll ever get from tweaking pipelines, redirect operators, and while loops.
