I am trying to run a function once for each file in a large list that I am piping in.
Here is some example code; it just greps the files whose names arrive on stdin.
In my real code I am running a program that takes significant time to process each file but only generates output for some of them, so grep is a reasonable stand-in.
Ideally I would also like to capture the exit code from the program, to know which files failed to process, and have the program's stderr echoed to the terminal's stderr.
#!/bin/bash
searchterm="$1"
filelist=$(cat /dev/stdin)                # reads ALL of stdin before anything starts
numfiles=$(echo "$filelist" | wc -l)
currfileno=0
while IFS= read -r file; do
    ((++currfileno))
    echo -ne "\r\033[K" 1>&2              # clears the status line
    echo -ne "$currfileno/$numfiles $file" 1>&2
    grep "$searchterm" "$file"
done <<< "$filelist"
I saved this as test_so_stream, and I can run it with find ~ -type f -iname \*.txt | test_so_stream searchtext.
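Capturing the per-file exit code is the easy part; inside the loop I could do something like this (failed_files.log is just a placeholder name):

grep "$searchterm" "$file"
status=$?
if (( status > 1 )); then                     # for grep, 1 only means "no match"; >1 is a real error
    echo "$status $file" >> failed_files.log  # hypothetical log of files that failed
fi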
The problem is that when I run it on a large list of thousands of files, nothing starts processing at all until the entire list is loaded, which can take significant time.
What I would like to happen is for it to start processing the first file immediately as soon as the first filename appears on stdin.
I know I could read from the pipe directly, but I also want the status line (including the current file number and the total number of files) updated on stderr after each file is processed, or every second or so.
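For example, reading stdin directly starts on the first file immediately, but then I only have the current count, not the total:

currfileno=0
while IFS= read -r file; do
    ((++currfileno))
    echo -ne "\r\033[K$currfileno/??? $file" 1>&2   # total is unknown here
    grep "$searchterm" "$file"
done   # reads stdin directly, so work starts as soon as the first name arrives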
I presume I need some kind of concurrency, reading the list separately from the actual worker process(es), but I'm not sure how to achieve that in bash.
Bonus points if I can process multiple files at once in a worker pool. I do not want the output from multiple files to be intermingled, though: I need the full output of one file, then the full output of the next, and so on. This is low priority for me if it's complicated, and it is not the focus of my question.
I have tried parallel and xargs. I know parallel at least can process multiple files at once, and in fact it gets very close to what I want, even keeping the output unmingled, but I still can't work out how to update a status line at the same time so I know how far through the list of files it is. I know about parallel's --bar option, but it is too ugly for my taste and not customizable (I would like the status bar to have colors and to show the filename being processed).
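This, for instance, is very close: -k (--keep-order) keeps each file's output contiguous and in order, and --bar gives the progress display I described:

find ~ -type f -iname \*.txt | parallel -k grep searchtext
find ~ -type f -iname \*.txt | parallel -k --bar grep searchtext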
How can I achieve this?
Edit, to answer @markp-fuso's questions in the comments:
I know that stderr/stdout both show on the same terminal.
I would like the status bar to go to stderr so that I can pipe the program's entire stdout somewhere to save and further process it. I won't be saving stderr; that's just so I can watch the program while it's working. My example program already does this: it shows the status and keeps overwriting that line until there is some output. In my full program it clears the status line and overwrites it with the output, if there is output for that file. I omitted the output check and the line clear from my example because that isn't the important part of the question.
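In the full program that part looks roughly like this (myprogram stands in for the real slow processor):

output=$(myprogram "$file")       # myprogram is a placeholder for the real processor
if [[ -n "$output" ]]; then
    echo -ne "\r\033[K" 1>&2      # clear the status line
    printf '%s\n' "$output"       # replace it with the output; status is redrawn next iteration
fi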
Re: the status bar not knowing the total number of files: I want it to show the current total and update it as more filenames are piped in, like pv does. I imagine one process that loads a global filelist from stdin and echoes the status bar to stderr every second, while another process simultaneously loops through that list, processing each file. The problem I'm trying to avoid is that the parent process does not know the total number of files up front; generating the entire list takes significant time, and I would like processing to start immediately.
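A rough sketch of what I mean, though subshells can't share variables, which is exactly where I'm stuck:

tmplist=$(mktemp)
cat > "$tmplist" &                 # reader: the filelist grows as stdin arrives
while sleep 1; do                  # status: redraw once a second with the total so far
    echo -ne "\r\033[K?/$(wc -l < "$tmplist")" 1>&2
done &
# worker: loop over "$tmplist" as it grows... but how do I share the
# current file number and filename with the status process above?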
Perhaps calling it a status bar is overstating what I mean. I just want something showing how far through the list of files it is and which file it is currently processing. Nothing super fancy, but I want it in color so it stands out on the console from the stdout data: one colored line at the bottom that is continuously overwritten, showing me that it is still working.
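For example (assuming plain ANSI escapes are acceptable):

echo -ne "\r\033[K\033[1;33m$currfileno/$numfiles\033[0m \033[36m$file\033[0m" 1>&2   # yellow count, cyan filename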
"if you manage to spawn 3 parallel threads, how exactly do you envision their outputs and status bar(s) being displayed in a single console/terminal window?"
Exactly like cat filelist | parallel grep searchterm does, i.e. the grep output for each file shown consecutively, not intermingled. The status bar can appear anywhere (because I'm not saving that), although I would rather it appeared in between the output: if there is more grep output, it should overwrite the status line at the bottom, then the status line is redrawn below it, and the cycle continues. The status line is just continually getting overwritten to show me which file it is up to.