
I am trying to run a function once for each file from a large list that I am piping in.

Here is some example code; it just greps the files whose names come in on stdin.

In my real code I am running a program that takes significant time to process each file but only generates output for some of the files it processes, so grep is a comparable stand-in.

I would also ideally like to capture the error code from the program to know which files failed to process and see the stderr output from the program echoed to the terminal's stderr.

#!/bin/bash

searchterm="$1"

filelist=$(cat /dev/stdin)
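# note: this reads ALL of stdin before the loop below starts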

numfiles=$(echo "$filelist" | wc -l)
currfileno=0

while IFS= read -r file; do
    ((++currfileno))    
    echo -ne "\r\033[K" 1>&2 # clears the line
    echo -ne "$currfileno/$numfiles $file" 1>&2
    grep "$searchterm" "$file"
done <<< "$filelist"

I saved this as test_so_stream, and I can run it with find ~ -type f -iname \*.txt | test_so_stream searchtext.

The problem is that when I run it on a large list of thousands of files, nothing starts processing at all until the entire list is loaded, which can take significant time.

What I would like to happen is for it to start processing the first file immediately as soon as the first filename appears on stdin.

I know I could use a pipe for this, but I also would like the statusline (including the current file number and the total number of files) printed to stderr and updated after each file is processed, or every second or so.

I presume I'd need some kind of multithreading to process the list separately from the actual worker process/es, but I'm not sure how to achieve that using bash.

Bonus points if I can process multiple files at once in a worker pool, although I do not want the output from multiple files to be intermingled: I need the full output of one file, then the full output of the next, and so on. This is a low priority for me if it's complicated, and it is not the focus of my question.

I have tried to use parallel and xargs, and I know parallel at least can process multiple files at once; in fact it comes very close to what I want, even keeping the output from different files separate. But I still can't work out how to have the status line updated at the same time so I know how far through the list of files it is. I know about the --bar option of parallel but it is too ugly for my taste and not customizable (I would like the status bar to have colors and show the filename being processed).
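For reference, this is roughly the kind of parallel invocation I have been experimenting with; as far as I know, GNU parallel groups each job's output by default and -k (--keep-order) keeps the results in input order:

# grep each file; output stays grouped per file, -k keeps input order
find ~ -type f -iname \*.txt | parallel -k grep searchtext {}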

How can I achieve this?

Edit to answer @markp-fuso's questions in the comments:

I know that stderr/stdout both show on the same terminal.

I would like the status bar to go to stderr so I can pipe the entire output from the program somewhere to save and further process it. When I do this I will not be saving the stderr; that's just so I can watch the program while it's working. My example program does do this: it shows the status and keeps overwriting that line until there is some output. In my full program it clears the status line and overwrites it with the output, if there is output for that file. I omitted the check for output and the line clear from my example program because that's not the part of the question that's important to me.

Re: the status bar not knowing the total number of files, I want the status bar to show the current total number of files and update it as more are piped in, e.g. like pv does. I imagine having one process that loads a global filelist from stdin and echoes the status bar to stderr every second, while another process simultaneously loops through that global filelist, processing every file. The problem I'm trying to avoid is that the parent process does not know the total number of files immediately; it takes significant time to generate the entire list, and I would like my processing to start immediately.
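To illustrate, something like this rough, untested sketch is what I imagine; the spool file and the .done flag are just placeholder names I made up:

#!/bin/bash
searchterm="$1"
spool=$(mktemp)
trap 'rm -f "$spool" "$spool.done"' EXIT

exec 3<&0    # keep a handle on stdin for the background reader

# reader: spool incoming filenames to a temp file, then drop a "done" flag
{ cat <&3 > "$spool"; touch "$spool.done"; } &

done_count=0
while :; do
    total=$(wc -l < "$spool")
    if (( done_count < total )); then
        file=$(sed -n "$((done_count + 1))p" "$spool")
        ((done_count++))
        printf '\r\033[K%d/%d %s' "$done_count" "$total" "$file" 1>&2
        grep "$searchterm" "$file"
    elif [[ -e "$spool.done" ]]; then
        break        # the reader has finished and everything has been processed
    else
        sleep 0.2    # wait for more filenames to arrive
    fi
done
printf '\n' 1>&2

I realise this only refreshes the status line once per file rather than every second or so, which is part of what I can't work out.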

Perhaps calling it a status bar may be overstating what I mean. I just want to be able to see something showing how far it is through the list of files, and which file it is currently processing. Nothing super fancy but I want it to be in color so it stands out on the console from the stdout data. One colored line at the bottom that is continuously overwritten to show me that it is still working.

"if you manage to spawn 3 parallel threads, how exactly do you envision their outputs and status bar(s) being displayed in a single console/terminal window?"

Exactly like cat filelist | parallel grep searchterm does, i.e. the grep output for each file shown consecutively, not intermingled. The status bar can appear anywhere (because I'm not saving that), although I would rather it appeared in between the output: if there is another piece of grep output, it should overwrite the status line at the bottom, then more status line, and the cycle continues. So the status line is just continually overwritten to show me which file it's up to.

  • if the total number of files (N) is not known until you read all of stdin, then how do you expect to immediately start processing stdin and print a status bar of 1/N <filename>? does the parent process know, in advance, the number of files to be processed? Commented Mar 26 at 16:47
  • have you tried running: your two echo -ne calls, echo 'grep output', your two echo -ne calls; you should see 3 lines of output; assuming you want the 'status bar' to remain in one place (eg, bottom of console/terminal window) then you're going to need to incorporate some sort of cursor/curses processing (eg, tput) to allow for the explicit placement of output in the console/terminal window Commented Mar 26 at 16:47
  • your code and comments seem to imply a belief that stdout and stderr are somehow printed to different areas of the console/terminal window; this is not true; both (stdout and stderr) are printed to the current location of the cursor; while you can dump stdout/stderr to different areas you'll (again) need to add cursor/curses processing calls Commented Mar 26 at 16:47
  • if you manage to spawn 3 parallel threads, how exactly do you envision their outputs and status bar(s) being displayed in a single console/terminal window? how do you keep the output of 3 threads from being interspersed/scrambled in the console/terminal window? are you expecting 3 separate status bars or a single status bar that reads something like (1,7,24)/N <file1> <file7> <file24> ... keeping in mind the next question of how would you build this single status bar from the current processing status of 3 parallel threads (aka subshells) Commented Mar 26 at 16:47
  • at this point there are a lot of unknowns about how the 'status bar' is supposed to behave; it's not clear, from what we've been told, if a lot of details have been left out or if we're dealing with an incomplete design/requirement Commented Mar 26 at 16:48

2 Answers


I'm not 100% clear on all of OP's requirements, so I'm going to focus on an approach that sends a status line to stderr and stdout to a file. Hopefully this will get OP a bit closer to the final goal ...

Assumptions/understandings:

  • one program is generating a list of files (we'll call this gen_output; filenames are output-#)
  • this output needs to be split and fed as stdin to two different programs ...
  • one program counts the number of files read from stdin (we'll call this count_input) and prints the new count to the file counter
  • one program processes the files read from input while also generating a status bar (we'll call this process_input)
  • the status bar should be a count of processed files plus the 'count' (from count_input) at that point in time, plus the current file being processed
  • the status bar is printed to stderr
  • the process_input stdout is written to file process_input.stdout

The 3 programs:

######################### generate 10 outputs at 0.5 second intervals

$ cat gen_output
#!/bin/bash

for ((i=1;i<=10;i++))
do
    echo "output-$i"
    sleep .5
done

######################### for each input update a counter and overwrite file 'counter'

$ cat count_input
#!/bin/bash

count=0

while read -r input
do
    ((count++))
    echo "${count}" > counter
done

######################### for each input read current total from file 'counter' and then print status line

$ cat process_input
#!/bin/bash

touch counter
count=0
cl_eol=$(tput el)             # clear to end of line

while read -r input
do
    ((count++))
    read -r total < counter

    printf "\rprocessing %s/%s %s%s" "${count}" "${total}" "${input}" "${cl_eol}" >&2
    echo "something to stdout - ${count} / ${total}"
    sleep 2
done > process_input.stdout

printf "\nDone.\n" >&2

Using tee with process substitution to feed a copy of gen_output's stream to process_input (so it starts working as soon as the first filename arrives) while the original stream is piped on to count_input:

$ ./gen_output | tee >(./process_input) | ./count_input

I've got a .gif of this in action but SO is not allowing me to upload the image at this time, so imagine the following lines being displayed one at a time at 2-second intervals, each overwriting the previous line:

processing 1/1 output-1
processing 2/4 output-2
processing 3/8 output-3
processing 4/10 output-4
processing 5/10 output-5
processing 6/10 output-6
processing 7/10 output-7
processing 8/10 output-8
processing 9/10 output-9
processing 10/10 output-10

And then a new line is displayed:

Done.

And the stdout:

$ cat process_input.stdout
something to stdout - 1 / 1
something to stdout - 2 / 4
something to stdout - 3 / 8
something to stdout - 4 / 10
something to stdout - 5 / 10
something to stdout - 6 / 10
something to stdout - 7 / 10
something to stdout - 8 / 10
something to stdout - 9 / 10
something to stdout - 10 / 10


7 Comments

Interesting solution using three programs! Makes sense. Is there a way to do it all in a single program, without having to write the counter file to disk? I'm not sure how I should "package" 3 scripts neatly together on my system; usually I put all my scripts in a ~/bin folder. What is the purpose of the touch counter line? Also, is there a chance of a race condition between writing and reading the counter file? Thanks again. I'll leave this for a few days and then accept your answer.
I was able to make it one file by making process_input and count_input into functions contained in one script that calls cat /dev/stdin | tee >(process_input) | count_input. Thanks so much for the idea. Now my only (academic) question is: is there a way to keep the counter in RAM rather than saving it to disk?
really just two programs, the 1st program (gen_output) is a stand-in for generating a list of 'files' over a period of time; yeah, functions work, too; touch counter is there to ensure the file exists before reading from it, otherwise you'll get an error when trying to read from a non-existent file; as for RAM vs file, you're now stepping into a different issue ... interprocess communications ... whole 'nother topic with lots of possible solutions (sockets, signals, (un)named pipes, dynamic file descriptors, message queues, database) ...
the interprocess comms gets interesting in this case since you'll need to decide which process will maintain the in-memory counter, then plan for a query-and-answer ability between the 'counter keeper' and the (possibly multiple) process_input processes; count_input would need to remain running once stdin is drained, and it would need to handle, in essence, interrupt ('what's the current counter') requests; as for the possible race condition on writing/reading counter ... sure, the possibility exists; if you need to ensure exclusive access you could implement a locking mechanism (eg, flock)
another option would be placing the file in RAM; it would still be processed as a file but it's not sitting on a (physical) disk; here's one idea using the /run/user/<id> directory
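A minimal sketch of that idea, assuming /run/user/<uid> (or /dev/shm) is a tmpfs mount on the system; the counter path here is just an example:

# keep the counter file on a RAM-backed (tmpfs) filesystem instead of disk;
# /run/user/$UID and /dev/shm are typically tmpfs mounts on Linux
counter="/run/user/$(id -u)/so_stream_counter"

echo 42 > "$counter"            # writes land in RAM, not on a physical disk
read -r total < "$counter"      # reads work exactly as with a regular file
echo "$total"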

One way could be to define a function that prints your progress bar. Like this:

bar_full="=================================================="
bar_empty="                                                  "
len_bar=${#bar_full}

function show_bar() {
    step=$1
    step_total=$2
    progress_ticks=$(((${step} * ${len_bar})/${step_total}))
    progress_percent=$(((${step} * 100)/${step_total}))
    echo -n -e "\r|${bar_full:0:${progress_ticks}}${bar_empty:${progress_ticks}:${len_bar}}| ${progress_percent}%"
}

Then you use it in your code. As always with progress bars, you need to know how many steps you have. So let's assume you get the files from an ls command; then you could do it as in the following example:

max_value=$(ls *.txt | wc -l)
i=0
for file in $(ls *.txt) ; do
   ((i++)) 
   show_bar $i ${max_value}
   sleep 1s
done
echo

6 Comments

Thanks for the idea but this doesn't address the main part of my question which is how to get it to start processing immediately and updating the status line while processing. At the start you don't know the total number of files that will be in the stream. Your code is exactly like mine in that it needs to fetch the entire list of files before starting.
Are you printing a bar? Do I miss something? And how do you want to display a progress bar if you don't know how many lines you have? Do you want to apply AI techniques?
Imagine i=0;while true;do echo "file$((++i))";sleep 1;done | myprog - it will never end the stream of files. I want myprog to start processing files immediately and since the processing of each file might take more than a second, I want the status line to show the current total. Generating a status line is not the problem I have - starting the processing immediately and simultaneously doing the processing and showing a status line is.
Then the title of your question is misleading. You wrote that you want to have a progress bar updated, but in fact you don't care about the progress bar and just want some statistics to be displayed.
The progress bar you've demonstrated does not update while processing files simultaneously. Also there's much more than a title to the question.
