
I'm wrapping a terraform binary in a script as part of an enterprise solution. Therefore I need to take care of:

  • file log capture (separately STDOUT, STDERR for some post-processing analytics)
  • live log capture (this runs within the Jenkins job, again separate for STDOUT and STDERR)
  • PID capture (as the process needs to run in the background, and in the next step I'll add traps for SIGTERM handling)

Currently, the core construct of the script looks like this:

#!/bin/bash
...
...
terraform "$@" > >(tee "${STDOUT_LOG}") 2> >(tee "${STDERR_LOG}" >&2) & TF_PID="$!"
wait "$TF_PID"
EXIT_CODE="$?"
...
wait
exit "$EXIT_CODE"

This script is called several hundred times in one container. We've noticed it leaves zombie processes behind: the shells in which the tee commands are executed.

Adding a general wait before exiting the script doesn't help; the shell won't wait for these child processes to be reaped. I couldn't find much about the internals of process substitution; would you have a hint what might be going on here?

EDIT:

> ps aux --forest
  11295 ?        S      0:00      |       \_ /bin/bash /home/jenkins/workspace/build-1629@tmp/terraform.sh init
  11376 ?        Sl     0:01      |       |   \_ terraform init
  11377 ?        S      0:00      |       |       \_ /bin/bash /home/jenkins/workspace/build-1629@tmp/terraform.sh init
  11379 ?        S      0:00      |       |       |   \_ /usr/bin/coreutils --coreutils-prog-shebang=tee /usr/bin/tee -a /home/jenkins/workspace/build-1629/src/vnet-01/stdout.log
  11378 ?        S      0:00      |       |       \_ /bin/bash /home/jenkins/workspace/build-1629@tmp/terraform.sh init
  11380 ?        S      0:00      |       |           \_ /usr/bin/coreutils --coreutils-prog-shebang=tee /usr/bin/tee -a /home/jenkins/workspace/build-1629/src/vnet-01/stderr.log

and after a few moments:

> ps aux
...
  11377 ?        Z      0:00 [terraform.sh] <defunct>
  11378 ?        Z      0:00 [terraform.sh] <defunct>
...
  • Please clarify: are you saying that the tee subshells are not collected even after the shell that launched them terminates? Or, to put it another way: please be more specific about what you observe and how you observe it. Commented Mar 11 at 13:28
  • Hello @JohnBollinger, I've added the required details. What surprised me is that the subshells (processes 11377, 11378) are children of the terraform-1.4 binary (PID 11376), not of the parent script (PID 11295). Commented Mar 11 at 15:17
  • Thanks for the additional info. I have some follow-ups. Do the zombies eventually get cleaned up? Are enough accumulating to cause a practical problem? Commented Mar 11 at 15:52
  • What's presumably happening is that the shell forks a child to run terraform, but before exec'ing terraform, the child forks its own children for the process substitutions, analogous to the way it opens files for redirection. Commented Mar 11 at 16:02
  • If zombies accumulate, then that is probably because of unhelpful container construction/configuration. Since you mention kubelet, I guess you are using Kubernetes. See this Kubernetes issue for some discussion: github.com/kubernetes/kubernetes/issues/84210 Commented Mar 11 at 17:26
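The fork order described in the comment above can be observed directly. The following is a hypothetical, self-contained demo (assumes Linux with procps `ps`; all names are illustrative): the shell forks the background job, the job forks the `>(…)` helper while setting up redirections, and only then execs the command, so the helper shows up as a child of the command rather than of the script.

```shell
#!/usr/bin/env bash
# Demo: a process-substitution helper ends up as a child of the exec'd
# command, not of this script, because the background job forks the helper
# before exec'ing the command.
sleep 1 > >(tee /dev/null >/dev/null) &
job=$!
sleep 0.2                                 # give the job time to exec sleep
# list the children of the backgrounded command; the tee helper appears here
children=$(ps -o pid=,comm= --ppid "$job")
echo "children of $job: $children"
wait "$job"
```

Running it prints at least one child process (the helper) under the backgrounded `sleep`, mirroring the `ps aux --forest` output in the question.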

2 Answers


You can skip process substitution, roll up your sleeves, and do it yourself. That way all processes will be children of the current process, so the current process can track all of their lifetimes. Also, variables don't have to scream in UPPERCASE. I think with coproc you could get away with one fewer fifo.

{
  # set up the stdout and stderr fifos
  stdout_fifo=$(mktemp -u)
  stderr_fifo=$(mktemp -u)
  mkfifo "$stdout_fifo" "$stderr_fifo"
  trap 'rm -f "$stdout_fifo" "$stderr_fifo"' EXIT
}
{
  # start it
  terraform "$@" >"$stdout_fifo" 2>"$stderr_fifo" &
  tf_pid=$!
  tee "$STDOUT_LOG" <"$stdout_fifo" &
  tee_stdout_pid=$!
  tee "$STDERR_LOG" <"$stderr_fifo" >&2 &   # keep the live stderr stream on stderr
  tee_stderr_pid=$!
}
{
  # wait for it
  wait "$tf_pid"
  tf_exit_code="$?"
  wait "$tee_stdout_pid"
  wait "$tee_stderr_pid"
}
exit "$tf_exit_code"
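For illustration, the same fifo plumbing can be exercised with a stand-in command in place of terraform. This is a sketch under the assumption that you just want to verify the wiring; all file names here are illustrative, and the tees' terminal output is discarded to keep it quiet:

```shell
#!/usr/bin/env bash
# Sketch of the fifo pattern with a stand-in command instead of terraform.
stdout_log=$(mktemp)
stderr_log=$(mktemp)
stdout_fifo=$(mktemp -u)
stderr_fifo=$(mktemp -u)
mkfifo "$stdout_fifo" "$stderr_fifo"
trap 'rm -f "$stdout_fifo" "$stderr_fifo"' EXIT

# stand-in for `terraform "$@"`: one line to stdout, one to stderr
{ echo out; echo err >&2; } >"$stdout_fifo" 2>"$stderr_fifo" &
cmd_pid=$!
tee "$stdout_log" <"$stdout_fifo" >/dev/null &
tee "$stderr_log" <"$stderr_fifo" >/dev/null &

wait "$cmd_pid"; code=$?
wait            # the tees are direct children, so a plain wait reaps them
echo "exit=$code stdout=$(<"$stdout_log") stderr=$(<"$stderr_log")"
```

Because every helper is a direct child of the script, the final bare `wait` is sufficient to reap them all, which is exactly what the process-substitution version could not guarantee.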



I couldn't read much about the internals of process substitution, would you have a hint what might be going on here?

Your ps data shows that Bash is making the tee subshells children of the terraform-1.4 command. I can see some reasons to do that, but perhaps more reasons not to. In particular, I can imagine situations in which having children it didn't know about would make a process misbehave.*

The subshells being children of the terraform command takes them out of the original shell's sphere of responsibility, even after the terraform-1.4 parent process terminates. Thus, yes, it is to be expected that the parent shell cannot successfully wait for them. The terraform-1.4 command doesn't know about these children, so it is unlikely to try to collect them, but even if it did try, they would probably outlive it.

The issue, then, seems to be that whatever process in the container inherits responsibility for cleaning up zombies (ordinarily PID 1) is not doing that job. Evidently this is a relatively well-known issue with containers. As I observed in the comments, there is a longstanding Kubernetes issue asking for a means of mitigation. That issue also suggests how you could (re)build your container image so that it is not susceptible to this problem: give it an initial process that handles zombies (and signals) the way a Unix PID 1 is responsible for doing. I don't have experience with any of the specific minimal init programs mentioned there (tini, dumb-init), and certainly not with integrating them with Kubernetes, but that's the general direction you probably want to go.


* Unlike with people, where it's more likely to work the other way around. :-)
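For illustration, the reaping loop at the heart of such a minimal init can be sketched in bash. This is a toy, assuming bash >= 4.3 for `wait -n`; real inits like tini and dumb-init also forward signals and handle many more edge cases:

```shell
# Toy sketch of a PID-1-style reaping loop: keep calling `wait -n` so every
# terminated child -- including orphans reparented to PID 1 -- gets reaped
# instead of lingering as a zombie.
reap_as_init() {
  "$@" &                           # launch the main command
  local main_pid=$! status=0
  while kill -0 "$main_pid" 2>/dev/null; do
    wait -n
    local st=$?                    # status of whichever child exited
    # if the main command is gone after this reap, that wait collected it
    kill -0 "$main_pid" 2>/dev/null || status=$st
  done
  return "$status"
}
```

For example, `reap_as_init terraform init` would run the command while continuously collecting any children that terminate, and finally propagate the main command's exit status.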

1 Comment

Thank you for the elaboration. As a matter of fact, we're using dumb-init for SIGTERM handling in this particular container. It would be interesting to swap it for tini as a test, in case tini reacts faster. The container eats up the 4K limit on running threads over several tens of minutes, so if dumb-init is capable of reaping zombies, I see no reason yet why that hasn't been happening. BTW, it works well for SIGTERM forwarding.
