Bug 2000 - when using ssh with ControlMaster/ControlPersist, one may get zombie processes
Status: CLOSED DUPLICATE of bug 1988
Alias: None
Product: Portable OpenSSH
Classification: Unclassified
Component: ssh
Version: 5.9p1
Hardware: All All
Importance: P2 major
Assignee: Assigned to nobody
Reported: 2012-04-25 22:50 AEST by Christoph Anton Mitterer
Modified: 2018-04-06 12:26 AEST
Description Christoph Anton Mitterer 2012-04-25 22:50:58 AEST
Hi.

This is basically from:
https://dev.icinga.org/issues/2546
http://tracker.nagios.org/view.php?id=321

It was suggested there that the actual problem may be in ssh, so I am opening a bug here, too.

I already asked about this at:
https://lists.mindrot.org/pipermail/openssh-unix-dev/2012-April/030379.html
but no one had any advice.


I use Icinga/Nagios and have checks on remote hosts executed via ssh.
In order to dramatically speed checks up (from about 0,300 ms to 0,010 ms) I use ControlMaster = auto, which also causes the mux process to be spawned on the first check.
As checks are typically scheduled sequentially, I want the mux process to persist, but it should also go away automatically after some days if not re-used (e.g. when I don't check a host anymore).
So I have something like ControlPersist 2d.
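The setup described above can be sketched as a small helper that builds the ssh command line for one check. This is only an illustration: the function name, the host name and the `ControlPath` value are assumptions that do not appear in the report; only `ControlMaster=auto` and `ControlPersist=2d` come from it.

```python
def build_ssh_argv(host, remote_command, persist="2d",
                   control_path="~/.ssh/mux-%r@%h:%p"):
    """Build an ssh command line that reuses a persistent mux connection.

    control_path is a hypothetical socket location; the report does not
    say which path is used.
    """
    return [
        "ssh",
        "-o", "ControlMaster=auto",          # first connection becomes the master
        "-o", f"ControlPersist={persist}",   # master lingers this long when idle
        "-o", f"ControlPath={control_path}", # where the shared control socket lives
        host,
        remote_command,
    ]
```

For example, `build_ssh_argv("host.example", "check_load")` yields the argument list a check wrapper could pass to its process-spawning call.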

Now I stumbled across the following problem (and I'm actually not sure
whether it's an ssh issue or an Icinga/Nagios one):
The first time the check is done (which is when the mux process is
spawned) it times out.
The mux process keeps running and everything works on subsequent
checks.

The timeout is one enforced by Icinga/Nagios (60s), when it thinks the command doesn't return.

I made some checks and the following turns out to happen on the FIRST connection:
- executing the command on the remote side is actually done
- on the local side, the ssh process (or a wrapper shell script around it) becomes a zombie as soon as the remote command has been executed
- after 60s, when Icinga/Nagios enforces its timeout, the zombie goes away
- the (local) mux process continues to run

The zombie process is a child of icinga/nagios, while the mux process (which is not a zombie) is a child of init.

Any ideas why this could happen? Is there perhaps something that lets the parent processes notice that there is still a running child (i.e. the mux process)?

I'll happily try out anything needed :)

Thanks,
Chris.
Comment 1 Tomas Mraz 2012-04-26 00:16:00 AEST
Congrats, you won the B2K contest! :)
Comment 2 Christoph Anton Mitterer 2012-04-26 00:40:08 AEST
*G* What did I win?! The next release is named in my honour?! ;-P
(Actually I'd prefer help on that issue much more ;) )
Comment 3 Tomas Mraz 2012-04-26 01:04:59 AEST
I'd say that the primary issue is in nagios and/or icinga, in how it handles child process execution, collection of SIGCHLDs and waiting for the exit statuses. However, a stray stderr file descriptor that is left open in the mux process might also play a role.

You can try to redirect the stderr to /dev/null when you're running the ssh from the wrapper script.
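A minimal sketch of that suggestion, written as a Python wrapper rather than a shell script: the command is run with stderr pointed at /dev/null, so any process it leaves behind inherits /dev/null instead of the caller's stderr. The function name `run_check` is hypothetical, and the 60 second default simply mirrors the Icinga/Nagios timeout mentioned in the description.

```python
import subprocess
import sys


def run_check(argv, timeout=60):
    """Run a check command with stderr discarded; return (exit_code, stdout)."""
    result = subprocess.run(
        argv,
        stdin=subprocess.DEVNULL,
        stdout=subprocess.PIPE,
        stderr=subprocess.DEVNULL,  # background children inherit /dev/null, not our stderr
        timeout=timeout,
        text=True,
    )
    return result.returncode, result.stdout
```

Note that stdout is still a pipe here. If a background process kept that descriptor open as well, the call would block until the pipe is closed or the timeout expires, so this only covers the stderr case.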
Comment 4 Christoph Anton Mitterer 2012-05-16 05:33:04 AEST
I'd suspect you're right.

In the meantime I've made a small test prog, that just forks and does nothing else special (which ssh could potentially do) and the problem happens, too.

So I'll close the bug for now as invalid and reopen it in case anything should be needed from ssh side.

Thanks,
Chris.
Comment 5 Damien Miller 2015-08-11 23:05:38 AEST
Set all RESOLVED bugs to CLOSED with release of OpenSSH 7.1
Comment 6 Christoph Anton Mitterer 2016-04-28 09:30:21 AEST
Reopening this for now:

Could one of the developers have a look at the most recent comments at: http://tracker.nagios.org/view.php?id=321
gordonmessmer had a look at the issue and may have found the reason.

He indicated that the reason is possibly that the forked mux process inherits stdout/stderr from the invoking ssh process, and that this causes the trouble.

He also wonders whether ssh's mux process should close these as a proper fix.
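The suspected mechanism can be reproduced without ssh. In the sketch below, a stand-in child starts a long-lived background process (playing the role of the mux), prints its result and exits at once. The parent reads the child's stdout through a pipe and sees EOF only when every holder of the write end has closed it. All names and the 2 second hold time are assumptions for illustration; this is not code from ssh or Nagios.

```python
import subprocess
import sys
import time

HOLD_SECONDS = 2.0  # how long the stand-in "mux" process stays alive

# Stand-in for ssh: spawns a lingering background process, prints, exits.
CHILD_TEMPLATE = """
import subprocess, sys
redirect = subprocess.DEVNULL if {detach!r} else None  # None means inherit
subprocess.Popen(
    [sys.executable, "-c", "import time; time.sleep({hold})"],
    stdin=subprocess.DEVNULL, stdout=redirect, stderr=redirect,
)
print("check done", flush=True)
"""


def seconds_until_eof(detach):
    """Time how long the parent waits for EOF on the child's stdout pipe."""
    child_src = CHILD_TEMPLATE.format(detach=detach, hold=HOLD_SECONDS)
    start = time.monotonic()
    proc = subprocess.Popen([sys.executable, "-c", child_src],
                            stdout=subprocess.PIPE)
    proc.stdout.read()  # blocks until all copies of the write end are closed
    proc.wait()         # reap the child so it does not stay a zombie
    return time.monotonic() - start
```

With `detach=False` the background process inherits the pipe, so the read returns only after it exits, even though the direct child finished long before (and would show up as a zombie in the meantime). With `detach=True` the background process gets /dev/null instead and the read returns as soon as the direct child exits.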

If not, feel free to close the issue again as INVALID. :)
Comment 7 Damien Miller 2016-04-28 19:01:52 AEST
If ControlPersist's stderr handling is incompatible with the way nagios manages its processes then why enable it?
Comment 8 Christoph Anton Mitterer 2016-04-28 22:16:54 AEST
Uhm? Well I thought that was obvious... using SSH to remotely execute checks seems desirable because it's a) the protocol meant for doing this b) secure (unlike e.g. NRPE).

But without control channel muxing, it's much slower than e.g. NRPE, as all the auth/kex/etc. needs to be done again for every connection, which doesn't scale when you run n checks per host...
ssh with control muxing is however basically as fast as NRPE.
Comment 9 Damien Miller 2016-04-28 23:15:30 AEST
So use ControlMaster without ControlPersist?

*** This bug has been marked as a duplicate of bug 1988 ***
Comment 10 Christoph Anton Mitterer 2016-04-28 23:26:33 AEST
Hmm, AFAIU, not using ControlPersist would simply mean that when the last session is closed, the mux closes as well, right?

This is however undesirable as well for the above use-case:

First, consider hosts where only one check runs (i.e. executed via ssh)... the check would finish, the mux would close, and next time, even if that next time was only some 30 secs later (perhaps when monitoring things like CPU utilisation, via e.g. PNP4Nagios), it would have to do all the KEX/auth, etc. again. Bad.

Second, I would need to double check, but I think Icinga/Nagios/etc. serialise their checks per host, or at least try to spread them evenly.
So one would again have the situation that, without persistence, the muxing is useless.


I think the ideal solution for the above use case is to have a persistence of e.g. 5 mins or so... depending on how regular checks are run (i.e. not unlimited).
In other words, normally the connection should stay open, but if no further checks would be run (e.g. Icinga stopped) it should close eventually.
Comment 11 Damien Miller 2018-04-06 12:26:35 AEST
Close all resolved bugs after release of OpenSSH 7.7.