Bug 3181 - ssh-agent doesn't exit automatically after child program exits
Summary: ssh-agent doesn't exit automatically after child program exits
Status: CLOSED WORKSFORME
Alias: None
Product: Portable OpenSSH
Classification: Unclassified
Component: ssh-agent (show other bugs)
Version: 8.0p1
Hardware: All Cygwin on NT/2k/Win7-11
: P5 trivial
Assignee: Assigned to nobody
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2020-06-15 03:21 AEST by Serge B
Modified: 2021-04-23 15:10 AEST (History)
2 users (show)

See Also:


Attachments
Fallback to using a signal to see if parent process is still alive on Cygwin. (468 bytes, patch)
2020-06-19 13:56 AEST, Darren Tucker
no flags Details | Diff

Note You need to log in before you can comment on or make changes to this bug.
Description Serge B 2020-06-15 03:21:46 AEST
Man page for ssh-agent specifically documents the following:

"The agent exits automatically when the command given on the command line terminates."

But actual behaviour differs from that documented.

Steps to reproduce:

$ ps -ef | grep agent
user 20020 18231  0 17:11 pts/0    00:00:00 grep agent
$ ssh-agent /bin/bash
$ ps -ef | grep agent
user 20024 20023  0 17:11 ?        00:00:00 ssh-agent /bin/bash
user 20050 20023  0 17:12 pts/0    00:00:00 grep agent
$ exit
exit
$ ps -ef | grep agent

Expected result:
user 20054 18231  0 17:12 pts/0    00:00:00 grep agent
$

Actual result:
user 20024     1  0 17:11 ?        00:00:00 ssh-agent /bin/bash
user 20054 18231  0 17:12 pts/0    00:00:00 grep agent
$

ssh-agent process doesn't terminate.

Reproduced on 2 environments:

1. OpenCYGWIN_NT-10.0 3.0.7(0.338/5/3) 2019-04-30 18:08 x86_64 CygwinSSH_8.1p1, OpenSSL 1.1.1d  10 Sep 2019


2. OpenSSH_8.0p1, OpenSSL 1.1.1c FIPS  28 May 2019
Linux 4.18.0-147.3.1.el8_1.x86_64 #1 SMP Fri Jan 3 23:55:26 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux
CentOS Linux 8 (Core)
Comment 1 Darren Tucker 2020-06-17 19:23:15 AEST
I suspect this might be fallout from the SA_RESTART change.  You can confirm this by "./configure --with-cflags=-DNO_SA_RESTART [other options]".  From local testing, it looks like ssh-agent still exits after at most 10 seconds after the child exits, does that match what you've seen?
Comment 2 Serge B 2020-06-17 19:34:01 AEST
1. OpenCYGWIN_NT-10.0 3.0.7(0.338/5/3) 2019-04-30 18:08 x86_64 CygwinSSH_8.1p1, OpenSSL 1.1.1d  10 Sep 2019

Cygwin version remains live forever.


2. OpenSSH_8.0p1, OpenSSL 1.1.1c FIPS  28 May 2019
Linux 4.18.0-147.3.1.el8_1.x86_64 #1 SMP Fri Jan 3 23:55:26 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux
CentOS Linux 8 (Core)

Linux version do exit after some time.
Comment 3 Darren Tucker 2020-06-17 19:57:36 AEST
> CygwinSSH_8.1p1

NO_SA_RESTART was added in 8.3p1 (https://github.com/openssh/openssh-portable/commit/6c6072ba8), in versions prior to that it's a no-op.

> OpenSSH_8.0p1, OpenSSL 1.1.1c FIPS  28 May 2019
> Linux 4.18.0-147.3.1.el8_1.x86_64 

The SA_RESTART change first appeared in 8.2p1 (https://www.openssh.com/txt/release-8.2)

Perhaps it's not SA_RESTART.
Comment 4 Darren Tucker 2020-06-19 13:47:17 AEST
(In reply to Serge B from comment #2)
> 1. OpenCYGWIN_NT-10.0 3.0.7(0.338/5/3) 2019-04-30 18:08 x86_64
> CygwinSSH_8.1p1, OpenSSL 1.1.1d  10 Sep 2019
> 
> Cygwin version remains live forever.

OK, the reason for that is in check_parent_exists():
{
        /*
         * If our parent has exited then getppid() will return (pid_t)1,
         * so testing for that should be safe.
         */
        if (parent_pid != -1 && getppid() != parent_pid) {
                /* printf("Parent has died - Authentication agent exiting.\n"); */
                cleanup_socket();
                _exit(2);
        }

In this mode, when the ssh-agent forks the agent itself continues as the child while the parent execs the requested command which is unusual.  I'm not entirely sure why this is, but I suspect it's so job control works as expected on the executed command.

Based on local testing, on Cygwin getppid() continues to return the original process id even if that process has exited, instead of being "reparented" to init as it would be on a Unixlike system.  The best solution we could come up with was to fall back to "kill(getppid(), 0)" on Cygwin to test for parent process existence, however that'll be wrong in the case of pid reuse.  Corinna, do you have any better ideas?
Comment 5 Darren Tucker 2020-06-19 13:56:16 AEST
Created attachment 3412 [details]
Fallback to using a signal to see if parent process is still alive on Cygwin.
Comment 6 Corinna Vinschen 2020-06-29 20:48:56 AEST
Oh, wait!

Before applying a potentially dangerous patch to OpenSSH, did anybody
actually test this on the latest Cygwin 3.1.5?

I tried the setup from the original report with the latest from the Cygwin
distro:

- Cygwin 3.1.5
- OpenSSH 8.3p1
- OpenSSL 1.1.1f

and I can't reproduce this behaviour.  In my case ssh-agent exits after
not more than 10 secs.

If one of you still can reproduce this problem in the above setup,
with the latest of everything from the Cygwin repo, then before
patching OpenSSH, let me check if this isn't a regression in Cygwin's
process handling.  There's some history here in terms of ssh-agent.
An 8 years old patch in Cygwin is handling exactly this problem.  However,
there were some recent changes in Cygwin which might have broken this
behaviour.

So, if you still ca reporoduce under Cygwin 3.1.5 and OpenSSH 8.3p1, do you
have a simple testcase in plain C, by any chance, to reproduce the getppid
problem and to simplify debugging?


Thanks,
Corinna
Comment 7 Serge B 2020-07-11 21:11:05 AEST
I can confirm. The issue doesn't reproduce with fresh version of cygwin.

- Cygwin 3.1.6
- OpenSSH 8.3p1
- OpenSSL 1.1.1f
Comment 8 Darren Tucker 2020-07-13 16:56:41 AEST
(In reply to Corinna Vinschen from comment #6)
> Oh, wait!
> 
> Before applying a potentially dangerous patch to OpenSSH, did anybody
> actually test this on the latest Cygwin 3.1.5?

Well I didn't apply this potentially dangerous patch because I wasn't sure about it, which is why I asked for your opinion in comment #4 :-)  I did, however, add a regression test for this particular behaviour: https://github.com/openssh/openssh-portable/commit/5dba1fcabacaab46693338ec829b42a1293d1f52.

And while I do occasionally update Cygwin on my test VM, I didn't think to do it in this case since I'd done it somewhat recently and it's not the kind of behaviour that usually changes.  I updated it and indeed, the test now passes:

$ make t-exec LTESTS=agent-subprocess
[...]
run test agent-subprocess.sh ...
ok agent subprocess
make[1]: Leaving directory '/home/builder/openssh.anongit/regress'
all t-exec passed

Thanks for the help and sorry for the noise.
Comment 9 Damien Miller 2021-04-23 15:10:44 AEST
closing resolved bugs as of 8.6p1 release
Comment 10 Damien Miller 2021-04-23 15:10:47 AEST
closing resolved bugs as of 8.6p1 release