Man page for ssh-agent specifically documents the following: "The agent exits automatically when the command given on the command line terminates." But actual behaviour differs from that documented. Steps to reproduce: $ ps -ef | grep agent user 20020 18231 0 17:11 pts/0 00:00:00 grep agent $ ssh-agent /bin/bash $ ps -ef | grep agent user 20024 20023 0 17:11 ? 00:00:00 ssh-agent /bin/bash user 20050 20023 0 17:12 pts/0 00:00:00 grep agent $ exit exit $ ps -ef | grep agent Expected result: user 20054 18231 0 17:12 pts/0 00:00:00 grep agent $ Actual result: user 20024 1 0 17:11 ? 00:00:00 ssh-agent /bin/bash user 20054 18231 0 17:12 pts/0 00:00:00 grep agent $ ssh-agent process doesn't terminate. Reproduced on 2 environments: 1. OpenCYGWIN_NT-10.0 3.0.7(0.338/5/3) 2019-04-30 18:08 x86_64 CygwinSSH_8.1p1, OpenSSL 1.1.1d 10 Sep 2019 2. OpenSSH_8.0p1, OpenSSL 1.1.1c FIPS 28 May 2019 Linux 4.18.0-147.3.1.el8_1.x86_64 #1 SMP Fri Jan 3 23:55:26 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux CentOS Linux 8 (Core)
I suspect this might be fallout from the SA_RESTART change. You can confirm this by "./configure --with-cflags=-DNO_SA_RESTART [other options]". From local testing, it looks like ssh-agent still exits after at most 10 seconds after the child exits, does that match what you've seen?
1. OpenCYGWIN_NT-10.0 3.0.7(0.338/5/3) 2019-04-30 18:08 x86_64 CygwinSSH_8.1p1, OpenSSL 1.1.1d 10 Sep 2019 Cygwin version remains live forever. 2. OpenSSH_8.0p1, OpenSSL 1.1.1c FIPS 28 May 2019 Linux 4.18.0-147.3.1.el8_1.x86_64 #1 SMP Fri Jan 3 23:55:26 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux CentOS Linux 8 (Core) Linux version do exit after some time.
> CygwinSSH_8.1p1 NO_SA_RESTART was added in 8.3p1 (https://github.com/openssh/openssh-portable/commit/6c6072ba8), in versions prior to that it's a no-op. > OpenSSH_8.0p1, OpenSSL 1.1.1c FIPS 28 May 2019 > Linux 4.18.0-147.3.1.el8_1.x86_64 The SA_RESTART change first appeared in 8.2p1 (https://www.openssh.com/txt/release-8.2) Perhaps it's not SA_RESTART.
(In reply to Serge B from comment #2) > 1. OpenCYGWIN_NT-10.0 3.0.7(0.338/5/3) 2019-04-30 18:08 x86_64 > CygwinSSH_8.1p1, OpenSSL 1.1.1d 10 Sep 2019 > > Cygwin version remains live forever. OK, the reason for that is in check_parent_exists(): { /* * If our parent has exited then getppid() will return (pid_t)1, * so testing for that should be safe. */ if (parent_pid != -1 && getppid() != parent_pid) { /* printf("Parent has died - Authentication agent exiting.\n"); */ cleanup_socket(); _exit(2); } In this mode, when the ssh-agent forks the agent itself continues as the child while the parent execs the requested command which is unusual. I'm not entirely sure why this is, but I suspect it's so job control works as expected on the executed command. Based on local testing, on Cygwin getppid() continues to return the original process id even if that process has exited, instead of being "reparented" to init as it would be on a Unixlike system. The best solution we could come up with was to fall back to "kill(getppid(), 0)" on Cygwin to test for parent process existence, however that'll be wrong in the case of pid reuse. Corinna, do you have any better ideas?
Created attachment 3412 [details] Fallback to using a signal to see if parent process is still alive on Cygwin.
Oh, wait! Before applying a potentially dangerous patch to OpenSSH, did anybody actually test this on the latest Cygwin 3.1.5? I tried the setup from the original report with the latest from the Cygwin distro: - Cygwin 3.1.5 - OpenSSH 8.3p1 - OpenSSL 1.1.1f and I can't reproduce this behaviour. In my case ssh-agent exits after not more than 10 secs. If one of you still can reproduce this problem in the above setup, with the latest of everything from the Cygwin repo, then before patching OpenSSH, let me check if this isn't a regression in Cygwin's process handling. There's some history here in terms of ssh-agent. An 8 years old patch in Cygwin is handling exactly this problem. However, there were some recent changes in Cygwin which might have broken this behaviour. So, if you still ca reporoduce under Cygwin 3.1.5 and OpenSSH 8.3p1, do you have a simple testcase in plain C, by any chance, to reproduce the getppid problem and to simplify debugging? Thanks, Corinna
I can confirm. The issue doesn't reproduce with fresh version of cygwin. - Cygwin 3.1.6 - OpenSSH 8.3p1 - OpenSSL 1.1.1f
(In reply to Corinna Vinschen from comment #6) > Oh, wait! > > Before applying a potentially dangerous patch to OpenSSH, did anybody > actually test this on the latest Cygwin 3.1.5? Well I didn't apply this potentially dangerous patch because I wasn't sure about it, which is why I asked for your opinion in comment #4 :-) I did, however, add a regression test for this particular behaviour: https://github.com/openssh/openssh-portable/commit/5dba1fcabacaab46693338ec829b42a1293d1f52. And while I do occasionally update Cygwin on my test VM, I didn't think to do it in this case since I'd done it somewhat recently and it's not the kind of behaviour that usually changes. I updated it and indeed, the test now passes: $ make t-exec LTESTS=agent-subprocess [...] run test agent-subprocess.sh ... ok agent subprocess make[1]: Leaving directory '/home/builder/openssh.anongit/regress' all t-exec passed Thanks for the help and sorry for the noise.
closing resolved bugs as of 8.6p1 release