Bug 2598 - ssh-agent very occasionally won't remove keys or certs despite now() >= lifetime
Summary: ssh-agent very occasionally won't remove keys or certs despite now() >= lifetime
Status: CLOSED INVALID
Alias: None
Product: Portable OpenSSH
Classification: Unclassified
Component: ssh-agent (show other bugs)
Version: 6.9p1
Hardware: amd64 Mac OS X
: P5 minor
Assignee: Assigned to nobody
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2016-07-16 05:23 AEST by Peter Moody
Modified: 2016-08-02 10:41 AEST (History)
1 user (show)

See Also:


Attachments
add debugging to ssh-agent (1.52 KB, patch)
2016-07-17 19:49 AEST, Darren Tucker
no flags Details | Diff
add debugging to ssh-agent (1.53 KB, patch)
2016-07-19 14:02 AEST, Darren Tucker
no flags Details | Diff
ssh-agent log (10.48 KB, application/octet-stream)
2016-07-20 01:21 AEST, Peter Moody
no flags Details

Note You need to log in before you can comment on or make changes to this bug.
Description Peter Moody 2016-07-16 05:23:30 AEST
apologies for the vagueness of this report.

I add these ssh certs (and keys) to the ssh-agent with a lifetime set to when the cert will expire, eg. 72k seconds. Very occasionally, an ssh-agent process won't actually remove the cert when the timer expires. These are exclusively laptops so my first thought was that maybe the laptop was asleep when the timer expired, but I've had a look through the ssh-agent code and it looks like reaper() checks now >= death for every entry. I've also been able to run 'ssh-add -l' (which looks like it forces a call to reaper, presumably expiring all keys with now >= death), and the certs still aren't removed.

Is my assumption wrong about reaper() being called every time 'ssh-add -l' is invoked? If it is called every time, is there anyway short of id->death getting set to 0 that a key could dodge removal? I guess it's possible that my ca is actually adding a lifetime that's much longer than I think it is, but I suspect I'd see a lot more if this if that were the case.

I'm totally confused. :-/
Comment 1 Darren Tucker 2016-07-16 16:15:36 AEST
WRT suspends: we changed the timer to CLOCK_MONOTIME in this commit:

https://anongit.mindrot.org/openssh.git/patch/?id=b759c9c2

so it'll do the right thing over clock steps.  We added CLOCK_BOOTTIME in

https://anongit.mindrot.org/openssh.git/patch/?id=795b8631

but that was added in 6.7, so it shouldn't be a problem with the 6.9 version you're using.

reaper is being called when you run ssh-add -l, but after the request is process and before it runs select() again.  I could imagine a timer expiring while the system is asleep and the key not being removed until after the next time it processes a request.  If that's the case, the key would only show up the first time you ran ssh-add -l.

Failing that, if we add some debugging could you capture the logs for one of the problem cases?
Comment 2 Darren Tucker 2016-07-16 16:22:55 AEST
oh, mac os.  Does that have CLOCK_BOOTTIME?
Comment 3 Peter Moody 2016-07-17 02:59:37 AEST
osx doesn't appear to have clock_gettime(2), and I can't find any references to CLOCK_BOOTTIME in /usr/include. if monotime() is just using time(2), does that help isolate the issue?

If you give me a patch and an idea of what steps you think might tickle this bug (eg. multiple keys, some constrained and some not, keys expiring when the laptop is sleeping, etc), I'd be happy to test.

I do have one user that this has happened to twice in the last week and I might be able get him to replace his ssh-agent and see if something about his regular workflow just tickles this bug.
Comment 4 Peter Moody 2016-07-17 03:03:35 AEST
one other thing. If I were to run ssh-agent -d and then suspend and resume the process, does that make ssh-agent behave in a similar manner to if the machine hibernated? Or do you think there might be something special about being flushed to disk, etc? I'm just trying to think of ways to test this on a machine which so far hasn't had this happen.

Cheers,
peter
Comment 5 Darren Tucker 2016-07-17 18:49:11 AEST
(In reply to Peter Moody from comment #3)
> osx doesn't appear to have clock_gettime(2), and I can't find any
> references to CLOCK_BOOTTIME in /usr/include. if monotime() is just
> using time(2), does that help isolate the issue?

maybe, I'll have a think about it.

> If you give me a patch and an idea of what steps you think might
> tickle this bug (eg. multiple keys, some constrained and some not,
> keys expiring when the laptop is sleeping, etc), I'd be happy to
> test.
> 
> I do have one user that this has happened to twice in the last week
> and I might be able get him to replace his ssh-agent and see if
> something about his regular workflow just tickles this bug.

(In reply to Peter Moody from comment #4)
> one other thing. If I were to run ssh-agent -d and then suspend and
> resume the process, does that make ssh-agent behave in a similar
> manner to if the machine hibernated?

It's worth a try but my guess is that it won't.

> Or do you think there might be
> something special about being flushed to disk, etc? I'm just trying
> to think of ways to test this on a machine which so far hasn't had
> this happen.

I doubt it's disk flushes have anything to do with it (I don't think ssh-agent even includes stderr in the descriptors it selects on.

A couple of questions:
 - when it happens, if you run ssh-add -l twice are the keys present in both?
 - is there anything else going on with clocks, eg ntpd?  if so, are there any clock steps logged?
Comment 6 Darren Tucker 2016-07-17 19:49:38 AEST
Created attachment 2852 [details]
add debugging to ssh-agent

This patch against 7.2p2 adds some more debugging to ssh-agent, including outputting the current time() and monotime() values on every log line.

I'm not sure what to look for so I'm not sure if it'll be enough.
Comment 7 Peter Moody 2016-07-19 03:30:52 AEST
Thanks, Darren. I'm running the patched ssh-agent now.

fwiw, I apparently *can* repro this on my machine (I'd only gotten reports of this from other people before).

this is on my system-provided agent.

$ usshcertstatus
ussh cert good for -50h-16m

waiting for the cert/key to expire on the patched version now

$ env SSH_AUTH_SOCK=/tmp/ssh.sock usshcertstatus
ussh cert good for 19h56m


>  - when it happens, if you run ssh-add -l twice are the keys present in both?
yes

$ ssh-add -l
2048 SHA256:xXX0cRWdec7IA43C0cSF+Y9JrKul2JBzgXk28NMLfEU [Valid until Sat 16 Jul 2016 15:01 UTC, Version 2] (RSA-CERT)
2048 SHA256:xXX0cRWdec7IA43C0cSF+Y9JrKul2JBzgXk28NMLfEU [Valid until Sat 16 Jul 2016 15:01 UTC, Version 2] (RSA)

$ ssh-add -l
2048 SHA256:xXX0cRWdec7IA43C0cSF+Y9JrKul2JBzgXk28NMLfEU [Valid until Sat 16 Jul 2016 15:01 UTC, Version 2] (RSA-CERT)
2048 SHA256:xXX0cRWdec7IA43C0cSF+Y9JrKul2JBzgXk28NMLfEU [Valid until Sat 16 Jul 2016 15:01 UTC, Version 2] (RSA)

$ ssh-add -l
2048 SHA256:xXX0cRWdec7IA43C0cSF+Y9JrKul2JBzgXk28NMLfEU [Valid until Sat 16 Jul 2016 15:01 UTC, Version 2] (RSA-CERT)
2048 SHA256:xXX0cRWdec7IA43C0cSF+Y9JrKul2JBzgXk28NMLfEU [Valid until Sat 16 Jul 2016 15:01 UTC, Version 2] (RSA)

$ ssh-add -l
2048 SHA256:xXX0cRWdec7IA43C0cSF+Y9JrKul2JBzgXk28NMLfEU [Valid until Sat 16 Jul 2016 15:01 UTC, Version 2] (RSA-CERT)
2048 SHA256:xXX0cRWdec7IA43C0cSF+Y9JrKul2JBzgXk28NMLfEU [Valid until Sat 16 Jul 2016 15:01 UTC, Version 2] (RSA)

$ usshcertstatus
ussh cert good for -50h-21m


>  - is there anything else going on with clocks, eg ntpd?  if so, are there any clock steps logged?

I don't see any likely ntp errors in the logs. The only slight weirdness with my setup here is that I believe my laptop was asleep for most of the weekend. it looks like this key/cert pair should've been removed on 16 July at ~08.25 my latop diagnostic logs go from 16 July at 4.57 to 16 July 9.44. time(NULL) couldn't wrapping around, could it  .. ?

Anyway, I'll let you know the results from the instrumented ssh-agent. Thanks!
Comment 8 Peter Moody 2016-07-19 04:39:48 AEST
adding a key with a 60 second lifetime and suspending for 40 minutes didn't work.

debug1[1468864204,1468864204]: type 25
debug3[1468864204,1468864204]: constrain lifetime key '/Users/pmoody/.ssh/uber_rsa' seconds 60 death time 1468864264
debug3[1468864204,1468864204]: key '/Users/pmoody/.ssh/uber_rsa' death time 1468864264
debug3[1468864204,7237125641618659020]: reaper deadline 60
debug3[1468864204,140734662252236]: key '/Users/pmoody/.ssh/uber_rsa' death time 1468864264
debug3[1468864204,140734662252236]: key '/Users/pmoody/.ssh/uber_rsa' death time 1468864264
debug3[1468864204,3419202321690465996]: reaper deadline 60
debug3[1468864204,140734662252236]: key '/Users/pmoody/.ssh/uber_rsa' death time 1468864264
debug1[1468864204,3419202321690465996]: XXX shrink: 3 < 4
debug3[1468864204,140734662252236]: key '/Users/pmoody/.ssh/uber_rsa' death time 1468864264
debug3[1468864204,3419202321690465996]: reaper deadline 60
'./ssh-agent -d -a /tmp/ssh.sock' has stopped

$ fg
Send job 1, './ssh-agent -d -a /tmp/ssh.sock' to foreground
debug1[1468866660,140734662254692]: expiring key '/Users/pmoody/.ssh/uber_rsa'
debug3[1468866660,8747515637889048676]: reaper deadline 0

I'll let you know what the long lived key does tomorrow.
Comment 9 Darren Tucker 2016-07-19 13:47:46 AEST
(In reply to Peter Moody from comment #7)
[...]
> this is on my system-provided agent.

Note that if your problem occurs only with the ssh-agent supplied with the OS, Apple makes a number of changes to ssh-agent to make it work with launchd and the keychain.  The sources are apparently here: http://opensource.apple.com//source/OpenSSH/ although I don't know how to map what you might be running to the corresponding tree.

> $ usshcertstatus
> ussh cert good for -50h-16m

what's "usshcertatus" ?

[...]
> Anyway, I'll let you know the results from the instrumented
> ssh-agent. Thanks!

Good luck.
Comment 10 Darren Tucker 2016-07-19 14:00:52 AEST
(In reply to Peter Moody from comment #8)
> debug3[1468864204,7237125641618659020]: reaper deadline 60

Sigh.  I missed a cast so the second time value is bogus.  I think from the earlier ones the values should always be equal so it's not worth stopping your experiment for, but if you run another you might want to use this updated diff.
Comment 11 Darren Tucker 2016-07-19 14:02:53 AEST
Created attachment 2854 [details]
add debugging to ssh-agent
Comment 12 Peter Moody 2016-07-20 01:20:06 AEST
The key didn't expire this morning with vanilla 6.9p1. attaching the agent.log now. I'll try your new patch against HEAD.

$ env SSH_AUTH_SOCK=/tmp/ssh.sock ssh-add -l
2048 SHA256:6kWrXTlSBCyHqfiPJHm5teVHIP0wH+RDWrVFk3FB4DE [Valid until Tue 19 Jul 2016 13:17 UTC, Version 2] (RSA-CERT)
2048 SHA256:6kWrXTlSBCyHqfiPJHm5teVHIP0wH+RDWrVFk3FB4DE [Valid until Tue 19 Jul 2016 13:17 UTC, Version 2] (RSA)

$ env SSH_AUTH_SOCK=/tmp/ssh.sock ssh-add -l
2048 SHA256:6kWrXTlSBCyHqfiPJHm5teVHIP0wH+RDWrVFk3FB4DE [Valid until Tue 19 Jul 2016 13:17 UTC, Version 2] (RSA-CERT)
2048 SHA256:6kWrXTlSBCyHqfiPJHm5teVHIP0wH+RDWrVFk3FB4DE [Valid until Tue 19 Jul 2016 13:17 UTC, Version 2] (RSA)

$ date -u
Tue Jul 19 14:20:00 UTC 2016

# about 30 minutes later
$ env SSH_AUTH_SOCK=/tmp/ssh.sock usshcertstatus
ussh cert good for -1h-56m

> what's "usshcertatus" ?

we call the ssh cert stuff, "ussh" (uber-ssh). usshcertstatus prints ssh-keygen -L -f type information for any certs signed by our CA. sort of like prodcertstatus for prodaccess.
Comment 13 Peter Moody 2016-07-20 01:21:34 AEST
Created attachment 2855 [details]
ssh-agent log
Comment 14 Peter Moody 2016-07-20 01:43:15 AEST
gah, this could all be PEBKAC. it looks like my non-expiring keys are all being added as unconstrained (type 17) for some reason.
Comment 15 Peter Moody 2016-07-20 02:05:59 AEST
Alright, this can be closed. the bug is somewhere I my code :(

sorry!
Comment 16 Damien Miller 2016-08-02 10:41:39 AEST
Close all resolved bugs after 7.3p1 release