Bug 2492 - Incomplete output of child process
Summary: Incomplete output of child process
Status: CLOSED WORKSFORME
Alias: None
Product: Portable OpenSSH
Classification: Unclassified
Component: sshd (show other bugs)
Version: 7.1p1
Hardware: amd64 Linux
: P5 major
Assignee: Assigned to nobody
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2015-11-08 02:39 AEDT by Volth
Modified: 2016-08-02 10:40 AEST (History)
3 users (show)

See Also:


Attachments
using openssh-client (23.82 KB, image/png)
2015-11-08 02:39 AEDT, Volth
no flags Details
using Apache Mina ssh-client (15.69 KB, image/png)
2015-11-08 02:40 AEDT, Volth
no flags Details

Note You need to log in before you can comment on or make changes to this bug.
Description Volth 2015-11-08 02:39:32 AEDT
Created attachment 2747 [details]
using openssh-client

I noticed that on Ubuntu-15.10 read() inside channel_handle_rfd() may return EAGAIN after the child process is exited.
OpenSSH server considers this as an error and breaks the connection without reading the whole output.
It looks like:
http://i.imgur.com/mLMk66c.png <-- using openssh-client
http://i.imgur.com/LNdNSRq.png <-- using Apache Mina ssh client
It happens only if pty is allocated.
I managed to fix the problem with the following patch: volth@4d284a3
Would be nice to see this bug fixed one way or another as it is a very annoing bug, it breaks the workflow of automated deployment tools.

PS. If you need a playground to reproduce the bug, you may try Ubuntu-15.10 instances on Google Cloud or DigitalOcean. The bug happens more often on the cloud machines. It happens on dedicated hardware as well but not so often.
Comment 1 Volth 2015-11-08 02:40:13 AEDT
Created attachment 2748 [details]
using Apache Mina ssh-client
Comment 3 Damien Miller 2015-11-09 13:05:42 AEDT
This is problematic because it relates to a slightly-hacky fix for the hang on exit described in bug #52.

https://bugzilla.mindrot.org/show_bug.cgi?id=52

The heuristic being used is described in this comment

https://bugzilla.mindrot.org/show_bug.cgi?id=52#c23

Unfortunately, your patch reverses this fix and AFAIK would reintroduce the hang on exit. There are reproduction cases attached to the above bug.
Comment 4 Volth 2015-11-09 16:38:51 AEDT
Hm.
I am not sure if the bug #52 still relevant to the modern Linuxes where systemd puts all children of sshd forming one session into a single cgroup. I am not able to reproduce the bug #52 with the patched openssh. Even if it is still reproducible, using cgroups would be the less hacky way to fix it than aborting the session on seeing EAGAIN.
Comment 5 Marc Aurele La France 2015-11-18 08:35:52 AEDT
I'm with Damien on this one.  Your "fix" can't possibly be correct.  For one, EAGAIN == EWOULDBLOCK.  Yes, even on your "modern Linuxes" with systemd, cgroups and other bloat.

Try as I might I can't reproduce this problem with 7.1p1.  I'd say more research needed to determine if this is really a problem with openssh.
Comment 6 Volth 2015-11-18 09:08:47 AEDT
I just showed how I solved the annoying problem - by applying this patch, building a custom .deb and installing everywhere on my premises on the machines with Ubuntu 15.10.
I do not insist (and cannot insist) on including my patch into openssh.
It is completely up to you.

Yes, the fresh github's master is affected by the bug #2492.
And Ubuntu 15.10 with my patch is unaffected by the bug #52.
Comment 7 Marc Aurele La France 2015-11-18 09:28:31 AEDT
... and I'm saying something else is causing the read() to return EAGAIN when it shouldn't (i.e. there is still data available).  That's what needs further investigation in your case.
Comment 8 Volth 2015-11-18 09:38:23 AEDT
I noticed that read() returns EAGAIN many times, openssh call read() again until it returns success (or other error).
I does not cause any problem unless the heuristic is applied, I mean the line "force = c->isatty && c->detach_close && c->istate != CHAN_INPUT_CLOSED;"

If there is no pty allocated - there is no problem with EAGAIN, read() returns EAGAIN, openssd retries.
Until openssh get the exitcode of the child process - there is no problem with EAGAIN, read() returns EAGAIN, openssd retries.

The problem is only when openssh gets the the exitcode (thus set c->detach_close to true) and only since this moment EAGAIN becomes illegal. But it happens. It happens before openssh gets the the exitcode as well it happens after openssh gets the the exitcode.
Comment 9 Marc Aurele La France 2015-11-18 10:01:51 AEDT
So the child is posting the last of its data AFTER its parent has been notified of its death?  That makes no sense.  There's a timing (or buffering) issue here.
Comment 10 Volth 2015-11-18 11:18:52 AEDT
Yes.
Even of you cannot reproduce the issue in a very visible form, if you log the exchange between ssh client and server, you may notice that almost always the server sends "exit-status" message before the last piece of the data (even without pty).

If that last read() returns success, this reordering makes no harm (I tend not to see the reordering as a bug).

You can consider the reordering is a bug and try to fix it, then it will automatically fix this bug.

On Ubuntu 15.04 there is no reordering, it is something introduced in Ubuntu 15.10. Perhaps a bigger buffer.
Comment 11 Marc Aurele La France 2015-11-19 03:16:11 AEDT
I'd expect the ordering of events within the server/client conversation to vary.  I wouldn't read too much into it.

But that's not the issue here.  The problem is that the kernel is telling the parent there is no more data from the child when in fact there is.

So does 7.1p1 behave the same on 15.04?  If not, I'd say you have some kernel digging to do.
Comment 12 Volth 2015-11-19 04:10:48 AEDT
EAGAIN does not mean "there is no more data" it means there is data, please try to read it a bit later.
Comment 13 Marc Aurele La France 2015-11-19 08:55:21 AEDT
EGAIN means there is no data available at the time of the call.  But, in the case here, this contradicts the fact that the child has previously written the last of its data, closed the tty and signalled the parent about its demise.  Therefore, there >is< data available, but for whatever reason 15.10's kernel decides not to return it.

So, I'll ask (e)again: does 7.1p1 exhibit the same behaviour on 15.04 (which runs a much older kernel) or not?
Comment 14 Volth 2015-11-19 12:19:25 AEDT
I have not tried this combination.
Comment 15 Volth 2015-11-19 12:23:19 AEDT
(In reply to Marc Aurele La France from comment #13)
> EGAIN means there is no data available at the time of the call. 
> But, in the case here, this contradicts the fact that the child has
> previously written the last of its data, closed the tty and
> signalled the parent about its demise.  Therefore, there >is< data
> available, but for whatever reason 15.10's kernel decides not to
> return it.

I am not sure the data must be available.
Note that /bin/ping is not spawned by openssh directly, the data to be copied (perhaps lazily) through the chain of stdouts.

> So, I'll ask (e)again: does 7.1p1 exhibit the same behaviour on
> 15.04 (which runs a much older kernel) or not?

I have not tried this combination.
Comment 16 Volth 2015-11-19 12:30:10 AEDT
Among all Ubuntus, Debians and CoreOSes on Google Cloud Engine:
1. With the regular version of openssh the problem is only on Ubuntu 15.10 (one may say "only with Kernel 4.xx")
2. On Ubuntu 15.10 the problem is reproducible not only with regular version but also with github master and older versions (random version from github master 1, 2 and 3 years ago)
There was no reason for trying different openssh version on Ubuntu 15.04
Comment 17 Marc Aurele La France 2015-11-19 14:18:18 AEDT
1) If you are using this Bugzilla to report a problem with openssh, then it must be against a stock version of it, not one polluted with Ubuntu-isms.

2) You state the problem only occurs on 15.10, regardless of openssh version.  This makes it quite conclusive where the real issue lies, not in openssh.
Comment 18 Volth 2015-11-20 00:58:37 AEDT
The machinations between open-source companies is not what I am engaged in nor willing to.
I had a problem, solved it and shared the solution to the people who my also experience.
All the rest is not my business.
Comment 19 Volth 2015-11-20 01:01:15 AEDT
> 2) You state the problem only occurs on 15.10, regardless of openssh
> version.  This makes it quite conclusive where the real issue lies,
> not in openssh.

I would agrue that your 9-years old heuristic eventually stopped working.
You could try to make a kernel patch to keep it supported, but I am not in it.
Comment 20 Marc Aurele La France 2015-11-21 01:00:30 AEDT
I'm still investigating this, but, in the meantime, I think that, even with your change, you might find that ...

ssh -t root@localhost '/bin/bash -c "/bin/ping -c3 8.8.8.8" >&2'

... (i.e. redirecting the child's output to stderr) is also problematic.

Your change is still incorrect, for a number of reasons, but I'm not too sure yet what to replace it with.
Comment 21 Marc Aurele La France 2015-11-24 13:58:22 AEDT
I've verified that your suggested fix does indeed reintroduce the hang-on-exit bug.  Log into your patched sshd and do ...

   sleep 30 & logout

This will cause your session to hang until the sleep finishes, which is essentially the original problem reported in comment #1 of bug #52.

I still think the problem you reported here is a kernel issue and have been doing some research.  I'm curious what `uname -a` says on your system.  If this is earlier than 4.0.5, please update to at least 4.0.5 (wily's git currently sits at above 4.2.3) and see if the partial output persists.  If you're already at or above 4.0.5, then more research needs to be done.

4.0.5 is where http://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/patch/drivers/tty/pty.c?id=1a48632ffed61352a7810ce089dc5a8bcd505a60 first appeared.  This changes reportedly fixes all known issues with the PTY code that were introduced in 3.18 and before.
Comment 22 Volth 2015-11-24 15:27:56 AEDT
4.2.0-17
You may want to register on https://cloud.google.com/
There is 60-day free trial and it is possible to start virtual machines with different Linuxes with default sessings.
When I named different versions of Ubuntu and Debian I meant virtual machine on Google Cloud.
Comment 23 Marc Aurele La France 2015-11-25 02:20:23 AEDT
No.  What does `uname -a` actually say.  Anyting else, I've found, is smoke and mirrors.  Thanks.
Comment 24 Volth 2015-11-25 02:44:41 AEDT
(In reply to Marc Aurele La France from comment #23)
> No.  What does `uname -a` actually say.  Anyting else, I've found,
> is smoke and mirrors.  Thanks.

Linux g1.example.com 4.2.0-16-generic #19-Ubuntu SMP Thu Oct 8 15:35:06 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux
Comment 25 Volth 2015-11-25 02:52:59 AEDT
(In reply to Volth from comment #24)
> (In reply to Marc Aurele La France from comment #23)
> > No.  What does `uname -a` actually say.  Anyting else, I've found,
> > is smoke and mirrors.  Thanks.
> 
> Linux g1.example.com 4.2.0-16-generic #19-Ubuntu SMP Thu Oct 8
> 15:35:06 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux

Linux s-tserver12.lan 4.2.0-17-generic #21-Ubuntu SMP Fri Oct 23 19:56:16 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux
Linux q-datanode8.lan 4.2.0-18-generic #22-Ubuntu SMP Fri Nov 6 18:25:50 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux
Comment 26 Volth 2016-04-24 20:11:52 AEST
Reproducible with Ubuntu 16.04 LTS (kernel-4.4.0-21 openssh-7.2p2-4)
Comment 27 Marc Aurele La France 2016-04-26 09:39:10 AEST
This issue is recognised as a kernel regression and is still being discussed on LKML.
Comment 28 Marc Aurele La France 2016-05-24 08:33:22 AEST
The fix for this has now made it into the mainline Linux kernel.  It'll be included in 4.7.  See

http://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=0f40fbbcc34e093255a2b2d70b6b0fb48c3f39aa
Comment 29 Damien Miller 2016-05-24 13:50:18 AEST
Marc, thank you very much indeed for chasing this all the way down.
Comment 30 Darren Tucker 2016-07-20 14:59:40 AEST
Since this has been fixed in the kernel closing this bug.  Thanks for your work!
Comment 31 Marc Aurele La France 2016-07-26 09:08:36 AEST
The fix has now been incorporated into the following Linux mainline kernels:  4.1.26, 4.4.12, 4.5.6, 4.6.1 and the newly released 4.7.
Comment 32 Damien Miller 2016-08-02 10:40:36 AEST
Close all resolved bugs after 7.3p1 release