Created attachment 3615
Change seccomp sandbox default action to ENOSYS

From time to time, glibc changes its syscall wrappers to make use of new Linux kernel facilities. The strategy it uses for this is often to try more recently-introduced syscalls, but fall back to older ones if it gets ENOSYS, allowing it to cope gracefully with running on older kernel versions.

Unlike (as I understand it) OpenBSD's pledge(2), sandboxing using Linux's seccomp inherently violates the abstraction layer of C library calls to at least some extent, forcing programs that use it to keep track of changes to the C library. While OpenSSH has been doing a reasonable job at keeping up with this, it's fragile and typically reactive; I've had to update OpenSSH in Debian stable releases in the past to keep up with new kernels, or sometimes edge cases on less widely-used architectures. (In the linked bug, Julian also points out that it can cause issues when running older userspace versions in containers or similar on top of newer host kernels, as you might expect from this class of problem.)

I would like sshd to be less fragile here. The attached patch is one possible suggestion for making this less of a problem in future. It passes the regression tests here, but is otherwise definitely in the nature of an RFC.
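To make the fallback strategy described above concrete, here is a minimal sketch in C. It is not glibc's code: `sys_fn`, `call_with_fallback` and the three stand-in functions are hypothetical names that simulate a newer and an older kernel entry point, so that the ENOSYS path can be exercised without a real kernel.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical stand-ins for a newer and an older kernel entry point.
 * Each returns 0 on success, or -1 with errno set, like a syscall wrapper. */
typedef int (*sys_fn)(int arg);

static int new_call_missing(int arg) { (void)arg; errno = ENOSYS; return -1; }
static int new_call_denied(int arg)  { (void)arg; errno = EPERM;  return -1; }
static int old_call_ok(int arg)      { (void)arg; return 0; }

/* Try the newer call first and fall back to the older one only on ENOSYS.
 * Any other error is passed back to the caller unchanged. */
static int call_with_fallback(sys_fn newer, sys_fn older, int arg)
{
    int r = newer(arg);

    if (r == -1 && errno == ENOSYS)
        return older(arg);
    return r;
}
```

With `new_call_missing` as the newer call, `call_with_fallback` returns the result of `old_call_ok`; with `new_call_denied` it returns -1 and leaves errno as EPERM. The relevance to the sandbox is that a filter which kills the process on the unlisted newer syscall never lets the ENOSYS branch run, whereas one that returns ENOSYS would.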
(In reply to Colin Watson from comment #0)
> From time to time, glibc changes its syscall wrappers to make use of
> new Linux kernel facilities. The strategy it uses for this is often
> to try more recently-introduced syscalls, but fall back to older
> ones if it gets ENOSYS, allowing it to cope gracefully with running
> on older kernel versions.

Arbitrarily failing syscalls that do not normally fail has been the source of serious security vulnerabilities in the past (e.g. CVE-2000-0506). That's why the default action is "kill" instead of "fail", and other actions are considered on a case-by-case basis.

> it's fragile and typically reactive

You omitted "architecture dependent" :-) Our CI tests on amd64, i386, arm, arm64, mips, mipsel and riscv64, but it's impossible for us to cover every architecture/kernel/glibc combination.

> sandboxing using
> Linux's seccomp inherently violates the abstraction layer of C
> library calls to at least some extent, forcing programs that use it
> to keep track of changes to the C library.

I think that's a pretty good argument that glibc should provide an interface, usable by applications, that does not have that layering violation. Even just being able to specify the filters by libc function name rather than syscall name would help a lot. However, I suspect doing that would be challenging, given that the kernel and glibc are developed independently.
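The hazard with failing a call that normally succeeds can be sketched as follows. This is an illustration of the general problem, not code from OpenSSH or from the CVE: `setuid_fn`, `drop_privs_unchecked` and `drop_privs_checked` are hypothetical names, and the real setuid(2) is replaced by an injectable stand-in so that the failure can be simulated.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical hook standing in for setuid(2), so a failure can be injected. */
typedef int (*setuid_fn)(unsigned int uid);

static int setuid_fails(unsigned int uid) { (void)uid; errno = EPERM; return -1; }
static int setuid_works(unsigned int uid) { (void)uid; return 0; }

/* Fail-open: the return value is ignored, so the caller carries on
 * with whatever privileges it still holds. */
static int drop_privs_unchecked(setuid_fn do_setuid, unsigned int uid)
{
    (void)do_setuid(uid);
    return 0;   /* always reports success */
}

/* Fail-closed: any failure is reported so the caller can abort. */
static int drop_privs_checked(setuid_fn do_setuid, unsigned int uid)
{
    if (do_setuid(uid) != 0)
        return -1;
    return 0;
}
```

Code written like `drop_privs_unchecked` is safe only as long as the call cannot fail. A sandbox that turns unexpected syscalls into errors makes that assumption false, which is the argument for killing the process instead.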
(In reply to Darren Tucker from comment #1)
> Arbitrarily failing syscalls that do not normally fail has been the
> source of serious security vulnerabilities in the past (eg
> CVE-2000-0506). That's why the default action is "kill" instead of
> "fail" and others are considered on a case by case basis.

I don't think this is _not_ an issue, and I agree it requires care - that's why I included the umask case - but I think we have more problems the other way round.

> I think that's a pretty good argument that glibc should provide an
> interface that is usable by applications that does not have that
> layering violation. Even just being able to specify the filters by
> libc function name rather than syscall name would help a lot,
> however I suspect doing that would be challenging given that the
> kernel and glibc are developed independently.

Sure, but there seems little appetite to do this with actually-existing Linux and glibc (I certainly don't have time for that sort of multi-year project), so where does that leave us? Tracking syscall minutiae forever doesn't seem appealing.
(In reply to Colin Watson from comment #2)
> (In reply to Darren Tucker from comment #1)
> > [...]security vulnerabilities
>
> I don't think this is _not_ an issue, and I agree it requires care -
> that's why I included the umask case - but I think we have more
> problems the other way round.

Those fail closed and are (eventually) reported and fixed. The alternative fails open and risks becoming an exploit.

> > [fixing it in glibc]
>
> Sure, but there seems little appetite to do this with
> actually-existing Linux and glibc (I certainly don't have time for
> that sort of multi-year project), so where does that leave us?
> Tracking syscall minutiae forever doesn't seem appealing.

I don't have a good answer for that.
Created attachment 3640
safer debugging for seccomp sandbox violations

One thing we could do is make it easier to debug seccomp sandbox failures. Currently, these require a rebuild of OpenSSH and some signal-handler unsafe code (though I think its impact is limited to hung connections).

This tries to make the sandbox violation debugging signal-handler safe and, AFAIK, safe enough to keep enabled all the time. The only catch is that it requires stderr to be attached, as every other option (syslog, monitor log socket) is either unavailable or requires signal-handler unsafe syscalls.

Example (inserting a random setuid() call into sshd.c):

[djm@djm openssh]$ sudo /home/djm/cvs/openssh/sshd -Dep2222 -oPidFile=none -fnone
Server listening on 0.0.0.0 port 2222.
Server listening on :: port 2222.
ssh_sandbox_violation: unexpected system call: arch:0xc000003e syscall:0x69 addr:0x7f9ad54dc405
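For reading the example output: 0xc000003e is the value of AUDIT_ARCH_X86_64, and 0x69 is 105, which is setuid on x86_64, matching the call that was inserted. As a rough illustration of what "signal-handler safe" formatting involves, here is a minimal sketch that builds a similar line without printf and emits it with write(2). It is not the code from the attachment: `put_str`, `put_hex`, `format_violation` and `report_violation` are hypothetical names, and the function-name prefix seen in the real output is left out.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <unistd.h>

/* Append src to buf at *pos, always leaving room for a trailing NUL.
 * Assumes len >= 1. */
static void put_str(char *buf, size_t len, size_t *pos, const char *src)
{
    while (*src != '\0' && *pos + 1 < len)
        buf[(*pos)++] = *src++;
    buf[*pos] = '\0';
}

/* Append v as 0x-prefixed lowercase hex, using no library calls. */
static void put_hex(char *buf, size_t len, size_t *pos, unsigned long long v)
{
    char tmp[2 * sizeof(v) + 1];
    size_t i = sizeof(tmp) - 1;

    tmp[i] = '\0';
    do {
        tmp[--i] = "0123456789abcdef"[v & 0xf];
        v >>= 4;
    } while (v != 0);
    put_str(buf, len, pos, "0x");
    put_str(buf, len, pos, &tmp[i]);
}

/* Hypothetical formatter: builds the message in a caller-supplied buffer
 * and returns its length, not counting the trailing NUL. */
static size_t format_violation(char *buf, size_t len, unsigned long long arch,
    unsigned long long sysno, unsigned long long addr)
{
    size_t pos = 0;

    put_str(buf, len, &pos, "unexpected system call: arch:");
    put_hex(buf, len, &pos, arch);
    put_str(buf, len, &pos, " syscall:");
    put_hex(buf, len, &pos, sysno);
    put_str(buf, len, &pos, " addr:");
    put_hex(buf, len, &pos, addr);
    put_str(buf, len, &pos, "\n");
    return pos;
}

/* Emit the message with write(2), which is async-signal-safe. */
static void report_violation(unsigned long long arch, unsigned long long sysno,
    unsigned long long addr)
{
    char buf[128];
    size_t n = format_violation(buf, sizeof(buf), arch, sysno, addr);
    ssize_t r = write(STDERR_FILENO, buf, n);

    (void)r;    /* nothing safe to do about a failed write here */
}
```

A handler would call something like `report_violation` with the values taken from its siginfo. The point of the design is that everything on that path is either plain arithmetic on a stack buffer or write(2), so nothing can deadlock on a lock held by the interrupted code.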
Created attachment 3641
another version, logging via monitor

Here's another version. It's a bit more complex, but it preserves logging via the usual path by implementing the log writing using signal-handler safe code.