Created attachment 3614 [details] patch with a kludge fix An update to net-misc/openssh-9.0_p1-r6 (Gentoo port) this morning led immediately to nondeterministic (~50% failure rate) connection failures, both in ssh and sshd, with this message: ssh_dispatch_run_fatal: Connection to a.b.c.d port 22: incorrect signature It quickly became apparent that the problem only affected connections authenticated with ED25519. I found, by instrumenting ed25519.c:crypto_sign_ed25519_open(), that bit #256 (of 255 valid bits) in sm[] was nondeterministically set or clear, whereas the corresponding bit in t2[] was always clear. I strongly suspect the new behavior is due to new compiler behavior, because (1) reverting to openssh-8.9_p1 fixed the problem even though all ED source code is identical, and (2) reverting to a rebuilt openssh-9.0_p1-r4 (Gentoo port revision again), which was functioning flawlessly before the rebuild, did not fix the problem. The compiler version for the openssh-9.0_p1-r4 that worked right was gcc-11.3.0, versus currently installed gcc-11.3.1_p20220909. For now I've kludged a fix for myself that just copies the spurious 256th bit from sm[31] to t2[31]. See attached patch. This is not the right way to fix the problem of course.
This glitch turns out to have been caused by building with the llvm-15.0.1+clang-15.0.1 toolchain (by accident/Gentoo portage/package bug). It's highly repeatable, and is either a bug in the compiler/toolchain, or a bug somewhere in openssh around the handling of ED25519 signature blobs and whatnot. The ssh and sshd built under clang-15 cause the "incorrect signature" failure for ED25519 on incoming connections to ED25519-hostkeyed sshd, and outgoing connections to e.g. github.com, which use ED25519 hostkeys. Building under gcc-11.3.1 resolves all ED25519 problems, inbound and outbound.
I have repro'd on Fedora with a clang-15.0.1 built from git. So far I've found that for some builds, "make t-exec LTESTS=connect" will reliably reproduce the error for me. In some cases, deleting ssh-ed25519.o and rebuilding will result in a binary that does not exhibit the problem (and such a binary will run the regress suite (at least until I got bored). To repro: $ CC=/opt/clang-15.0.1/bin/clang ../../configure $ for i in `seq 0 9`; do make clean >/dev/null 2>&1; make -j24 >/dev/null 2>&1; for j in `seq 0 9`; do if make t-exec LTESTS=connect >/dev/null 2>&1; then echo -n "good "; else echo -n "bad "; fi;done; echo; done bad bad bad bad bad bad bad bad bad bad bad bad bad bad bad bad bad bad bad bad bad bad bad bad bad bad bad bad bad bad good good good good good good good good good good bad bad bad bad bad bad bad bad bad bad bad bad bad bad bad bad bad bad bad bad good good good good good good good good good good bad bad bad bad bad bad bad bad bad bad bad bad bad bad bad bad bad bad bad bad good good good good good good good good good good Now if I remove -fzero-call-used-regs=all from the Makefile with no other changes: $ for i in `seq 0 9`; do make clean >/dev/null 2>&1; make -j24 >/dev/null 2>&1; for j in `seq 0 9`; do if make t-exec LTESTS=connect >/dev/null 2>&1; then echo -n "good "; else echo -n "bad "; fi;done; echo; done good good good good good good good good good good good good good good good good good good good good good good good good good good good good good good good good good good good good good good good good good good good good good good good good good good good good good good good good good good good good good good good good good good good good good good good good good good good good good good good good good good good good good good good good good good good good good good good good good good good good To me, this smells like clang 15.0.1 has some compile-time undefined behaviour in -fzero-call-used-regs.
I just ran into this in another context, so we need to get a minified testcase we can submit to clang as well as taking some action in configure.ac; either removing these flags entirely or turning them off for broken compiler versions
I created https://github.com/llvm/llvm-project/issues/59242 to address this issue.
It looks like Clang 15.0.6 was just release. There's a fix in it for a similar OpenSSH bug. Could you give it a try please? https://releases.llvm.org/
15.0.6 seems to have the same problem. $ /opt/clang-15.0.6/bin/clang --version clang version 15.0.6 (https://github.com/llvm/llvm-project.git 088f33605d8a61ff519c580a71b1dd57d16a03f8) Target: x86_64-unknown-linux-gnu Thread model: posix InstalledDir: /opt/clang-15.0.6/bin $ CC=/opt/clang-15.0.6/bin/clang ../../configure $ for i in `seq 0 9`; do make clean >/dev/null 2>&1; make -j24 >/dev/null 2>&1; for j in `seq 0 9`; do if make t-exec LTESTS=connect >/dev/null 2>&1; then echo -n "good "; else echo -n "bad "; fi;done; echo; done bad bad bad bad bad bad bad bad bad bad good good good good good good good good good good good good good good good good good good good good bad bad bad bad bad bad bad bad bad bad good good good good good good good good good good good good good good good good good good good good good good good good good good good good good good good good good good good good good good good good bad bad bad bad bad bad bad bad bad bad good good good good good good good good good good $ tail -4 regress/failed-ssh.log debug2: ssh_ed25519_verify: crypto_sign_ed25519_open failed: -1 ssh_dispatch_run_fatal: Connection to UNKNOWN port 65535: incorrect signature FAIL: ssh proxycommand connect failed
Created attachment 3628 [details] Avoid fzero-call-used-regs on possibly buggy clang versions Given that the problem seems to occur with several released clang versions I think we should avoid the problematic flag for all versions of clang-15. We can add more specific rules if we later learn of versions that work.
Until we have a fix, you might want to use `-fzero-call-used-regs=used` instead of `all`. The latter is a huge hammer and could zero out registers that are still in use up the call chain.
Thanks, that also passes the test above. I've changed the diff to pass "=used" instead and "=all" if clang 15 is detected and committed it as https://github.com/openssh/openssh-portable/commit/62cc33e6eed847aafdc29e34aa69e9bd82a0ee16.
Should be fixed by https://reviews.llvm.org/D139679. I'll see if we can backport it to 5.0.7.
I have cherry-picked the configure.ac change into the V_9_1 branch too, so if we do a 9.1p2 release it'll be in that too. If the clange fix makes it into clang 15 releases and it's important enough we can explicitly enable it on those, otherwise we'll just leave it until clang eventually moves to v16. Thanks all.
OpenSSH 9.3 has been released. Close resolved bugs