Bug 845 - Received disconnect from ???: 2: Corrupted MAC on input.
Summary: Received disconnect from ???: 2: Corrupted MAC on input.
Status: CLOSED WONTFIX
Alias: None
Product: Portable OpenSSH
Classification: Unclassified
Component: sshd
Version: 3.8p1
Hardware: UltraSPARC Solaris
Importance: P2 normal
Assignee: OpenSSH Bugzilla mailing list
URL:
Keywords:
Duplicates: 860 2941
Depends on:
Blocks:
 
Reported: 2004-04-21 01:55 AEST by David Annis
Modified: 2021-04-23 15:09 AEST (History)
CC List: 7 users

See Also:


Attachments
Webserver (target) sshd_config (2.51 KB, text/plain)
2004-04-21 01:58 AEST, David Annis
Source machine ssh_config (1.13 KB, text/plain)
2004-04-21 02:00 AEST, David Annis

Description David Annis 2004-04-21 01:55:04 AEST
When copying small files (html pages) to web servers, we occasionally get a 
failure message: 

Received disconnect from xx.xx.xx.xx: 2: Corrupted MAC on input. 
(xx.xx.xx.xx is the IP address of one of the web servers)

We copy 6 files to 2 machines every hour. The failure happens 2 or 3 times a 
day. It's a different file each time, and occurs on either target machine. We 
have this failure using scp, sftp, and a tar file streamed through stdin to ssh.

The source machine is AIX 5.1 ML05 using openssh 3.8p1, corporate internal DNS.

The target machines are SunOS 5.8, kernel 108528-18. They are outside our 
firewall, using our ISP's DNS. The source machine is not in the ISP's DNS. It 
is in the /etc/hosts file. X11Forwarding is turned off in sshd_config on the 
target machines.

The firewall's NICs are Sun qfe cards. The network switches are Cisco 6500.
Comment 1 David Annis 2004-04-21 01:58:25 AEST
Created attachment 605
Webserver (target) sshd_config

sshd_config from target machine (where corrupt MAC error occurs)
Comment 2 David Annis 2004-04-21 02:00:36 AEST
Created attachment 606
Source machine ssh_config

ssh_config from source machine (receives disconnect message from target
webserver).
Comment 3 Darren Tucker 2004-04-30 00:16:01 AEST
This is usually a problem on the network between client and server, but has also
been reported to be caused by bad RAM in either client or server.  The fact that
it's not consistent makes it unlikely to be a software problem.  How big are the
files, and what kind of network gear do you have between client and server?

Also see bug #510.
Comment 4 David Annis 2004-04-30 01:50:00 AEST
The source machine is an IBM P650 with standard IBM ethernet controller.
The firewall and web servers are Sun with Sun qfe cards.
The switches are all Cisco 6500.

I NEVER get this error going through the same infrastructure to machines in the 
DMZs. It ONLY happens going to the external web servers.
Comment 5 David Annis 2004-04-30 01:51:42 AEST
The 6 files in question range from 70 KB to 170 KB. I've also tested this as 
one file of about 500 KB.
Comment 6 Darren Tucker 2004-04-30 16:20:50 AEST
Which cipher are you using?  Does selecting a different cipher make any difference?

Please attach (i.e. use "create a new attachment") a complete debug trace ("scp
-vvv [options]") of a failed session.
Comment 7 Darren Tucker 2004-05-07 10:52:01 AEST
*** Bug 860 has been marked as a duplicate of this bug. ***
Comment 8 David Hodgson 2004-10-25 23:56:39 AEST
My three ha'p'orth ... judging by an internet search for this error message, 
this appears to be a common problem.  I was also receiving this error, and 
believe I have fixed it ...

The WinXP laptop I am using to connect to the Linux file server using X Window 
has two addresses (and so two DNS names), one for cable and the second for 
wireless.  If the machine name does not correspond to the DNS name of the 
network interface, I get this error.  If I change the machine name so 
everything matches the error doesn't occur.  Repeatedly changing the machine 
name is a big pain ...

I am using either OpenSSH 3.9p1 under Cygwin or PuTTY 0.55 to connect to OpenSSH 
3.5p1 on SuSE Linux 8.2, and I get the same error with both.

If you tell me what logs and configs you want, I can send them ...
Comment 9 David Hodgson 2004-10-26 23:05:54 AEST
My apologies ... spoke too soon.  Not getting bad MAC now, but instead getting 
Disconnecting: Bad packet length 3428026913.

Same result.
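
Some background on why the same underlying corruption can surface as "Bad packet length" instead of a MAC failure: per RFC 4253, every SSH binary packet begins with a 4-byte big-endian packet_length field, and with many ciphers that field is encrypted. If the ciphertext is damaged in transit, the decrypted length is effectively random and tends to fail the receiver's sanity check before the MAC is ever examined. The sketch below only illustrates that check. The helper name check_packet_length is hypothetical, and the MIN_PACKET and MAX_PACKET bounds are assumptions modelled on the limits OpenSSH is understood to enforce, not values taken from this report.

```python
import struct

# Assumed bounds for illustration only; real implementations enforce similar limits.
MIN_PACKET = 5            # 1 byte padding_length plus at least 4 bytes of padding
MAX_PACKET = 256 * 1024   # assumed upper bound on a single packet


def check_packet_length(first_block: bytes) -> int:
    """Parse the 4-byte big-endian packet_length and sanity-check it."""
    (length,) = struct.unpack(">I", first_block[:4])
    if not MIN_PACKET <= length <= MAX_PACKET:
        raise ValueError(f"Bad packet length {length}")
    return length


# These four bytes decode to the length quoted in the comment above.
garbage = bytes.fromhex("cc538a21")
try:
    check_packet_length(garbage)
except ValueError as exc:
    print(exc)  # Bad packet length 3428026913
```

A length of 3428026913 is far outside any plausible packet size, which is consistent with the receiver having decrypted damaged data rather than with a genuine protocol message.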
Comment 10 Darren Tucker 2004-11-23 16:54:56 AEDT
rapier at tyranny com reports tracking down another cause of these errors:

"Anyway, the problem we found with the corrupted mac on input was a result of
what seems to be a hardware bug in intel e1000 drivers. We since switched to
sysconnect cards and the problems went away (only to be replaced by another one
that's caused by a memory resources starvation issue when the system is under
high IO loads). Basically, the HMAC stuff seems pretty rock solid so if people
see this sort of error consistently they should probably look at the drivers,
hardware, or cabling."
Comment 11 Bogdan 2011-01-18 01:34:19 AEDT
I apologize if this is a silly question, but why is the connection killed when this happens instead of the affected message being retransmitted?

IIRC when a TCP checksum fails the TCP stack will retry sending that packet. Given that SSH has, for all intents and purposes, a better checksum (the MAC), why doesn’t it do the same when for some reason the TCP one fails? It doesn’t seem that it would be that hard, since it doesn’t have to do everything else TCP does, only retry packets with failed MACs.

If I understand the situation correctly, the main source of these bugs is bad network stacks (in my case, I suspect the impossible-to-disable rx/tx checksum offloading function of my network adapter is to blame), but this can happen, albeit rarely, even when the entire TCP stack functions as designed: the TCP checksum can fail to detect a transmission error, and on a noisy transmission medium it can happen often enough. As far as I know, occasionally corrupted packets are considered “normal” in TCP, not grounds for terminating the connection.
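
The point that the TCP checksum can miss a transmission error can be made concrete. The following is a minimal sketch, assuming a simplified RFC 1071 style checksum computed over the payload only (real TCP also covers a pseudo-header and the TCP header), and using a made-up key and HMAC-SHA256 purely for illustration rather than whatever MAC a given SSH session negotiates. Because the checksum is a sum of 16-bit words, reordering those words leaves it unchanged, whereas a keyed MAC over the same bytes changes.

```python
import hashlib
import hmac


def internet_checksum(data: bytes) -> int:
    """16-bit ones' complement checksum in the style of RFC 1071."""
    if len(data) % 2:
        data += b"\x00"  # pad odd-length input with a zero byte
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]
    while total >> 16:  # fold carries back into the low 16 bits
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF


original = b"ABCDEFGH"
# Simulated corruption: the first two 16-bit words swapped in transit.
corrupted = b"CDABEFGH"

key = b"hypothetical-session-key"  # illustrative only
mac_original = hmac.new(key, original, hashlib.sha256).digest()
mac_corrupted = hmac.new(key, corrupted, hashlib.sha256).digest()

print(internet_checksum(original) == internet_checksum(corrupted))  # True
print(hmac.compare_digest(mac_original, mac_corrupted))             # False
```

In this toy case the damaged payload passes the checksum but not the MAC, which is the situation in which the receiving SSH end sees "Corrupted MAC on input" even though TCP delivered the data without complaint.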
Comment 12 Devin Reade 2011-05-03 14:08:48 AEST
[More details for posterity]

For what it's worth, a few months back I found myself dealing with this situation in a couple of variants.  In one case, one end of the SSH session was to a VM in a Xen environment.  In another case, one end of the SSH session was to a VM in a VMware ESXi environment.

Copying anything via scp or sftp was almost impossible, although interactive shells usually worked.

In both cases, after lots of diagnosis and "google research" I was able to determine that the underlying cause seemed to be a faulty TCP segmentation offload mechanism in the underlying virtualized network layer.  (In one case, fingers were pointed at a virtual switch, in the other at the virtual NIC.)  Either way, it appears that the VM's kernel was offloading checksumming to the lower layers, but none of the lower layers actually bothered to do it.

Disabling TCP segmentation offload in the upper level of the network stack (that of the VM OS) solved the problem and the systems have been fine since then.

This *does* tend to indicate that it's not an SSH problem per se.
Comment 13 Darren Tucker 2018-12-07 16:09:35 AEDT
*** Bug 2941 has been marked as a duplicate of this bug. ***
Comment 14 Darren Tucker 2018-12-07 16:13:02 AEDT
For the record, another known cause of this is buggy network device firmware.  Historically, Linksys devices have been known to have issues in some versions of their firmware; see bug #510.
Comment 15 Dan 2018-12-07 16:35:35 AEDT
I have this problem all the time (Bug 2941) when running scp on my Ubuntu 16.04 box at home, to copy a multi-gigabyte file from a Linode server. (And never in the other direction, from home to Linode.)

I can transfer exactly the same files from other hosts to my home computer (e.g., from a Macintosh in my home), so it seems the problem is not on my home Ubuntu box, nor in my home network setup. I guess it's on Linode?
Comment 16 Dan 2018-12-07 16:36:53 AEDT
The error I get is ssh_dispatch_run_fatal: Connection to 11.22.33.44 port 33333: message authentication code incorrect.
Comment 17 Darren Tucker 2018-12-07 16:53:58 AEDT
(In reply to Dan from comment #15)
[...]
> I can transfer exactly the same files from other hosts to my home
> computer (e.g., from a Macintosh in my home), so it seems the
> problem is not on my home Ubuntu box, nor in my home network setup.

It's not clear from the description, but unless one of those tests traverses from inside to outside your home network, it could still be your home network.

> I guess it's on Linode?

Maybe.  I'd suggest testing a large transfer (of something that you would not mind disclosing, since it'll be in clear text) with something like netcat then sha256 source and destination files and see if they match.  That'll eliminate the variables of ssh and libcrypto.
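
The comparison step suggested above can be sketched as follows. This is a minimal example, assuming the test file has already been transferred in clear text by some other tool such as netcat; the helper name sha256_file and the throwaway demo file are illustrative and not part of any tool mentioned in this report. The file is read in chunks so that multi-gigabyte transfers do not need to fit in memory.

```python
import hashlib
import os
import tempfile


def sha256_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Demo with a throwaway file standing in for the transferred data.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"test payload\n" * 1000)
    demo_path = tmp.name
try:
    print(sha256_file(demo_path))
finally:
    os.remove(demo_path)
```

Running the same function on the source and destination copies and comparing the two digests would show whether the data was altered in transit. A mismatch on a transfer that did not involve ssh at all would point at the network path or hardware rather than at ssh or libcrypto.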
Comment 18 Dan 2018-12-08 11:59:35 AEDT
OK, I ran a bunch of tests on my home network, which consists of an ActionTec MI424WR router (for Verizon FIOS) connected to a Cisco 24-port switch (ProCurve 1400-24G), wired by CAT-6 to the various computers.

I ran scp on my Linux box (Ubuntu 16.04), trying to copy huge files from another host.

It works when the remote host is another computer on my home network (a Mac). So the problem can't be the Cisco switch or the Linux computer.

It fails when the remote host is on the internet. I tried two different internet Linux hosts (one Linode, one HostDime). Both scp operations died partway through the transfer. The Linux box said:

  ssh_dispatch_run_fatal: Connection to xx.xx.xx.xx port xxx: message authentication code incorrect

and the Mac said:

  Corrupted MAC on input.
  Disconnecting: Packet corrupt

So the problem is either the ActionTec router, the Verizon hookup in my basement, or something outside my home. Wheeee.
Comment 19 Castro B 2019-06-29 00:07:54 AEST
Are you still having a hard time? It's working for me now.

Castro B,
http://internetvergelijken.nl
Comment 20 Dan 2019-06-29 04:15:46 AEST
This ticket can be closed. I upgraded to a newer router and the problem disappeared.
Comment 21 Darren Tucker 2019-08-21 15:46:21 AEST
Another thing found to cause this in at least one case: ssh's hardware-accelerated GCM (aes128-gcm@openssh.com) running on the same CPU (a Xeon E5-2620 v4) as a hardware-accelerated sha1sum process.  Switching the ssh session to chacha20-poly1305@openssh.com worked around the problem (presumably since it avoids HW acceleration entirely).

The error message has changed due to some refactoring ("ssh_dispatch_run_fatal: Connection to [...] port 22: message authentication code incorrect") but it means the same thing.
Comment 22 Damien Miller 2019-08-21 22:19:51 AEST
Do you know what platform that was? That smells like a kernel bug, failing to save/restore SSE registers.
Comment 23 Darren Tucker 2019-08-21 22:23:50 AEST
(In reply to Damien Miller from comment #22)
> Do you know what platform that was? That smells like a kernel bug,
> failing to save/restore SSE registers.

It's a Linux of some flavour, but it's one of many similar machines and the only one affected, and it seems at least somewhat sensitive to which CPU core the sha1sum process is scheduled on, so my bet would be a faulty CPU.
Comment 24 Damien Miller 2021-04-23 15:09:43 AEST
closing resolved bugs as of 8.6p1 release