
The TCP SACK panic

By Jake Edge
June 19, 2019

Selective acknowledgment (SACK) is a technique used by TCP to help alleviate congestion that can arise due to the retransmission of dropped packets. It allows the endpoints to describe which pieces of the data they have received, so that only the missing pieces need to be retransmitted. However, a bug was recently found in the Linux implementation of SACK that allows remote attackers to panic the system by sending crafted SACK information.

Data sent via TCP is broken up into multiple segments based on the maximum segment size (MSS) specified by the other endpoint—or some other network hardware in the path it traversed. Those segments are transmitted to that endpoint, which acknowledges that it has received them. Originally, those acknowledgments (ACKs) could only indicate that it had received segments up to the first gap; so if one early segment was lost (e.g. dropped due to congestion), the endpoint could only ACK those up to the lost one. The originating endpoint would have to retransmit many segments that had actually been received in order to ensure the data gets there; the status of the later segments is unknown, so they have to be resent.

In simplified form, sender A might send segments 20-50, with segments 23 and 37 getting dropped along the way. Receiver B can only ACK segments 20-22, so A must send 23-50 again. As might be guessed, if the link is congested such that segments are being dropped, sending a bunch of potentially redundant traffic is not going to help things.

Selective acknowledgment was created as a mechanism to eliminate that redundant traffic; it was introduced in 1996 by RFC 2018. The idea is that receiver B can ACK 20-22, 24-36, and 38-50, so that A need only resend 23 and 37. It seems like common sense at some level: if someone read off a string of 30 words and you missed the third, you wouldn't ask them to repeat the list starting at the third word.
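
To make the example concrete, here is a small user-space sketch (ordinary C, not kernel code) that derives the cumulative ACK and the SACK blocks a receiver would report for the scenario above; segment numbers stand in for the byte-sequence ranges that real SACK options carry:

    #include <stdio.h>
    #include <stdbool.h>

    #define FIRST 20
    #define LAST  50

    int main(void)
    {
        bool received[LAST + 1] = { false };

        /* Receiver B got segments 20-50, except 23 and 37. */
        for (int i = FIRST; i <= LAST; i++)
            received[i] = true;
        received[23] = received[37] = false;

        /* Cumulative ACK covers everything up to the first gap. */
        int ack = FIRST - 1;
        while (ack < LAST && received[ack + 1])
            ack++;
        printf("ACK up to segment %d\n", ack);

        /* Each SACK block is a contiguous run beyond the gap. */
        for (int i = ack + 2; i <= LAST; i++) {
            if (received[i] && !received[i - 1]) {
                int start = i;
                while (i < LAST && received[i + 1])
                    i++;
                printf("SACK block %d-%d\n", start, i);
            }
        }
        return 0;
    }

Running it prints "ACK up to segment 22" followed by SACK blocks 24-36 and 38-50, so sender A knows to resend only 23 and 37.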

In order to keep track of all of that, the network subsystem has some bookkeeping to do. It is in this record keeping that the bug was found.

The struct sk_buff (typically called an SKB) is a kernel data structure that is used to hold network data of various sorts, including for transmit queues, receive queues, SACK queues, and more. For reference, networking maintainer David Miller has a nice overview (if somewhat dated) of how SKBs are used in the kernel. Part of the bookkeeping for TCP is to keep track of the 32KB (64KB on PowerPC) fragments that the TCP data stream has been broken up into; it is in the interaction between fragments and SACK where the kernel went astray.

The struct tcp_skb_cb is a control buffer that tracks various things about a TCP packet, including the number of segments/fragments it has been broken up into. It does so for the generic segmentation offload (GSO) feature, which moves the segmentation of packets as low as it can in the network stack, possibly offloading it to the network hardware. The number of segments is stored in the tcp_gso_segs field, which is a two-byte unsigned integer. That works fine as long as the number of segments never exceeds 65,535.
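
In simplified form, the relevant fields look something like the sketch below; the real definition lives in include/net/tcp.h and has many more members, so treat this as an illustration of the layout rather than the kernel's actual code:

    #include <stdint.h>

    /* Simplified sketch of struct tcp_skb_cb's GSO bookkeeping. */
    struct tcp_skb_cb_sketch {
        uint32_t seq;          /* starting sequence number */
        uint32_t end_seq;      /* seq + length of the data */
        uint16_t tcp_gso_segs; /* segment count: wraps past 65,535 */
        uint16_t tcp_gso_size; /* size of each segment */
    };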

But that is just what can happen when SACK has been agreed upon by the endpoints, which is negotiated when the connection is established. The kernel coalesces multiple SKBs in order to process blocks of unacknowledged segments more efficiently; an attacker who advertises a small MSS (perhaps the minimum of 48 bytes, which leaves only eight bytes for actual user data) can then carefully choose which segments to acknowledge so that this coalescing overflows tcp_gso_segs.
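
Some quick arithmetic shows how far past 16 bits the count can go, assuming the usual limit of 17 fragments per SKB (the exact constants vary by architecture and configuration):

    17 fragments x 32KB           = 557,056 bytes held by one SKB
    557,056 / 8 bytes user data   =  69,632 segments

    69,632 > 65,535, so the 16-bit tcp_gso_segs counter wraps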

That overflow would cause a BUG_ON() in tcp_shifted_skb() to be hit, leading to a kernel panic. This was the most serious of four SACK-related bugs found by Jonathan Looney at Netflix. Two other Linux bugs were reported, both leading to a SACK slowdown or excessive resource use, which could also lead to a denial of service. There is also a SACK slowness problem that Looney identified in FreeBSD 12 when using the RACK TCP stack. Netflix contributed RACK to FreeBSD just over a year ago.

The SACK panic has been designated as CVE-2019-11477; it is clearly the most severe of the Linux problems. CVE-2019-11478 is another denial of service; by crafting a sequence of SACKs, an attacker can cause fragmentation of the TCP retransmission queue, leading to higher resource use. CVE-2019-11479 points out that the minimum MSS accepted by Linux is hard-coded at 48 bytes, which means that an attacker can force the kernel to spend far more CPU, memory, and bandwidth to deliver a relatively small amount of user data. The fix for that is to give the administrator a sysctl knob to set the minimum MSS that the kernel will accept; it is left at 48 by default for compatibility, but it can now easily be raised.
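
That knob is net.ipv4.tcp_min_snd_mss; it can be inspected and raised with sysctl. The value 536 below (the classic IPv4 default MSS) is only an illustrative choice:

    # read the current minimum (48 by default)
    sysctl net.ipv4.tcp_min_snd_mss
    # raise it to something an attacker cannot usefully shrink
    sysctl -w net.ipv4.tcp_min_snd_mss=536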

The problems have been addressed; the Netflix report has links to the individual patches. Those patches were released as part of the 5.1.11, 4.19.52, 4.14.127, 4.9.182, and 4.4.182 stable updates that were made on June 17, the same day as the embargo was lifted. Distribution kernels have largely been updated at this point, so those who can upgrade should probably do so.
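
The core of the fix for the panic is to decline to coalesce SKBs when the combined segment count would no longer fit in 16 bits. What follows is a minimal sketch of that guard, not the actual kernel change (which lands in net/ipv4/tcp_input.c among other places):

    #include <stdint.h>
    #include <stdbool.h>

    /* Refuse the merge rather than let the u16 counter wrap. */
    static bool can_coalesce(uint16_t to_segs, uint16_t from_segs)
    {
        return (uint32_t)to_segs + (uint32_t)from_segs <= 65535;
    }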

There are various mitigations for those unable to update right away. Restricting the MSS to a reasonable value using iptables or other means will thwart these attacks, but that mitigation also requires disabling MTU probing by setting the net.ipv4.tcp_mtu_probing sysctl to 0 for CVE-2019-11477 and CVE-2019-11478. Either of those CVEs can instead be thwarted by turning off SACK entirely, by setting /proc/sys/net/ipv4/tcp_sack to 0. To avoid CVE-2019-11479, administrators simply need to filter out overly small MSS values using one of the methods listed by Netflix.
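
For reference, those mitigations boil down to commands along these lines; the 1:500 MSS range is one commonly cited cutoff, not a requirement:

    # drop new connections that advertise an unreasonably small MSS...
    iptables -I INPUT -p tcp --tcp-flags SYN SYN -m tcpmss --mss 1:500 -j DROP
    # ...which must be paired with disabling MTU probing
    sysctl -w net.ipv4.tcp_mtu_probing=0

    # or, alternatively, disable SACK processing entirely
    echo 0 > /proc/sys/net/ipv4/tcp_sack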

The Red Hat vulnerability report has lots of useful details, as does the Netflix report mentioned above. A remotely triggered kernel crash is obviously a nasty surprise with potentially wide-ranging impact. Only the endpoints of a connection are affected, however, which limits the damage somewhat. At least servers and desktops can be updated, which may not be true of all the gear our traffic visits on the way to its destination.


Index entries for this article
Kernel: Networking/Security
Kernel: Security/Vulnerabilities
Security: Linux kernel/Networking
Security: Networking/Vulnerabilities



The TCP SACK panic

Posted Jun 19, 2019 22:28 UTC (Wed) by gus3 (guest, #61103) [Link] (2 responses)

I wonder what this means for embedded space, specifically Android. Especially since a watchdog-timer reboot counts as a type of DOS.

The TCP SACK panic

Posted Jun 19, 2019 22:40 UTC (Wed) by Darakian (guest, #96997) [Link] (1 responses)

> Part of the bookkeeping for TCP is to keep track of the 32KB (64KB on PowerPC)

Why is there an arch specific difference in the TCP fragment buffer size?

The TCP SACK panic

Posted Jun 19, 2019 23:28 UTC (Wed) by BenHutchings (subscriber, #37955) [Link]

If I remember correctly, this size is the greater of 32 KiB and one page, and some PowerPC configurations have a page size of 64 KiB.

The TCP SACK panic

Posted Jun 19, 2019 23:11 UTC (Wed) by iabervon (subscriber, #722) [Link] (2 responses)

What was the fix for the panic? The article clearly laid out why the current code doesn't work, but I haven't seen what the new kernel does differently, and it seems like the LWN article would be a good place to record the post-fix behavior.

As a less critical improvement, it would be great to eliminate BUG_ON from the networking subsystem. At worst, it should be possible to drop packets or close connections or bring down an interface, rather than dropping all the packets, closing all the connections, and bringing down all the interfaces, along with filesystems and userspace.

The TCP SACK panic

Posted Jun 19, 2019 23:12 UTC (Wed) by Cyberax (✭ supporter ✭, #52523) [Link]

This is actually debatable. A server that cleanly reboots on a BUG might be better than an inaccessible server (SSH also depends on the network).

The TCP SACK panic

Posted Jun 19, 2019 23:41 UTC (Wed) by BenHutchings (subscriber, #37955) [Link]

> What was the fix for the panic? The article clearly laid out why the current code doesn't work, but I haven't seen what the new kernel does differently, and it seems like the LWN article would be a good place to record the post-fix behavior.

The fix was to avoid coalescing queued skbs if this would cause the segment counter to overflow.

> As a less critical improvement, it would be great to eliminate BUG_ON from the networking subsystem. At worst, it should be possible to drop packets or close connections or bring down an interface,

BUG_ON means the programmer didn't think this error was possible (but wasn't sure) and doesn't know how to handle it. If the cause was a buffer overflow, for example, trying to clean up can make things worse.

> rather than dropping all the packets, closing all the connections, and bringing down all the interfaces, along with filesystems and userspace.

In this case, sure, now we know how this goes wrong, it would obviously be better to kill the connection, but that probably wasn't so clear when the assertion was added.

The TCP SACK panic

Posted Jun 20, 2019 2:09 UTC (Thu) by josh (subscriber, #17465) [Link] (3 responses)

I'm really curious how Netflix happened across these vulnerabilities. Did they hit them in practice, did they find them through security testing, did they have some particular reason to be staring at this code...

The TCP SACK panic

Posted Jun 20, 2019 4:26 UTC (Thu) by mtaht (subscriber, #11087) [Link]

Ironically, I had been pushing for a couple of years now for us to start exploring reducing the MSS when under extreme congestion and cwnd = 2. ( https://www.bufferbloat.net/projects/ecn-sane/wiki/ )

Oops. Looks like someone else found a use for the idea.

The TCP SACK panic

Posted Jun 20, 2019 8:46 UTC (Thu) by Lennie (subscriber, #49641) [Link]

Notice how the security reports say FreeBSD and Linux:

https://github.com/Netflix/security-bulletins/blob/master...

But FreeBSD did not release any security updates; how is that possible?

It turns out to be FreeBSD 12 using the RACK TCP stack:

http://freebsd.1045724.x6.nabble.com/TCP-RACK-performance...

The RACK TCP stack was created by Netflix for their FreeBSD-based CDN appliance:

https://openconnect.netflix.com/en/appliances/

The TCP SACK panic

Posted Jul 7, 2019 17:48 UTC (Sun) by kmeyer (subscriber, #50720) [Link]

Looney works on FreeBSD networking at Netflix and was probably testing his own code against Linux.

The TCP SACK panic

Posted Jun 20, 2019 12:45 UTC (Thu) by sam13 (subscriber, #113386) [Link] (3 responses)

What's not clear to me is whether the overflow only occurs if segmentation offloading is enabled. According to Red Hat's advisory [1] this seems to be the case:

> When Segmentation offload is on and SACK mechanism is also enabled, due to packet loss and selective retransmission of some packets, SKB could end up holding multiple packets, counted by ‘tcp_gso_segs’.

But neither Red Hat's advisory nor this article mentions disabling segmentation offloading as a mitigation. Hence I assume that the problematic "tcp_gso_segs" field can overflow even if segmentation offloading is disabled?

[1] https://access.redhat.com/security/vulnerabilities/tcpsack

The TCP SACK panic

Posted Jun 23, 2019 8:32 UTC (Sun) by richard_weinberger (subscriber, #38938) [Link] (2 responses)

AFAICT, without offloading (GSO/GRO) the BUG_ON() cannot be triggered.

The TCP SACK panic

Posted Jun 24, 2019 18:47 UTC (Mon) by richard_weinberger (subscriber, #38938) [Link] (1 responses)

BTW: This is only true for older kernels.

commit 0a6b2a1dc2a2105f178255fe495eb914b09cb37a
Author: Eric Dumazet <edumazet@google.com>
Date: Mon Feb 19 11:56:47 2018 -0800

tcp: switch to GSO being always on

...changed the game.

The TCP SACK panic

Posted Jul 27, 2019 14:28 UTC (Sat) by hbkmustang (guest, #133442) [Link]

So, what do you think now: will turning off GSO (in Linux, ethtool -k ...) solve the problem?
Or is the only right way to defend against this problem to turn off SACK in /proc?

The TCP SACK panic

Posted Jun 20, 2019 23:28 UTC (Thu) by dgc (subscriber, #6611) [Link] (3 responses)

The old is new again. :/

I remember back in late 2002 when a bug report for an Irix NFS server performance issue was nailed down to a serious SACK problem due to really small MSS windows being sent from a buggy NFS client implementation.

The phrase "SACK panic" triggered my memory immediately because we used that phrase to describe what the reproducer we wrote did to the kernel. The Irix security patch release notes from early 2003 says it all:

"* Denial of Service attack involving clients sending packets with very small MSS values"

http://www.xatrix.org/news/irix-ip-denial-of-service-fixe...

That patch fixed a set of problems very similar to those being reported here....

-Dave.

The TCP SACK panic

Posted Jun 21, 2019 10:01 UTC (Fri) by bjartur (guest, #67801) [Link] (2 responses)

Do you think Linux should be tested against more Irix vulnerabilities?

The TCP SACK panic

Posted Jun 22, 2019 2:57 UTC (Sat) by dgc (subscriber, #6611) [Link] (1 responses)

> Do you think Linux should be tested against more Irix vulnerabilities?

Actually the point I was making really has nothing to do with Irix - it was just the example that illustrated my point. i.e. that the OS networking community knew about these problems 15 years ago but that knowledge seems to have been lost and so we have repeated past mistakes....

Which raises some interesting questions: where did that institutional knowledge go? Have all the network engineers of the time (like me) moved on to other things and so that knowledge has been (effectively) lost? How do we prevent the same flaw from being re-introduced and re-discovered in another 15 years?

So, as much as you probably were poking fun at Irix with your comment, I'll point out that regression test suites are actually very good for retaining knowledge of flaws like this over the long term. i.e. new developers learn about them when their changes cause unexpected failures of tests that have been around for 15+ years....

-Dave.

The TCP SACK panic

Posted Jun 22, 2019 4:07 UTC (Sat) by k8to (guest, #15413) [Link]

I dunno. My experience is that most software companies repeat the same mistakes every year.

Having to wait a decade to introduce the same errors sounds relatively spotless.

The TCP SACK panic

Posted Jun 21, 2019 8:36 UTC (Fri) by XTerminator (subscriber, #59581) [Link]

Is there any proof of concept code we can use to check vulnerability status and effectiveness of remedies?


Copyright © 2019, Eklektix, Inc.
This article may be redistributed under the terms of the Creative Commons CC BY-SA 4.0 license
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds
