Tracing a QUIC Bug Back to a Linux Kernel Patch: How CUBIC’s Idle Handling Went Awry
This article explores a subtle bug uncovered in Cloudflare's QUIC implementation (quiche), which stemmed from porting a Linux kernel optimization for the CUBIC congestion controller. The issue caused the congestion window (cwnd) to get stuck at its minimum after severe loss, so affected connections never recovered. Below, we answer key questions about CUBIC, the bug, and the elegant fix.
What is CUBIC and why is it important for TCP and QUIC?
CUBIC, standardized in RFC 9438, is the default congestion control algorithm in the Linux kernel. It governs how most TCP and QUIC connections on the public Internet probe for bandwidth, react to packet loss, and recover. At Cloudflare, our open-source QUIC implementation (quiche) uses CUBIC as its default, meaning this code is critical for a large share of our traffic. CUBIC is a loss-based algorithm: it increases the congestion window (cwnd) when no loss is detected and decreases it upon loss, assuming capacity has been exceeded. Its widespread adoption makes any bug in CUBIC potentially far-reaching.
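
For reference, the growth curve that gives CUBIC its name can be sketched in a few lines. The constants C = 0.4 and beta_cubic = 0.7 and the formulas for K and W_cubic(t) come from RFC 9438; the function names and the example numbers are illustrative only, and this is not quiche's implementation.

```python
# Sketch of the RFC 9438 window-growth curve. Windows are in segments,
# time is in seconds. Function names are illustrative, not quiche's API.

C = 0.4            # RFC 9438: constant that scales how fast the curve grows
BETA_CUBIC = 0.7   # RFC 9438: multiplicative decrease factor applied on loss


def cubic_k(w_max: float, cwnd_epoch: float) -> float:
    """Seconds the curve needs to climb from cwnd_epoch back to w_max.

    Assumes w_max >= cwnd_epoch, which holds right after a loss reduction.
    """
    return ((w_max - cwnd_epoch) / C) ** (1.0 / 3.0)


def w_cubic(t: float, w_max: float, cwnd_epoch: float) -> float:
    """Target window t seconds into the current congestion-avoidance epoch."""
    k = cubic_k(w_max, cwnd_epoch)
    return C * (t - k) ** 3 + w_max


if __name__ == "__main__":
    w_max = 100.0                      # window when loss was detected
    cwnd_epoch = w_max * BETA_CUBIC    # window right after the reduction
    k = cubic_k(w_max, cwnd_epoch)
    for t in (0.0, k / 2, k, 2 * k):
        print(f"t={t:5.2f}s  W_cubic={w_cubic(t, w_max, cwnd_epoch):7.2f}")
```

The curve starts at the reduced window, flattens as it approaches the old maximum at t = K, then accelerates again to probe for more bandwidth. All of this depends on the epoch clock advancing and on growth being permitted, which is where the bug described below comes in.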

What was the symptom of this bug?
The bug first appeared as erratic failures in our ingress proxy integration tests. Specifically, tests that simulated heavy packet loss early in a QUIC connection, followed by a recovery phase, failed about 61% of the time. In a healthy scenario, CUBIC should reduce cwnd on loss, then gradually increase it again. Instead, the cwnd remained pinned at its minimum value (typically two or three packets) indefinitely, never recovering. This is a critical failure because a congestion controller that can't grow again after collapsing its window renders the connection virtually unusable. The symptom was reproducible and pointed to a deep flaw in how CUBIC handled the transition from loss to growth when the sender was app-limited (i.e., not fully utilizing the window).
How did a Linux kernel change intended for TCP trigger this bug in QUIC?
The story begins with a Linux kernel patch that aligned CUBIC with RFC 9438's app-limited exclusion (section 4.2-12). In TCP, an app-limited sender may not send enough data to fill the cwnd, so the algorithm should not treat a loss-free period in that state as evidence of spare capacity (which would incorrectly inflate cwnd). The patch added logic to detect app-limited states and skip cwnd growth during those rounds. When this logic was ported to quiche, it interacted unexpectedly with QUIC's different ACK timing and flow control. In QUIC, the sender often experiences app-limited periods, especially after loss when the receiver's flow control window shrinks. The ported code treated these as “idle” and permanently suppressed cwnd growth, causing the bug. What worked for TCP broke for QUIC because of different semantics around idle detection.
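
The exclusion itself is easy to state in code. The following is a minimal sketch, not the kernel patch or quiche's port: the `Sender` class, its fields, and the simple additive growth step are all assumptions made for illustration. It treats a sender as app-limited when it has less data in flight than the window allows, and skips growth for ACKs that arrive in that state.

```python
from dataclasses import dataclass


@dataclass
class Sender:
    """Toy sender state; every field name here is hypothetical."""
    cwnd: int        # congestion window, in packets
    in_flight: int   # packets sent but not yet acknowledged

    def is_app_limited(self) -> bool:
        # The window has room, but the application did not use it.
        return self.in_flight < self.cwnd

    def on_ack(self, acked: int) -> None:
        # Sample the state before the ACK frees up window space.
        app_limited = self.is_app_limited()
        self.in_flight -= acked
        if app_limited:
            return       # no evidence the path can carry more: do not grow
        self.cwnd += 1   # stand-in for CUBIC's real growth step


if __name__ == "__main__":
    busy = Sender(cwnd=10, in_flight=10)
    busy.on_ack(acked=10)   # window was full, so it is allowed to grow
    idle = Sender(cwnd=10, in_flight=3)
    idle.on_ack(acked=3)    # window was mostly empty, so it stays put
    print(busy.cwnd, idle.cwnd)
```

The important design point is that the gate is only as good as `is_app_limited`: if that predicate ever returns true for a sender that is actually window-limited, growth stops with nothing to restart it.
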
What exactly was the root cause in the code?
The root cause was a combination of two conditions: (1) the new app-limited detection incorrectly flagged recovery periods as app-limited, and (2) CUBIC's state machine then refused to increase cwnd during those flagged rounds. Specifically, the check for whether the sender had enough data to fill the cwnd used the current cwnd, which at minimum is tiny. After loss, the cwnd is small, so even a small amount of data could fill it, but the code compared against a threshold that assumed a larger window. This created a feedback loop: the small cwnd made the sender appear app-limited, which prevented cwnd growth, which in turn kept the cwnd small. The fix was to adjust the condition so that app-limited detection only applies when the sender truly cannot send more due to application limits, not due to a cwnd that was just reduced by loss. This one-line change broke the cycle.
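
The feedback loop is easier to see in a toy simulation. The sketch below is a deliberately simplified model built from the description above, not quiche's code: `MIN_CWND`, `ASSUMED_WINDOW`, both check functions, and the one-packet-per-round growth are hypothetical. The application always has data queued, so the sender is never truly app-limited; only the flawed check says otherwise.

```python
MIN_CWND = 2          # packets; hypothetical floor the window collapses to
ASSUMED_WINDOW = 10   # packets; hypothetical threshold baked into the flawed check


def flawed_app_limited(in_flight: int, has_backlog: bool) -> bool:
    # Compares against a fixed threshold, so a tiny but completely full
    # window still looks under-used.
    return in_flight < ASSUMED_WINDOW


def backlog_aware_app_limited(in_flight: int, has_backlog: bool) -> bool:
    # App-limited only when the application has nothing more to send.
    return not has_backlog


def simulate(app_limited, rounds: int = 20) -> int:
    """Return cwnd after `rounds` loss-free round trips, starting at the floor."""
    cwnd = MIN_CWND
    for _ in range(rounds):
        in_flight = cwnd      # the application keeps the window full
        has_backlog = True    # and always has more data queued
        if not app_limited(in_flight, has_backlog):
            cwnd += 1         # stand-in for CUBIC's growth step
    return cwnd


if __name__ == "__main__":
    print("flawed check:       ", simulate(flawed_app_limited))
    print("backlog-aware check:", simulate(backlog_aware_app_limited))
```

With the flawed check the window never leaves `MIN_CWND`, because growing is the only way to stop looking app-limited and looking app-limited is what blocks growth. With the backlog-aware check the window grows every round.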

How was the bug fixed, and what was the impact?
The fix was surprisingly simple: one line changed in the app-limited exclusion logic. Instead of comparing the amount of data the sender could transmit against the current cwnd, the patched code compares it against the target cwnd (the value CUBIC would aim for after recovery). This prevented the false app-limited signal during recovery. After the patch, the previously failing tests passed reliably. The impact was immediate: connections suffering severe loss recovered normally, with cwnd growing from minimum back to a useful size. The fix has been merged into quiche and upstreamed where appropriate. It highlights how even well-intentioned optimizations can have unintended consequences when ported across protocols.
What lessons does this bug teach about congestion control testing and protocol differences?
The bug underscores two important lessons. First, congestion control testing must include edge-case scenarios like recovery from minimum cwnd, not just steady-state growth. Most tests focus on normal operation, but bugs often hide in corners. Second, porting code between TCP and QUIC requires careful scrutiny of assumptions about idle detection, ACK timing, and flow control. QUIC's different architecture (e.g., multiple streams, variable ACK frequencies) can break logic that works for TCP. The one-line fix was elegant, but finding it required deep understanding of both protocols. Developers should always verify that app-limited and idle states are defined consistently across the protocol stack.
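
The first lesson translates directly into a test. What follows is a minimal sketch, assuming a controller can be driven one loss-free round trip at a time through a `step(cwnd) -> cwnd` callable; that interface, the helper name, and both toy controllers are hypothetical and are not part of quiche's test suite.

```python
from typing import Callable


def recovers_from_floor(step: Callable[[int], int], floor: int,
                        target: int, max_rounds: int) -> bool:
    """Drive `step` from the minimum window; report whether it reaches `target`."""
    cwnd = floor
    for _ in range(max_rounds):
        cwnd = step(cwnd)
        if cwnd >= target:
            return True
    return False


def healthy_step(cwnd: int) -> int:
    return cwnd + 1   # grows every loss-free round


def pinned_step(cwnd: int) -> int:
    return cwnd       # never grows, like the stuck window described above


if __name__ == "__main__":
    print(recovers_from_floor(healthy_step, floor=2, target=10, max_rounds=50))
    print(recovers_from_floor(pinned_step, floor=2, target=10, max_rounds=50))
```

Bounding the number of rounds matters: a pinned window does not crash or time out on its own, so the test has to state how long recovery is allowed to take.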