The Hidden Bug in CUBIC: When Idle Isn't Idle in QUIC Congestion Control
When a Linux kernel optimization aimed at improving TCP behavior was ported to Cloudflare's QUIC implementation, it introduced a subtle bug that could permanently cripple the CUBIC congestion controller. CUBIC, the default congestion control algorithm for Linux and thus most internet traffic, governs how TCP and QUIC connections probe for bandwidth and react to loss. At Cloudflare, our open-source QUIC library, quiche, relies on CUBIC, placing it in the critical path for a large share of our traffic. Here's the story of how a change implementing the app-limited exclusion described in RFC 9438 inadvertently caused CUBIC's congestion window to get stuck at its minimum, and how a near-one-line fix resolved it.
1. What is CUBIC and why does it matter?
CUBIC is the default congestion control algorithm (CCA) in the Linux kernel, standardized in RFC 9438. It's used by the vast majority of TCP and QUIC connections on the public internet, making it a critical component for network performance. The algorithm determines how much data a sender can have in flight at any moment—the congestion window (cwnd). A larger cwnd allows faster data transfer, while a smaller one throttles it. CUBIC is a loss-based CCA: it increases cwnd when there's no loss (assuming available bandwidth) and decreases it sharply upon detecting packet loss (assuming network congestion). At Cloudflare, our QUIC implementation, quiche, uses CUBIC as its default, meaning this algorithm is in the hot path for a significant portion of our traffic. Understanding its behavior is crucial for maintaining high performance and reliability.

2. How does CUBIC's congestion control work?
At its core, CUBIC adjusts the congestion window based on network feedback. The algorithm operates in three phases: slow start (exponential growth), congestion avoidance (cubic growth), and recovery upon loss. When no packet loss is detected, CUBIC probes for more bandwidth by increasing cwnd. The growth follows a cubic function of time, hence the name. When packet loss occurs, CUBIC reduces cwnd by a factor (typically half) and enters recovery, after which it begins increasing again. This design aims to maximize throughput while avoiding congestion collapse. Importantly, CUBIC assumes that loss signals network capacity exhaustion. The algorithm also includes an app-limited exclusion: if the sender isn't sending enough data to fully utilize the network (e.g., because the application has no data), cwnd growth should be limited to prevent overly aggressive behavior. This nuance was at the heart of the bug.
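RFC 9438 defines the growth curve as W_cubic(t) = C(t - K)^3 + W_max, where W_max is the window size just before the last loss event. The following is a minimal sketch of that curve in Rust, using the RFC's default constants and simplified "segments and seconds" units rather than quiche's byte-based bookkeeping; it is an illustration, not quiche's actual code.

```rust
/// A minimal sketch of CUBIC's growth curve from RFC 9438 (not quiche's
/// actual code). Units are simplified to segments and seconds.
const C: f64 = 0.4; // RFC 9438 default scaling constant
const BETA_CUBIC: f64 = 0.7; // multiplicative decrease factor

/// Target window t seconds after the last congestion event, where `w_max`
/// is the window size reached just before that event.
fn w_cubic(t: f64, w_max: f64) -> f64 {
    // K is the time at which the curve climbs back up to w_max.
    let k = (w_max * (1.0 - BETA_CUBIC) / C).cbrt();
    C * (t - k).powi(3) + w_max
}

fn main() {
    let w_max = 100.0;
    for t in [0.0, 1.0, 2.0, 4.0, 8.0] {
        // Concave growth back toward w_max, then convex probing beyond it.
        println!("t = {t}s, target cwnd ~= {:.1} segments", w_cubic(t, w_max));
    }
}
```

Right after a loss the window restarts from roughly BETA_CUBIC times W_max, flattens out as it approaches the old maximum, and only then probes aggressively beyond it.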
3. What was the symptom of the CUBIC bug in QUIC?
The bug first appeared as sporadic failures in our ingress proxy integration tests. When CUBIC was tested under heavy packet loss early in the QUIC connection, the congestion window would shrink to its minimum value and then never recover. This permanent pinning of cwnd at the minimum meant the connection could no longer grow its sending rate, effectively stalling data transfer. The failure rate was alarmingly high: 61% of the test runs exhibited this behavior. Recovery after a congestion collapse—when cwnd is at its floor—is an uncommon but critical regime for any CCA. Most tests focus on steady-state growth, but this bug exposed a dangerous corner case. The connection was stuck in a mode where it could not recover, defeating the purpose of congestion control. This was particularly concerning because quiche carries a significant share of our production QUIC traffic.
4. What caused the bug?
The root cause traced back to a Linux kernel change that implemented the app-limited exclusion in CUBIC, as described in RFC 9438 §4.2-12. The intent was correct: when a TCP sender is app-limited (i.e., it doesn't have enough data to fully use the network), cwnd should not grow aggressively, avoiding undue competition. This fix addressed a real TCP problem. However, when we ported the change to our QUIC implementation, it surfaced unexpected behavior due to differences in how QUIC handles acknowledgments and pacing. The app-limited condition was being triggered during the recovery phase after loss, even when the sender had data to send. This led CUBIC to believe the connection was app-limited and therefore refuse to increase cwnd. The bug was essentially a mismatch between the original TCP context and QUIC's transport semantics: a state-machine quirk locked cwnd at its minimum, preventing any growth.
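The mechanics are easier to see in a stripped-down sketch. Nothing below is quiche's real code; `ToyCubic`, its fields, and `on_ack_buggy` are assumptions made for illustration. What matters is the ordering: the app-limited test runs before the recovery state is taken into account, so a marker left over from the recovery lull keeps short-circuiting window growth.

```rust
/// Toy controller used only to illustrate the check ordering; the type,
/// field, and method names are assumptions for this sketch, not quiche's API.
struct ToyCubic {
    cwnd: usize,       // congestion window in bytes
    min_cwnd: usize,   // floor the window never drops below
    in_recovery: bool, // still reacting to an earlier loss event
    app_limited: bool, // "the sender had nothing to transmit" marker
}

impl ToyCubic {
    /// Problematic ordering: the app-limited marker is consulted before the
    /// recovery state. During recovery the send path goes quiet, the marker
    /// gets set, and from then on every ACK is ignored for growth purposes,
    /// even though the application has data queued.
    fn on_ack_buggy(&mut self, acked_bytes: usize) {
        if self.app_limited {
            return; // recovery quiet time misread as application idleness
        }
        if self.in_recovery {
            return; // correct: no growth while still recovering from loss
        }
        self.cwnd = self.cwnd.saturating_add(acked_bytes);
    }
}

fn main() {
    // Recovery is over and data is queued again, but the stale marker from
    // the lossy period is still set, so the window never leaves its floor.
    let mut cc = ToyCubic {
        cwnd: 2_400,
        min_cwnd: 2_400,
        in_recovery: false,
        app_limited: true,
    };
    for _ in 0..10 {
        cc.on_ack_buggy(1_200);
    }
    assert_eq!(cc.cwnd, cc.min_cwnd); // stuck: no growth despite clean ACKs
    println!("cwnd pinned at {} bytes", cc.cwnd);
}
```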

5. How was the bug fixed?
The fix turned out to be remarkably elegant: a near-one-line change that broke the vicious cycle. The team realized that the app-limited check should only apply when the sender is truly idle (no data to send), not during recovery from loss. By moving the check to a later point in the code, after the connection's recovery state has been determined, the algorithm correctly allowed cwnd to start growing again once recovery ended and the sender had data in flight. This small adjustment prevented the erroneous permanent stall. The change was carefully tested and validated, restoring the expected behavior: CUBIC now recovers from congestion collapse in QUIC just as it does in TCP. The fix also highlighted the importance of testing edge cases, like early heavy loss, which are often overlooked in standard congestion control evaluations.
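Staying with the toy types from the earlier sketch (still assumed names, not the actual quiche change), the reordered logic looks roughly like this: the recovery question is settled first, and the sender only counts as app-limited when it genuinely has nothing queued to send.

```rust
impl ToyCubic {
    /// Corrected ordering (sketch): decide the recovery question first, and
    /// only treat the sender as app-limited when it truly has no data queued,
    /// rather than trusting a marker left over from the recovery lull.
    fn on_ack_fixed(&mut self, acked_bytes: usize, has_unsent_data: bool) {
        if self.in_recovery {
            return; // still reacting to the earlier loss event
        }
        if !has_unsent_data {
            return; // genuinely idle sender: don't grow the window
        }
        self.cwnd = self.cwnd.saturating_add(acked_bytes);
    }
}
```

With that ordering, the same stream of clean ACKs from the previous example lets cwnd climb off its floor as soon as recovery ends and the application has data waiting, which matches the behavior the fix restored.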
6. What are the key takeaways from this bug?
This bug serves as a powerful reminder that even well-tested algorithms can fail in new contexts. First, porting kernel-level optimizations to user-space protocols like QUIC requires careful re-evaluation of assumptions—especially around timing and state machines. Second, congestion control edge cases, such as recovery after severe loss, deserve more testing; they are rare but catastrophic when they fail. Third, the fix’s simplicity (a single line change) underscores that bugs can lurk in subtle interactions between pieces of code. Cloudflare now applies more rigorous testing for low-cwnd regimes. For the wider community, this story encourages sharing such experiences to help improve the robustness of congestion control across different transport protocols. Ultimately, it’s a win for both TCP and QUIC reliability.