When CUBIC's Congestion Window Freezes: A QUIC Bug Story

Last updated: 2026-05-15 09:35:14 · Linux & DevOps

In this deep dive, we unravel a curious bug where CUBIC's congestion window (cwnd) gets stuck at its minimum value, unable to recover after a congestion collapse. The issue emerged when we ported a Linux kernel optimization to our QUIC implementation, quiche. We'll explore the CUBIC logic, the unexpected symptom, the root cause, and the one-line fix that resolved it. Let's get into the questions.

What exactly is CUBIC and why is it important?

CUBIC, standardized in RFC 9438, is the default congestion controller in Linux. It governs how most TCP and QUIC connections on the public Internet probe for bandwidth, detect loss, and recover. At Cloudflare, our QUIC implementation (quiche) uses CUBIC by default. A congestion controller adjusts the congestion window (cwnd): the limit on bytes in flight (sent but not acknowledged). A larger cwnd allows more data per round trip; a smaller one throttles traffic. CUBIC is a loss-based algorithm: it increases cwnd when no loss is detected (assuming available bandwidth) and decreases it upon loss (assuming capacity exceeded). This logic helps maximize network utilization.
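To make the growth rule concrete, here is a minimal sketch of CUBIC's window curve from RFC 9438, with the window measured in MSS-sized segments. The constants C = 0.4 and beta = 0.7 come from the RFC; the function names are illustrative rather than quiche's actual API, and the sketch assumes cwnd falls to beta * W_max after a loss.

```rust
// Minimal sketch of the CUBIC window curve (RFC 9438), in segments.
// Illustrative names only; this is not quiche's implementation.

const C: f64 = 0.4; // CUBIC scaling constant (RFC 9438)
const BETA: f64 = 0.7; // multiplicative decrease factor

/// Target window `t` seconds after the last congestion event, where
/// `w_max` is the window size just before that event.
fn w_cubic(t: f64, w_max: f64) -> f64 {
    // K is the time at which the curve climbs back to w_max, assuming
    // the window was cut to BETA * w_max when the loss was detected.
    let k = (w_max * (1.0 - BETA) / C).cbrt();
    C * (t - k).powi(3) + w_max
}

fn main() {
    let w_max = 100.0; // segments in flight before the loss
    for t in [0.0, 1.0, 2.0, 3.0, 4.0, 5.0] {
        println!("t={t}s -> cwnd ~ {:.1} segments", w_cubic(t, w_max));
    }
}
```

The curve is concave while approaching W_max (cautious near the level where loss last occurred) and convex beyond it (probing aggressively for newly freed bandwidth). It is exactly this growth that the bug below prevented from ever starting.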

Source: blog.cloudflare.com

What was the symptom that triggered the investigation?

The trouble began with flaky failures in our ingress proxy integration tests, specifically in tests where CUBIC faced heavy loss early in the connection. These failures occurred 61% of the time. Recovery after a congestion collapse is a rare scenario, but it is exactly the situation a congestion controller exists to handle. Most tests focus on the steady-state or growth phases; few probe the minimum-cwnd region after a connection has been beaten down, so bugs in this corner stay hidden until something triggers them. Our tests showed that after hitting minimum cwnd, CUBIC never increased it again, effectively freezing the connection's throughput.
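To illustrate the kind of probe this corner needs, here is a toy regression test against a stand-in controller. The types and the two-packet floor are assumptions for the sketch (the floor mirrors RFC 9002's minimum window of two max-sized datagrams); this is not our integration harness or quiche's API.

```rust
// Toy regression test for the minimum-cwnd corner: collapse the
// window with repeated losses, then check that an ACK grows it again.
// Stand-in controller; hypothetical, not quiche's API.

const MIN_CWND: u64 = 2 * 1200; // floor: two max-sized packets (assumed)

struct Ctl {
    cwnd: u64, // congestion window, in bytes
}

impl Ctl {
    fn on_loss(&mut self) {
        // Halve on loss, never dropping below the floor.
        self.cwnd = (self.cwnd / 2).max(MIN_CWND);
    }
    fn on_ack(&mut self, acked: u64) {
        self.cwnd += acked; // any growth at all is what we assert on
    }
}

#[test]
fn cwnd_recovers_from_minimum() {
    let mut ctl = Ctl { cwnd: 48_000 };
    for _ in 0..10 {
        ctl.on_loss(); // simulated congestion collapse
    }
    assert_eq!(ctl.cwnd, MIN_CWND); // beaten down to the floor
    ctl.on_ack(1_200); // one cleanly acknowledged packet
    assert!(ctl.cwnd > MIN_CWND, "window froze at its minimum");
}
```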

What was the Linux kernel change that started this?

A Linux kernel patch aimed to align CUBIC with RFC 9438's guidance for application-limited flows (Section 5.8): when a sender is application-limited, meaning it cannot send more because the app hasn't provided the data, the congestion controller should not grow cwnd based on that idle period, because ACKs received while the window is not full carry no evidence about available capacity. This made sense for TCP, where the kernel knows when the app is idle. Porting the logic to our QUIC implementation surfaced unexpected behavior: QUIC, unlike TCP, tracks application-limited state in a different layer of the stack, and the naive port introduced a subtle bug.
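In shape, the exclusion looks roughly like the following sketch, with hypothetical names standing in for both the kernel's and quiche's actual internals: the send path records whether the app ran dry, and the ACK path skips growth for samples taken while app-limited.

```rust
// Rough shape of the app-limited exclusion (hypothetical names).
struct Cubic {
    cwnd: u64,         // congestion window, in bytes
    app_limited: bool, // set on the send path when the app ran dry
}

impl Cubic {
    fn on_ack(&mut self, acked_bytes: u64) {
        // An ACK for data sent while app-limited carries no evidence
        // about path capacity, so it must not grow the window.
        if self.app_limited {
            return;
        }
        // Real CUBIC growth would go here; linear placeholder.
        self.cwnd += acked_bytes;
    }
}
```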

How did the app-limited exclusion cause cwnd to freeze?

In QUIC (and in our quiche implementation), the app-limited flag was derived from whether the application had data queued to send at the moment of the check. During a congestion collapse (heavy loss), the algorithm reduces cwnd to its minimum and enters a recovery phase. Here is the bug: after the congestion event, if the application had no new data at that instant (a common scenario due to backpressure), the code marked the connection as app-limited and skipped the usual cwnd growth on subsequent round trips. The kernel change assumed app-limited periods are brief, but at minimum cwnd the condition feeds itself: a tiny window drains almost immediately, so the send queue is nearly always empty when sampled, so the flag stays set and growth is skipped. The connection remained stuck at a tiny window.
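The heart of the problem reduces to a single predicate, shown here with a hypothetical name: "no data queued right now" was read as "app-limited", and at minimum cwnd that instantaneous check is almost always true.

```rust
// The buggy predicate (hypothetical name). At minimum cwnd the window
// drains almost instantly, so under backpressure the send queue is
// nearly always empty at the moment of the check: the flag reads
// true, growth is skipped, the window stays tiny, and the
// backpressure persists. A self-sustaining freeze.
fn is_app_limited_buggy(pending_bytes: u64) -> bool {
    pending_bytes == 0 // true on many consecutive updates
}
```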

What was the one-line fix that solved it?

The fix was deceptively simple: we changed the condition that decided whether to apply the app-limited exclusion. Instead of checking whether the application currently had no data to send (which could be true on many consecutive updates), we checked whether the connection had been continuously app-limited for more than one round trip; if it hadn't, cwnd growth proceeded as usual. This small change, essentially one line of code, broke the cycle: after congestion recovery, the first round trip with pending data triggered a cwnd increase even if a later moment within it was idle. The connection could now escape the minimum-cwnd trap and resume normal CUBIC behavior.
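We won't reproduce the literal quiche diff here, but the shape of the fix looks like this sketch, again with hypothetical names: the exclusion keys off the duration of continuous app-limited time rather than the instantaneous state of the send buffer.

```rust
use std::time::{Duration, Instant};

// Shape of the fix (hypothetical names, not the literal quiche diff).
struct AppLimitedGate {
    // When the connection last became app-limited; None while it
    // still has data to send.
    app_limited_since: Option<Instant>,
}

impl AppLimitedGate {
    /// Suppress cwnd growth only if the sender has been continuously
    /// app-limited for more than one round trip.
    fn skip_growth(&self, now: Instant, rtt: Duration) -> bool {
        match self.app_limited_since {
            Some(since) => now.duration_since(since) > rtt,
            None => false,
        }
    }
}
```

With this gate, a momentarily empty buffer right after recovery no longer suppresses growth; only a sender that has been genuinely idle for a full round trip does.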

What lessons can we learn from this bug?

This story highlights several key takeaways. First, porting kernel optimizations to user-space protocols like QUIC requires careful attention to architectural differences: what works for TCP, where the kernel controls both scheduling and congestion, may break in QUIC, where the application layer governs data availability. Second, testing edge cases like minimum-cwnd recovery is critical; most performance tests ignore these regimes entirely. Third, a small, elegant fix can resolve a seemingly complex issue once you understand the root cause. Finally, the app-limited concept, while valid for TCP, must be adapted to QUIC's context to avoid an unintended cwnd freeze.