Tuesday, November 18, 2025

Cutting Payload aka Packet Trimming—the CRISPR-Cas9 of Network Flow Control?

One of my tasks in the (backbone) network team at Switch involves identifying and following "trends" outside our bubble that might somehow end up affecting our business. A current trend with possible impact is "AI"—maybe you have heard of it—which today means mostly machine learning using (deep) artificial neural networks, and in particular large language models (LLMs).

There are already noticeable changes in how datacenters are built to accommodate LLMs—check out what OCP has been doing in the past few years. These changes include (datacenter) networks, which must now support the needs of large GPU clusters, in particular for LLM training. To use those high-performance—and high-cost—GPUs effectively, one wants to be able to copy large amounts of data between GPUs anywhere in the cluster, which today might contain 32768 GPUs or so, with some striving for a scale of millions. These transfers typically use remote memory-to-memory transfer without CPU involvement (RDMA). There are significant efforts to make this work over some sort of standardized Ethernet-based network. This could be done with RoCEv2 (RDMA over Converged Ethernet, version 2), where the "Converged" hints at some modifications beyond traditional Ethernet.

Some techniques were already considered for addition to Ethernet for datacenter networks before the LLM hype wave, for example PFC (priority flow control) to support "lossless" operation of the network for specific subsets of traffic, or novel ECMP (equal-cost multipath) approaches that don't necessarily preserve per-flow ordering (using "spraying" instead).
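
To make the difference concrete, here is a toy Python sketch (my own illustration, not taken from any standard): classic ECMP hashes a flow's five-tuple so that all of its packets take the same uplink, while spraying picks an uplink per packet and gives up on in-order delivery.

    import random

    NUM_LINKS = 4   # number of equal-cost uplinks (made-up topology)

    def ecmp_flow_hash(five_tuple) -> int:
        # Classic ECMP: all packets of a flow hash to the same uplink, which
        # preserves per-flow ordering but can leave some uplinks idle while
        # others are overloaded by a few large flows.
        return hash(five_tuple) % NUM_LINKS

    def ecmp_spray(_five_tuple) -> int:
        # Packet spraying: each packet picks an uplink independently, spreading
        # load almost evenly at the cost of reordering packets within a flow.
        return random.randrange(NUM_LINKS)

    flow = ("10.0.0.1", "10.0.0.2", 17, 49152, 4791)  # src, dst, proto, sport, dport
    print([ecmp_flow_hash(flow) for _ in range(5)])   # same uplink every time
    print([ecmp_spray(flow) for _ in range(5)])       # typically varies per packet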

Recently I was excited when I heard about another technique to be added: packet "trimming". It reminded me of a great (video) presentation by Mark Handley (UCL) from SIGCOMM 2017. I'll add the link later. For now I'll refer to this work as "the NDP paper".

I remember being impressed when I saw the video, and I also found that it was well received. Just having a paper accepted at SIGCOMM is one of the highest achievements for (Internet-affine) network researchers, and this paper was awarded best paper that year. I also remember thinking something along the lines of... "wow, that was an awesome presentation and must have been the talk of the conference... but what are the chances of this being implemented in production networks within my lifetime?". Focused on "classical" Internet backbones and enterprise networks, I thought that this was just a bit too revolutionary/"disruptive" to have much of a chance in our environment, which has shown strong resistance to seemingly much simpler changes (AQM, ECN, making addresses wider...). But 7–8 years later and thanks to the investment craze kicked off by ChatGPT, maybe the time is ripe!

One thing I was wondering about was whether the NDP paper was the first to introduce "trimming". Upon re-watching the video after several years (and understanding a bit more than at previous attempts :-) I caught Mark referring to previous work from Tsinghua as the inspiration for trimming. A bit of searching turned up this NSDI 2014 presentation of the paper "Catch the Whole Lot in an Action: Rapid Precise Packet Loss Notification in Data Centers" by Peng Cheng (then a Ph.D. student at Tsinghua University, now at Microsoft Research Asia) et al. It includes a nice introduction where Peng Cheng talks about cooking and his use of scissors as a multipurpose tool, suggesting that his "cut payload" approach could be similarly broadly useful. (A bit like the CRISPR "gene scissors" in bioengineering today, no?) Maybe Peng Cheng et al.'s work could be considered for a "test of time" award.

So where are we now, in 2025? Looks like concepts from the NSDI 2014 and SIGCOMM 2017 papers (trimming, spraying...) will be added to the Ultra Ethernet standard. It is not clear to me how the trimming capability will be used for congestion management. Maybe not at all in the initial standard?
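
For readers who, like me at first, are wondering what trimming actually does: here is a minimal Python sketch of the mechanism as I understand it from the NDP paper. The queue limit, field names and data structures are invented for illustration; real switches of course do this in hardware.

    from collections import deque
    from dataclasses import dataclass

    @dataclass
    class Packet:
        seq: int               # real headers also carry addresses, ports, etc.
        payload: bytes
        trimmed: bool = False

    DATA_QUEUE_LIMIT = 8       # made-up threshold; NDP argues for very shallow buffers

    data_queue: deque = deque()    # lower-priority queue for full packets
    header_queue: deque = deque()  # higher-priority queue for trimmed headers

    def enqueue(pkt: Packet) -> None:
        # When the data queue is full, cut off the payload and forward just the
        # header at high priority, instead of silently dropping the whole packet.
        if len(data_queue) < DATA_QUEUE_LIMIT:
            data_queue.append(pkt)
        else:
            pkt.payload = b""
            pkt.trimmed = True
            header_queue.append(pkt)

The point is that the receiver still learns, one (small, high-priority) header later, exactly which packet lost its payload, instead of having to infer the loss from a timeout.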

Thinking a bit further: Assuming that trimming becomes a standard function in data center switches, could we use it in other network contexts? For example, to reduce "bufferbloat" for broadband users, or generally in the wide-area Internet... If people see potential for this, then I guess some work would be needed (including standardization work, presumably in the IETF), both to ensure that trimmed packets could safely be sent over the Internet and to develop Internet-deployable transport protocols (based on something like NDP?) that can make use of them.
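
To illustrate why the transport protocol has to cooperate, here is an equally rough sketch of how an NDP-style receiver-driven transport might react to an arriving packet; the message names (NACK/ACK/PULL) are my own shorthand, not an actual wire format.

    from types import SimpleNamespace

    def handle_arrival(pkt, send_control, deliver) -> None:
        # pkt carries .seq, .payload and .trimmed, as in the switch sketch above.
        # A trimmed header still carries the sequence number, so the receiver can
        # ask for the missing payload immediately instead of waiting for a timeout,
        # and it can pace "pull" requests to match its own access-link rate.
        if pkt.trimmed:
            send_control({"type": "NACK", "seq": pkt.seq})
        else:
            deliver(pkt.payload)
            send_control({"type": "ACK", "seq": pkt.seq})
        send_control({"type": "PULL", "after": pkt.seq})

    # Toy usage: just print the control messages instead of sending them.
    handle_arrival(SimpleNamespace(seq=7, payload=b"", trimmed=True), print, print)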

My intuition says that there is some potential there, and if done well, this novel idea could become a useful addition to the Internet.

But one assumption that justifies trimming is that "serialization delay" (i.e. the time it takes to put the bits of a packet on the wire) is significant relative to "propagation time" (the time it takes the packet to travel through the network). This can be true for networks that are relatively small in geographical size—for example within a data center—or for networks that have relatively "slow" (low-bitrate) links; or both, of course! (There is also a dependence on packet size, but for now let's assume that the maximum packet size is similar for all networks, between ~1500 and ~9000 bytes. This is the case in practice today, although it would be interesting to study to what extent trimming could make large-MTU networks more viable.)
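
Some back-of-the-envelope numbers, assuming a full-size 1500-byte packet and signal propagation at roughly two thirds of the speed of light in fibre:

    PACKET_BITS = 1500 * 8          # one full-size Ethernet frame, roughly

    def serialization_delay(rate_bps: float) -> float:
        # Seconds needed to put one packet onto the wire at a given link rate.
        return PACKET_BITS / rate_bps

    def propagation_delay(distance_m: float) -> float:
        # One-way delay in fibre, at roughly 2e8 m/s.
        return distance_m / 2e8

    # Datacenter-ish case: 100 m of fibre, 100 Gb/s links
    print(serialization_delay(100e9))   # 1.2e-07 s, i.e. 120 ns
    print(propagation_delay(100))       # 5e-07 s, i.e. 500 ns one way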

So assuming that trimming makes sense in a DC network where end-to-end round trips take on the order of a microsecond (100m distance) and link speeds are on the order of 100Gb/s, it should make just as much sense on a network with one millisecond RTT and links of 100Mb/s, or one with 100ms RTT and links of 1Mb/s. OK, such delay/rate combinations aren't frequently seen in classical "research & education networking" these days. But some interesting link technologies used at the network "edge" may sometimes exhibit such low rates—Wi-Fi when the station is far away from the access point, or low-energy short/medium-distance wireless.
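
A quick sanity check of that scaling argument, again with a 1500-byte packet:

    # (link rate in bit/s, round-trip time in seconds)
    scenarios = {
        "DC fabric, 100 Gb/s, ~1 us RTT":   (100e9, 1e-6),
        "100 Mb/s link, 1 ms RTT":          (100e6, 1e-3),
        "1 Mb/s link, 100 ms RTT":          (1e6,   100e-3),
    }

    for name, (rate_bps, rtt_s) in scenarios.items():
        ser = 1500 * 8 / rate_bps   # serialization delay of one full-size packet
        print(f"{name}: serialization delay / RTT = {ser / rtt_s:.2f}")

    # All three cases print 0.12: one packet occupies the same fraction of a
    # round-trip time in each of them, so the trade-off looks similar on paper.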

My gut feeling is that packet trimming may become commonplace in data center networks, provided that the AI "summer" continues for a few years. It may also disappear again if the complexity of integrating it with higher-level protocols ends up outweighing the gains. But it's encouraging that such "outside-the-box" ideas can gain traction in the market—when the stars align just right.

Oh yeah, here is the promised link to Mark Handley's presentation: Re-architecting datacenter networks and stacks for low latency and high performance, SIGCOMM 2017 proceedings. See the "supplementary material" section for the video.