RG-Arch reading group material: QUIC is not Quick Enough over Fast Internet

kota-yata

June 26, 2025

Transcript

  1. QUIC is not Quick Enough over Fast Internet

    Xumiao Zhang, Shuowei Jin, Yi He, Ahmad Hassan, Z. Morley Mao, Feng Qian, Zhi-Li Zhang
    Presenter: Kota Yatagai
    WWW '24: Proceedings of the ACM Web Conference 2024
  2. Abstract

    • QUIC is a UDP-based transport protocol, expected to be a game-changer in improving web application performance
    • Most prior studies focused on lower-bandwidth scenarios or theoretical performance
    • This paper fills this knowledge gap, quantifying QUIC's performance limitations on fast networks
    • The experiments show that HTTP/3 underperforms HTTP/2 in a high-bandwidth environment
  3. Rise of high-bandwidth networks

    • WiFi 6/7 and 5G often reach more than 500 Mbps, and up to 1+ Gbps, per connection
    • The demand also exists: 4K/8K live streaming, VR/AR and cloud gaming
    • QUIC/HTTP/3 is often described as “high throughput” but remains insufficiently studied under high-speed scenarios.
  4. Brief overview of HTTP/x

    • HTTP/2: TCP+TLS1.2
      ◦ Traditional application protocol
      ◦ TCP stack is in kernel
    • HTTP/3: QUIC
      ◦ Newer application protocol
      ◦ QUIC is in user space
  5. QUIC Transport Performance

    Comparing HTTP/3 vs HTTP/2 in file transmissions
    • CPU: Intel Xeon E5-2640 on the server, Intel Core i7-6700 on the client
    • OS: Ubuntu 18.04
    • The server and the client are connected through a 1-Gbps Ethernet
      ◦ Only two hops away from each other
    • HTTP Server: OpenLiteSpeed build based on LSQUIC
    • Congestion Control: CUBIC (default in Linux kernel)
    • UDP and TCP buffer sizes are adjusted to exceed 10x the link’s BDP
      ◦ This prevents buffer starvation
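The "10x the link's BDP" sizing can be made concrete with a small calculation. The sketch below is only illustrative: it assumes the 1-Gbps link speed from this slide and uses the 0.23 ms ping RTT reported later in the deck as the delay term. Whether the authors computed the BDP from that exact RTT is an assumption, and the function names are hypothetical.

```python
def bdp_bytes(bandwidth_bps: int, rtt_us: int) -> int:
    """Bandwidth-delay product in bytes (bits/s * microseconds -> bytes)."""
    return bandwidth_bps * rtt_us // (8 * 1_000_000)


def min_buffer_bytes(bandwidth_bps: int, rtt_us: int, factor: int = 10) -> int:
    """Smallest socket buffer that reaches `factor` times the BDP."""
    return factor * bdp_bytes(bandwidth_bps, rtt_us)


LINK_BPS = 1_000_000_000  # 1-Gbps Ethernet (from the slide)
PING_RTT_US = 230         # 0.23 ms ping RTT (assumed to be the relevant delay)

bdp = bdp_bytes(LINK_BPS, PING_RTT_US)
buf = min_buffer_bytes(LINK_BPS, PING_RTT_US)
print(f"BDP: {bdp} bytes, 10x BDP: {buf} bytes")
```

Under these assumptions the BDP is small, so a buffer of 10x the BDP is easy to provision; the point of the setting is to rule out buffer starvation as an explanation for the throughput gap.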
  6. File Download on Lightweight Clients: throughput

    Downloading files of different sizes ranging from 50MB to 1GB
    • The client uses curl over HTTP/2, curl over HTTP/3, and quic_client
    • HTTP/2 always outperforms HTTP/3
    • On average, the throughput of cURL running HTTP/3 and that of quic_client are 7-16% and 8-12% lower, respectively
  7. File Download on Lightweight Clients: CPU usage

    Downloading files of different sizes ranging from 50MB to 1GB
    • The CPU usage of cURL when running HTTP/3 is higher than that of cURL on HTTP/2
    • quic_client’s CPU usage is even higher
      ◦ quic_client is a minimal implementation with neither optimized packet handling nor thread separation
  8. File Download on Lightweight Clients: varying bandwidth

    • At low bandwidth (<600 Mbps), HTTP/3 and HTTP/2 exhibit similar performance
    • Beyond 600 Mbps, HTTP/3's actual throughput starts to be bottlenecked
    • A noticeable performance disparity emerges as bandwidth increases
  9. File Download on Chrome: throughput

    Downloading files of different sizes ranging from 50MB to 1GB
    • Chrome with HTTP/2 always outperforms Chrome with HTTP/3
    • The gap is larger than the one observed with curl
  10. File Download on Chrome: CPU usage and varying bandwidth

    • Pretty much the same result as with curl
    • HTTP/3’s average throughput barely reaches 478 Mbps on Chrome.
  11. Application Study: Video streaming methodology

    • ffmpeg is used to encode a custom 4K video with H.264, generating six tracks at different bitrates
      ◦ Bitrate: 20 Mbps, 40 Mbps, 80 Mbps, 120 Mbps, 160 Mbps, and 200 Mbps
    • dash.js is used to package three different chunk lengths: 1s, 2s, and 4s
    • Bitrate adaptation algorithms
      ◦ Buffer based: change the bitrate based on the playback buffer level
      ◦ Rate based: always seek the highest possible bitrate
    • In addition to 1-Gbps Ethernet, 4G and 5G traces are used for the experiment
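The two families of bitrate adaptation can be sketched as follows. This is a simplified illustration, not the algorithms actually shipped in dash.js or used in the paper: the bitrate ladder comes from the slide, while the `reservoir_s` and `cushion_s` thresholds and both function names are hypothetical.

```python
BITRATES_MBPS = [20, 40, 80, 120, 160, 200]  # the six tracks from the slide


def rate_based(throughput_mbps: float, ladder=BITRATES_MBPS) -> int:
    """Pick the highest track that fits within the measured throughput."""
    candidates = [b for b in ladder if b <= throughput_mbps]
    return max(candidates) if candidates else min(ladder)


def buffer_based(buffer_s: float, reservoir_s: float = 5.0,
                 cushion_s: float = 20.0, ladder=BITRATES_MBPS) -> int:
    """Map the playback buffer level linearly onto the bitrate ladder.

    Below the reservoir the lowest track is used; above reservoir + cushion
    the highest one. Both thresholds are made-up values for illustration.
    """
    if buffer_s <= reservoir_s:
        return ladder[0]
    if buffer_s >= reservoir_s + cushion_s:
        return ladder[-1]
    fraction = (buffer_s - reservoir_s) / cushion_s
    return ladder[int(fraction * (len(ladder) - 1))]


print(rate_based(100))     # highest track not exceeding 100 Mbps
print(buffer_based(15.0))  # mid-level buffer maps to a mid-ladder track
```

In either scheme a transport that delivers less throughput or refills the buffer more slowly ends up on a lower track, which is how a transport-level gap shows up as a bitrate gap.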
  12. Application Study: Video streaming result

    • HTTP/2 achieves a higher bitrate on 1-Gbps Ethernet and 5G
      ◦ The primary reason could be CPU resource utilization
    • No significant differences when streaming over 4G
      ◦ At moderate bandwidth there is no difference between HTTP/2 and HTTP/3
  13. Application Study: Web page loading metrics

    • Alexa Top 100 sites, each site tested 20 times
      ◦ The sites are downloaded and hosted locally so the authors can manually switch between HTTP/2 and HTTP/3
    • Content Download Time (CDT)
      ◦ the time to download all content needed to load the website, after which the rendering process can start
    • Page Load Time (PLT)
      ◦ the time until the rendering of all components of the page is finished
    • Time To First Byte (TTFB)
      ◦ the delay from sending the request to receiving the first byte of the response
  14. Application Study: Web page loading

    • Alexa Top 100 sites, each site tested 20 times
      ◦ The sites are downloaded and hosted locally so the authors can manually switch between HTTP/2 and HTTP/3
  15. Root Cause Analysis

    Eliminating Non-contributing Factors
    • Server software
      ◦ nginx was run instead of OpenLiteSpeed
      ◦ A 1GB file transfer was conducted, and QUIC was even slower, by 18%
    • UDP/TCP performance
      ◦ iPerf was run over both, and no significant difference was observed
      ◦ UDP achieved 958 Mbps and TCP achieved 944 Mbps on average
    • TLS Encryption
      ◦ The initial test was done with TLS_AES_128_GCM_SHA256
      ◦ Other cipher suites were benchmarked, and there were no significant differences
  16. Root Cause Analysis

    Eliminating Non-contributing Factors
    • QUIC Parameter tuning
      ◦ Packet pacing, path MTU discovery etc.
      ◦ No noticeable improvements were observed from these changes
    • Client OS
      ◦ The initial test was done on Ubuntu for both the client and server
      ◦ macOS and Windows were tested, and no improvements were observed
    • Disk and Memory
      ◦ tmpfs and HugePages were tested, and there were no differences
      ◦ (this was to be expected: none of those tests were memory intensive)
  17. Root Cause Analysis by Packet Capturing

    • HTTP/3 receives many more packets than HTTP/2
      ◦ The number of packets received by the OS’s UDP stack was an order of magnitude higher than that received by the TCP stack
      ◦ This was not caused by retransmissions, and all QUIC packets were MTU-sized
      ◦ TCP uses GRO while QUIC does not
    • HTTP/3 has a much higher RTT, dominated by local processing
      ◦ 1.9ms for HTTP/2 and 16.2ms for HTTP/3 (Download)
      ◦ The ping RTT between the two machines is only 0.23ms
        ▪ The endpoint packet processing accounts for most of the latency
  18. Root Cause Analysis via OS/Chromium Profiling

    • Excessive Receiver-side Processing in the Kernel
      ◦ The Linux net subsystem was monitored during 1GB file downloads
      ◦ netif_receive_skb: 231K calls for UDP, only 15K calls for TCP
      ◦ This corresponds to the number of packets received
    • As explained, NIC offloading has been widely used for TCP
      ◦ TSO and GSO on the sender side and GRO on the receiver side
      ◦ QUIC has variable-length encrypted packets, which makes GSO/GRO hard
      ◦ Transmitting many UDP datagrams in a single burst is a sin
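Why receive offload cuts the number of trips up the network stack can be illustrated with a toy model. The sketch below assumes fixed 1500-byte segments and a 64 KiB coalescing limit, both of which are illustrative parameters rather than values taken from the paper, and it is not meant to reproduce the measured 231K and 15K call counts.

```python
def ceil_div(a: int, b: int) -> int:
    """Integer ceiling division."""
    return -(-a // b)


def stack_deliveries(total_bytes: int, segment_bytes: int,
                     gro_limit_bytes: int = 0) -> int:
    """Count how many buffers are handed up the stack for one transfer.

    Without GRO (gro_limit_bytes == 0) every segment is delivered on its own.
    With GRO, consecutive segments are merged until the limit is reached.
    """
    if gro_limit_bytes <= 0:
        return ceil_div(total_bytes, segment_bytes)
    segments_per_batch = max(1, gro_limit_bytes // segment_bytes)
    return ceil_div(total_bytes, segments_per_batch * segment_bytes)


ONE_GB = 10**9
without_gro = stack_deliveries(ONE_GB, 1500)
with_gro = stack_deliveries(ONE_GB, 1500, gro_limit_bytes=65536)
print(without_gro, with_gro, round(without_gro / with_gro, 1))
```

Even in this simplified model, coalescing reduces per-packet processing by more than an order of magnitude, which matches the direction of the gap reported on the slide.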
  19. Root Cause Analysis via OS/Chromium Profiling

    Additional experiments for QUIC packet offloading
    • Offloading mechanisms (TSO, GSO, and GRO) were enabled and disabled on both the server and client sides
    • QUIC does not utilize GSO/GRO
      ◦ ⇧ LIE. Chrome's QUIC simply did not utilize GSO/GRO at that point
      ◦ In 2024, at least quic-go and ms-quic supported GSO
  20. Root Cause Analysis via OS/Chromium Profiling

    Excessive Receiver-side Processing in the User Space
    • In addition to packet decoding, HTTP/3 needs to generate response packets such as ACKs in user space, which introduces further overhead
  21. Recommendations for Mitigation

    Adoption of UDP GSO/GRO (sendmmsg, recvmmsg)
    • Reducing user-kernel switching is the key
    • GSO and GRO alleviate the frequent system calls (if implemented in the NIC), or the frequent data copies (if implemented in the kernel)
    • Many QUIC implementations now support GSO
      ◦ Kernels don’t do PMTUD for UDP, so QUIC has to tell the kernel what segment size to use
      ◦ quiche, quic-go, ms-quic etc.
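On Linux, "telling the kernel what size to segment" is done with the `UDP_SEGMENT` socket option. The sketch below is a minimal, hedged illustration of that one step: the option number 103 is taken from `<linux/udp.h>` and defined locally because the Python `socket` module does not export it on every version, the helper name `enable_udp_gso` is hypothetical, and the 1200-byte segment size is an arbitrary example value. On kernels or platforms without UDP GSO the call simply reports failure.

```python
import socket

# UDP_SEGMENT as defined in <linux/udp.h>; assumed to apply to a Linux host.
UDP_SEGMENT = 103


def enable_udp_gso(sock: socket.socket, segment_size: int) -> bool:
    """Ask the kernel to split each large send into segment_size datagrams.

    Returns False when the platform or kernel rejects the option.
    """
    try:
        sock.setsockopt(socket.IPPROTO_UDP, UDP_SEGMENT, segment_size)
        return True
    except OSError:
        return False


sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
# 1200 bytes is a hypothetical QUIC packet size chosen for the example.
gso_on = enable_udp_gso(sock, 1200)
print("UDP GSO enabled:", gso_on)
sock.close()
```

With the option set, one send call carrying several segments' worth of data is segmented below the socket layer, so the application crosses the user-kernel boundary once per batch instead of once per packet.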
  22. Recommendations for Mitigation

    Delayed-Ack
    • Reducing the ACK frequency alleviates the response generation overhead
      ◦ QUIC’s default ACK frequency is 1 ACK per 2 packets
      ◦ The trade-off is slower packet loss detection
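The effect of the ACK frequency on user-space work can be estimated with simple arithmetic. In the sketch below the "1 ACK per 2 packets" default comes from the slide; the packet count and the alternative value of 10 are hypothetical, chosen only to show the scaling.

```python
def acks_generated(packets_received: int, packets_per_ack: int) -> int:
    """Number of ACKs sent when one ACK covers `packets_per_ack` packets."""
    return -(-packets_received // packets_per_ack)  # ceiling division


PACKETS = 700_000  # hypothetical packet count for a large download

default_acks = acks_generated(PACKETS, 2)   # default from the slide
delayed_acks = acks_generated(PACKETS, 10)  # hypothetical relaxed setting
print(default_acks, delayed_acks)
```

Each ACK avoided is one less packet to build, encrypt, and send from user space, at the cost of the sender learning about losses later.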
  23. Recommendations for Mitigation

    Multi-threaded download
    • Chromium uses a single thread for receiving network data
    • Using multi-threaded download can improve the performance
    • Parallel download using curl actually worked
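One common way to parallelize a download is to split the file into byte ranges and fetch them on separate threads. The sketch below shows only that splitting and reassembly logic. How the authors ran their parallel curl experiment is not described on the slide, so this is an assumption; `split_ranges`, `parallel_fetch`, and the `fetch` callback are hypothetical names. In real use, `fetch` would issue a request with a `Range: bytes=start-end` header.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Tuple


def split_ranges(total_bytes: int, parts: int) -> List[Tuple[int, int]]:
    """Split [0, total_bytes) into up to `parts` inclusive (start, end) ranges."""
    chunk = -(-total_bytes // parts)  # ceiling division
    return [(start, min(start + chunk, total_bytes) - 1)
            for start in range(0, total_bytes, chunk)]


def parallel_fetch(fetch: Callable[[int, int], bytes],
                   ranges: List[Tuple[int, int]]) -> bytes:
    """Fetch every range on its own thread and join the results in order."""
    with ThreadPoolExecutor(max_workers=len(ranges)) as pool:
        parts = list(pool.map(lambda r: fetch(r[0], r[1]), ranges))
    return b"".join(parts)


# Stand-in for a network fetch: slice an in-memory "file".
data = bytes(range(256)) * 4
result = parallel_fetch(lambda s, e: data[s:e + 1], split_ranges(len(data), 4))
print(result == data)
```

Spreading the ranges over several connections means packet processing is no longer funneled through a single receiving thread, which is the bottleneck this slide points at.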
  24. Conclusion

    • Performance Gap Exists
      ◦ HTTP/3 shows lower throughput and higher CPU usage than HTTP/2 in various scenarios, especially on the client side.
    • Root Causes Identified
      ◦ No kernel offloading for QUIC (UDP-based)
      ◦ Higher packet count and processing overhead
      ◦ User-space congestion control, retransmission handling, and acknowledgment
    • Multiple IETF drafts and implementations for mitigating these issues exist