of Large Language Models on Distributed Infrastructures: A Survey", arXiv, 2024.
• Qian Ding, "Transformers in SRE Land: Evolving to Manage AI Infrastructure", USENIX SREcon25 Americas, 2025.
• Deepak Narayanan, et al., "Efficient Large-Scale Language Model Training on GPU Clusters Using Megatron-LM", the International Conference for High Performance Computing, Networking, Storage, and Analysis (SC), 2021.
• Yusheng Zheng, et al., "Extending Applications Safely and Efficiently", USENIX OSDI, 2025.
• Yiwei Yang, et al., "eGPU: Extending eBPF Programmability and Observability to GPUs", Workshop on Heterogeneous Composable and Disaggregated Systems (HCDS), 2025.

4. Summary