This tutorial offers a comprehensive, hands-on introduction to LMCache, a high-performance KV cache management layer for distributed LLM inference. The morning session begins with an overview of distributed LLM inference systems and a one-click installation of LMCache, and then focuses on experiencing LMCache’s performance benefits by building agentic and retrieval-augmented generation (RAG) applications and visualizing the resulting speedups in Grafana. After lunch, Session B dives deeper into technical details such as KV cache sharing, disaggregated prefill [3], Mooncake storage backend integration [7], KV cache compression [1, 6], and multi-modality support. The afternoon concludes with sessions on autoscaling and vLLM integration, followed by an open Q&A and wrap-up.
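To give a flavor of the hands-on portion, the sketch below shows one way the installation and vLLM-integration steps might look. It is a minimal, illustrative example rather than the tutorial's actual material: the `pip install lmcache` step and the `KVTransferConfig` / `LMCacheConnectorV1` wiring follow the public LMCache quickstart at the time of writing and may differ across vLLM/LMCache versions, and the model name and prompts are placeholders.

```python
# Minimal sketch (not the tutorial's official material): routing vLLM's
# KV cache through LMCache so that a long shared prefix, once computed,
# can be reused across requests. Assumes `pip install vllm lmcache` and
# a recent vLLM release that ships the LMCacheConnectorV1 connector;
# exact names may differ across versions.
from vllm import LLM, SamplingParams
from vllm.config import KVTransferConfig

# Tell vLLM to store and load KV cache via LMCache.
kv_transfer_config = KVTransferConfig(
    kv_connector="LMCacheConnectorV1",
    kv_role="kv_both",  # this engine both produces and consumes KV cache
)

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",  # placeholder model
    kv_transfer_config=kv_transfer_config,
)

# Two prompts share a long document prefix; with LMCache enabled, the
# second request should reuse the cached KV tensors for that prefix
# instead of recomputing its prefill.
document = "<long shared document text>"
prompts = [
    document + "\n\nQuestion: summarize the document.",
    document + "\n\nQuestion: list the key entities mentioned.",
]
for output in llm.generate(prompts, SamplingParams(max_tokens=64)):
    print(output.outputs[0].text)
```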
Marcelo Amaral, Tatsuhiro Chiba, et al.
CLOUD 2022
Pranjal Gupta, Karan Bhukar, et al.
ICPE 2025
Abhishek Malvankar, Olivier Tardieu
KubeCon EU 2024
Darya Kaviani, Sijun Tan, et al.
RWC 2025