This tutorial offers a comprehensive, hands-on introduction to LMCache, a high-performance KV cache management layer for distributed LLM inference. The morning session begins with an overview of distributed LLM inference systems and a one-click installation of LMCache, and then focuses on experiencing LMCache’s performance benefits by building agentic and retrieval-augmented generation (RAG) applications and visualizing the resulting speedups in Grafana. After lunch, Session B dives deeper into technical details such as KV cache sharing, disaggregated prefill [3], Mooncake storage backend integration [7], KV cache compression [1, 6], and multi-modality support. The afternoon concludes with sessions on autoscaling and vLLM integration, followed by an open Q&A and wrap-up.
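To give a flavor of the hands-on portion, the sketch below shows one way the installation and vLLM-integration steps might look. It is a minimal, illustrative example rather than the tutorial's actual material: the `pip install lmcache` step and the `KVTransferConfig` / `LMCacheConnectorV1` wiring follow the public LMCache quickstart at the time of writing and may differ across vLLM/LMCache versions, and the model name and prompts are placeholders.

```python
# Minimal sketch (not the tutorial's official material): routing vLLM's
# KV cache through LMCache so that a long shared prefix, once computed,
# can be reused across requests. Assumes `pip install vllm lmcache` and
# a recent vLLM release that ships the LMCacheConnectorV1 connector;
# exact names may differ across versions.
from vllm import LLM, SamplingParams
from vllm.config import KVTransferConfig

# Tell vLLM to store and load KV cache via LMCache.
kv_transfer_config = KVTransferConfig(
    kv_connector="LMCacheConnectorV1",
    kv_role="kv_both",  # this engine both produces and consumes KV cache
)

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",  # placeholder model
    kv_transfer_config=kv_transfer_config,
)

# Two prompts share a long document prefix; with LMCache enabled, the
# second request should reuse the cached KV tensors for that prefix
# instead of recomputing its prefill.
document = "<long shared document text>"
prompts = [
    document + "\n\nQuestion: summarize the document.",
    document + "\n\nQuestion: list the key entities mentioned.",
]
for output in llm.generate(prompts, SamplingParams(max_tokens=64)):
    print(output.outputs[0].text)
```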
Marcelo Amaral, Tatsuhiro Chiba, et al.
CLOUD 2022
Pranjal Gupta, Karan Bhukar, et al.
ICPE 2025
Abhishek Malvankar, Olivier Tardieu
KubeCon EU 2024
Darya Kaviani, Sijun Tan, et al.
RWC 2025