🚀Meet LMCache – Your secret weapon for fast and cost-efficient LLM inference!
⚡ With 7× faster access to 100× more KV caches, LMCache accelerates #vLLM for faster multi-turn conversations and RAG.
Blog: lmcache.github.io/2024-09-17-rel…
Github: github.com/LMCache/LMCache
#LLM #LMCache #RAG
LMCache Lab
239 posts
🧪 Open-Source Team that maintains LMCache and Production Stack
🤖 Democratizing AI by providing efficient LLM serving for ALL
- 8 KV-Cache Systems You Can’t Afford to Miss in 2025: By 2025, KV-cache has evolved from a “nice-to-have” optimization into a critical layer for high-performance large language model (LLM) serving. From GPU-resident paging tricks to persistent, cross-node cache sharing, the
- Everyone is focused on faster LLM inference engines, but the bigger gains may lie beyond the engine. 🚀 The real frontier could be the orchestration layer above it. Replicating engines with Kubernetes is hitting a wall. We need stateful, LLM-native
- 1K Stars ⭐ for 𝘃𝗟𝗟𝗠 𝗣𝗿𝗼𝗱𝘂𝗰𝘁𝗶𝗼𝗻 𝗦𝘁𝗮𝗰𝗸! 🤝 We thank every contributor and user who has supported our journey in building an easy-to-use and high-performance serving stack for vLLM! We're thrilled to have reached this milestone. 😬 Been among the 𝘃𝗲𝗿𝘆
- 🚀 We're thrilled to announce vLLM Production Stack—an open-source, Enterprise-Grade LLM inference solution that is now an official first-party ecosystem project under vLLM! Why does this matter? A handful of companies focus on LLM training, but millions of apps and businesses
- 🚀 𝗟𝗠𝗖𝗮𝗰𝗵𝗲 Powers Up 𝘃𝗟𝗟𝗠 𝗩𝟭: P/D Disaggregation & NIXL Support! vLLM V1 revolutionized LLM serving, but lacked a dedicated KV cache interface for advanced optimizations... until NOW! ⚡ LMCache Lab is thrilled to announce two major updates enhancing vLLM V1's
- 🔥Meet the vLLM Official Production Stack🔥
  - ⚡️ 3x higher throughput & 3x faster response!
  - 🔧 Easy k8s deployment with Helm chart!
  - 📈 Observability dashboard!
  And it’s open-source under vllm-project! Code: github.com/vllm-project/p… Blog: blog.lmcache.ai/2025-01-21-sta… #LLM #vLLM #k8s
- 🚨 LMCache now turbocharges multimodal models in vLLM! By caching image-token KV pairs, repeated images now get ~100% cache hit rate — cutting latency from 18s to ~1s. Works out of the box. Check the blog: blog.lmcache.ai/2025-07-03-mul… Try it 👉 github.com/LMCache/LMCache #vLLM #MLLM
- 🚀 Big news from LMCache Lab! 📝 3 papers accepted at SOSP ’25 & NSDI ’26, pushing the frontier of LLM-inference efficiency: 1️⃣ Cross-agent KV-cache sharing (NSDI) 🔗 arxiv.org/abs/2411.02820 2️⃣ Custom design for LLM prefillers (SOSP) 🔗 arxiv.org/abs/2505.07203 3️⃣
- LMCache supports gpt-oss (20B/120B) on Day 1! TTFT 1.20s → 0.39s (-67.5%), finish time 15.70s → 7.73s (-50.7%) compared to vanilla vLLM. Unleash the true power of GPT-OSS with vLLM + LMCache -- full deployment tutorial here: blog.lmcache.ai/2025-08-05-gpt… #LMCache #vLLM #OpenAI #LLM
- Want to create your own LLM Inference Endpoint on Any Cloud in seconds? We're announcing the alpha release of LMIgnite, the one-click high-performance inference stack built for speed and scale. Powered by LMCache, vLLM, and vLLM Production Stack. 🤖 Join the alpha and
- You might know LMCache Lab for our KV cache optimizations that make LLM prefilling a breeze. But that’s not all! We’re now focused on speeding up decoding too—so your LLM agents can generate new content even faster. In other words: you can save on your LLM serving bills by
- CacheGen (arxiv.org/abs/2310.07240) lets you store KV caches on disk or AWS S3 and load them way faster than recomputing! Modern LLMs use long contexts, but reprocessing these every time is slow and resource-intensive. While engines like vLLM (and LMCache) can cache contexts in
- Amazing tool! Absolutely a game-changer for understanding open-source projects! @cognition_labs @silasalberti Find out more about LMCache and vLLM Production Stack on DeepWiki. 🚀 LMCache: deepwiki.com/LMCache/LMCache 🚀 vLLM Production Stack: deepwiki.com/vllm-project/p… #DeepWiki
  Quoted post: we built DeepWiki, a free encyclopedia of all GitHub repos. some numbers:
  - 30k repos already indexed
  - processed 4 billion+ lines of code
  - the indexing alone cost $300k+ in compute spend