Efficient and Economic Large Language Model Inference with Attention Offloading. arXiv:2405.01814, May 3, 2024.
Mooncake: A KVCache-centric Disaggregated Architecture for LLM Serving. arXiv:2407.00079, Jun 24, 2024.
RetroInfer: A Vector-Storage Approach for Scalable Long-Context LLM Inference. arXiv:2505.02922, May 5, 2025.