Preliminary Program – ASPLOS 2025

The following content is excerpted from ASPLOS 2025 | Awesome Papers (copyright remains with the original authors); I have extracted the parts that are of interest.

LLM Inference

  • Helix: Serving Large Language Models over Heterogeneous GPUs and Network via Max-Flow [arXiv] [Code]
    • CMU
  • Accelerating LLM Serving for Multi-turn Dialogues with Efficient Resource Management
    • Korea University
  • COMET: Towards Practical W4A4KV4 LLMs Serving
    • ICT, CAS
  • Past-Future Scheduler for LLM Serving under SLA Guarantees
    • Beihang University & SenseTime
  • POD-Attention: Unlocking Full Prefill-Decode Overlap for Faster LLM Inference
    • UW & MSR India
  • Medusa: Accelerating Serverless LLM Inference with Materialization
    • THU
  • vAttention: Dynamic Memory Management for Serving LLMs without PagedAttention [arXiv]
    • MSR India & IISc
  • TAPAS: Thermal- and Power-Aware Scheduling for LLM Inference in Cloud Platforms
    • UIUC & Microsoft Azure Research
  • PAPI: Exploiting Dynamic Parallelism in Large Language Model Decoding with a Processing-In-Memory-Enabled Computing System
    • ICT, CAS & ETH & UofT & NVIDIA
  • PIM is All You Need: A CXL-Enabled GPU-Free System for LLM Inference
    • UMich & ETH & Google
  • Fast On-device LLM Inference with NPUs
    • PKU & BUPT

LLM-based Applications

  • Towards End-to-End Optimization of LLM-based Applications with Ayo
    • CUHK

MoE Inference

  • MoE-Lightning: High-Throughput MoE Inference on Memory-constrained GPUs
    • UC Berkeley
  • Klotski: Efficient Mixture-of-Expert Inference via Expert-Aware Multi-Batch Pipeline
    • SYSU & HKUST & Huawei & Peng Cheng Laboratory

Retrieval-Augmented Generation (RAG)

  • Accelerating Retrieval-Augmented Generation
    • Cornell & Kansas & UMass Amherst & Samsung Electronics

Resource Management

  • Shared ML Clusters
    • Design and Operation of Shared Machine Learning Clusters on Campus
      • HKUST
  • Resource Oversubscription
    • Coach: Exploiting Temporal Patterns for All-Resource Oversubscription in Cloud Platforms
      • Microsoft
  • Serverless Computing
    • Litmus: Fair Pricing for Serverless Computing
      • Binghamton & Intel Labs
    • Concurrency-Informed Orchestration for Serverless Functions
      • UVA & Alibaba & Amazon
    • Dilu: Enabling GPU Resourcing-on-Demand for Serverless DL Serving via Introspective Elasticity
      • ICT, CAS
  • Graceful Degradation
    • Cooperative Graceful Degradation in Containerized Clouds [arXiv]
      • UC Irvine
  • Microservices
    • Embracing Imbalance: Dynamic Load Shifting among Microservice Containers in Shared Clusters
      • University of Macau
  • GPU Sharing
    • Tally: Non-Intrusive Performance Isolation for Concurrent Deep Learning Workloads [arXiv]
      • Stanford & UofT
