Preliminary Program – ASPLOS 2025

The following content is excerpted from ASPLOS 2025 | Awesome Papers (copyright remains with the original authors); I have extracted the parts that are of interest.

LLM Inference

  • Helix: Serving Large Language Models over Heterogeneous GPUs and Network via Max-Flow [arXiv] [Code]
    • CMU
  • Accelerating LLM Serving for Multi-turn Dialogues with Efficient Resource Management
    • Korea University
  • COMET: Towards Practical W4A4KV4 LLMs Serving
    • ICT, CAS
  • Past-Future Scheduler for LLM Serving under SLA Guarantees
    • Beihang University & SenseTime
  • POD-Attention: Unlocking Full Prefill-Decode Overlap for Faster LLM Inference
    • UW & MSR India
  • Medusa: Accelerating Serverless LLM Inference with Materialization
    • THU
  • vAttention: Dynamic Memory Management for Serving LLMs without PagedAttention [arXiv]
    • MSR India & IISc
  • TAPAS: Thermal- and Power-Aware Scheduling for LLM Inference in Cloud Platforms
    • UIUC & Microsoft Azure Research
  • PAPI: Exploiting Dynamic Parallelism in Large Language Model Decoding with a Processing-In-Memory-Enabled Computing System
    • ICT, CAS & ETH & UofT & NVIDIA
  • PIM is All You Need: A CXL-Enabled GPU-Free System for LLM Inference
    • UMich & ETH & Google
  • Fast On-device LLM Inference with NPUs
    • PKU & BUPT

LLM-based Applications

  • Towards End-to-End Optimization of LLM-based Applications with Ayo
    • CUHK

MoE Inference

  • MoE-Lightning: High-Throughput MoE Inference on Memory-constrained GPUs
    • UC Berkeley
  • Klotski: Efficient Mixture-of-Expert Inference via Expert-Aware Multi-Batch Pipeline
    • SYSU & HKUST & Huawei & Peng Cheng Laboratory

Retrieval-Augmented Generation (RAG)

  • Accelerating Retrieval-Augmented Generation
    • Cornell & Kansas & UMass Amherst & Samsung Electronics

Resource Management

  • Shared ML Clusters
    • Design and Operation of Shared Machine Learning Clusters on Campus
      • HKUST
  • Resource Oversubscription
    • Coach: Exploiting Temporal Patterns for All-Resource Oversubscription in Cloud Platforms
      • Microsoft
  • Serverless Computing
    • Litmus: Fair Pricing for Serverless Computing
      • Binghamton & Intel Labs
    • Concurrency-Informed Orchestration for Serverless Functions
      • UVA & Alibaba & Amazon
    • Dilu: Enabling GPU Resourcing-on-Demand for Serverless DL Serving via Introspective Elasticity
      • ICT, CAS
  • Graceful Degradation
    • Cooperative Graceful Degradation in Containerized Clouds [arXiv]
      • UC Irvine
  • Microservices
    • Embracing Imbalance: Dynamic Load Shifting among Microservice Containers in Shared Clusters
      • University of Macau
  • GPU Sharing
    • Tally: Non-Intrusive Performance Isolation for Concurrent Deep Learning Workloads [arXiv]
      • Stanford & UofT
