Preface
Introduction
Blogs
Resources
Drawing Tools for Scientific Figures
Paper Reading Notes
Machine Learning Systems
【ASPLOS 2024】【Currently Reading】SpecInfer: Accelerating Large Language Model Serving with Tree-based Speculative Inference and Verification
【SOSP 2024】【Currently Reading】Apparate: Rethinking Early Exits to Tame Latency-Throughput Tensions in ML Serving
【ASPLOS 2024】【Currently Reading】ExeGPT: Constraint-Aware Resource Scheduling for LLM Inference
【OSDI 2024】Llumnix: Dynamic Scheduling for Large Language Model Serving
【Arxiv 2024】【Currently Reading】Mooncake: A KVCache-centric Disaggregated Architecture for LLM Serving
【NSDI 2023】【Currently Reading】Shockwave: Fair and Efficient Cluster Scheduling for Dynamic Adaptation in Machine Learning
【ICML 2023】FlexGen: High-Throughput Generative Inference of Large Language Models with a Single GPU
【OSDI 2022】Orca: A Distributed Serving System for Transformer-Based Generative Models
【SOSP 2023】Efficient Memory Management for Large Language Model Serving with PagedAttention (vLLM)
【OSDI 2020】Heterogeneity-Aware Cluster Scheduling Policies for Deep Learning Workloads (Gavel)
【Arxiv 2023】S-LoRA: Serving Thousands of Concurrent LoRA Adapters
【SC 2022】CoGNN: efficient scheduling for concurrent GNN training on GPUs
High Performance Computing
【SC 2023】Automated Mapping of Task-Based Programs onto Distributed and Heterogeneous Machines
Graph
【SIGMOD 2020】GPU-Accelerated Subgraph Enumeration on Partitioned Graphs
【SC 2022】VSGM: View-Based GPU-Accelerated Subgraph Matching on Large Graphs
【ICDE 2023】Efficient Multi-GPU Graph Processing with Remote Work Stealing
Study Notes
Machine Learning Systems
vLLM
vLLM Scheduling
Detailed Explanation of the vLLM Block Mechanism
vLLM GPU PagedAttention
vLLM CPU PagedAttention
vLLM Cache Engine
vLLM LoRA
vLLM Async LLM
vLLM Chunked Prefill
vLLM Ray
vLLM Profiling
vLLM Llama
vLLM Metadata
ML Refresher / Softmax Regression
Automatic Differentiation
Llama 2 Deployment Notes
LLM Inference Series
Llama Model Decoder Computation
Large Model Training: Pipeline Parallelism
Large Model Training: Data Parallelism
Large Model Training: Tensor Parallelism
FLOPs Counting
【In Progress】Llumnix Code Walkthrough
【In Progress】Notes on speculative_decode
NLP
How to Implement a Neural Network from Scratch
The Most Detailed Explanation of Recurrent Neural Networks Ever (RNN/LSTM/GRU)
A Complete Illustrated Guide to RNNs, RNN Variants, Seq2Seq, and the Attention Mechanism
Implementing a Recurrent Neural Network from Scratch (Without a Framework)
Loss Functions
Autoregressive Models and GPT
Resource Scheduling
Course Notes on Novel Computer System Design and Performance Optimization (AI4Sys)
CME213
C++ Notes
CUDA Notes for HW3
MIT 6.172
Introduction and Matrix Multiplication
Bentley Rules for Optimizing Work
Bit Hacks
Parallel Storage Allocation
HW: Profiling Serial MergeSort
CUDA
Chapter 5 of the CUDA C++ Programming Guide
CUDA Partition Camping
CUDA Warp Level
OpenMP
OpenMP Notes
PyTorch
torch.multiprocessing