vLLM cache

The cache is obtained through the following call path:

cache engine -> backend -> paged_attention

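Cache_Engine

On GPU, the cache_engine has a gpu_engine and a cpu_engine.

The two have the same size, and each is a List[torch.Tensor].
The list index is the layer index, so the list has num_layers entries.
Each tensor has shape kv_cache_shape, which is obtained from paged_attention: (2, num_blocks, block_size * num_kv_heads * head_size)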
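
A minimal sketch of the layout described above, in plain Python so it runs without torch. The helper names (kv_cache_shape, per_layer_shapes, cache_bytes) and the example sizes are assumptions made for illustration and are not vLLM's API. It only computes the per-layer shapes and the memory they would take.

```python
from math import prod
from typing import List, Tuple

Shape = Tuple[int, int, int]


def kv_cache_shape(num_blocks: int, block_size: int,
                   num_kv_heads: int, head_size: int) -> Shape:
    # Hypothetical helper mirroring the shape stated in the notes:
    # (2, num_blocks, block_size * num_kv_heads * head_size)
    return (2, num_blocks, block_size * num_kv_heads * head_size)


def per_layer_shapes(num_layers: int, num_blocks: int, block_size: int,
                     num_kv_heads: int, head_size: int) -> List[Shape]:
    # One entry per layer; the list index is the layer index.
    shape = kv_cache_shape(num_blocks, block_size, num_kv_heads, head_size)
    return [shape for _ in range(num_layers)]


def cache_bytes(shapes: List[Shape], dtype_bytes: int) -> int:
    # Total memory if every element takes dtype_bytes (e.g. 2 for float16).
    return sum(prod(s) for s in shapes) * dtype_bytes


if __name__ == "__main__":
    # Example sizes are made up for illustration.
    shapes = per_layer_shapes(num_layers=2, num_blocks=4, block_size=16,
                              num_kv_heads=8, head_size=128)
    print(shapes[0], cache_bytes(shapes, dtype_bytes=2))
```

With real tensors, each list entry would be a torch.Tensor of that shape, allocated once for the GPU side and once for the CPU side.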

Backend

query: shape = [num_tokens, num_heads * head_size]
key: shape = [num_tokens, num_kv_heads * head_size]
value: shape = [num_tokens, num_kv_heads * head_size]
kv_cache: shape = [2, num_blocks, block_size * num_kv_heads * head_size]
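
The relationship between these shapes can be checked with a small sketch. The function names and example sizes below are hypothetical, chosen only for illustration. The point it shows is that one cache block is exactly block_size key (or value) rows wide, and that the leading dimension of 2 is assumed here to separate the key cache from the value cache.

```python
from typing import Dict, Tuple


def backend_shapes(num_tokens: int, num_heads: int, num_kv_heads: int,
                   head_size: int, num_blocks: int,
                   block_size: int) -> Dict[str, Tuple[int, ...]]:
    # Shapes as listed in the notes, with heads flattened into the last dim.
    return {
        "query": (num_tokens, num_heads * head_size),
        "key": (num_tokens, num_kv_heads * head_size),
        "value": (num_tokens, num_kv_heads * head_size),
        "kv_cache": (2, num_blocks, block_size * num_kv_heads * head_size),
    }


def tokens_per_block(shapes: Dict[str, Tuple[int, ...]]) -> int:
    # A block's flat width divided by one token's key width gives block_size.
    block_width = shapes["kv_cache"][2]
    token_width = shapes["key"][1]
    assert block_width % token_width == 0
    return block_width // token_width


def split_kv_cache_shape(kv_cache_shape: Tuple[int, ...]):
    # Assumption: index 0 of the first dim holds keys, index 1 holds values.
    _, num_blocks, flat = kv_cache_shape
    return (num_blocks, flat), (num_blocks, flat)


if __name__ == "__main__":
    s = backend_shapes(num_tokens=3, num_heads=32, num_kv_heads=8,
                       head_size=128, num_blocks=4, block_size=16)
    print(s, tokens_per_block(s), split_kv_cache_shape(s["kv_cache"]))
```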

Details of the copy:

[ ] liner
