
Keep the Cost Down: A Review on Methods to Optimize LLM’s KV-Cache Consumption

Original paper:
Shi, L., Zhang, H., Yao, Y., Li, Z., & Zhao, H. (2024). Keep the Cost Down: A Review on Methods to Optimize LLM's KV-Cache Consumption. arXiv preprint arXiv:2407.18003.
https://arxiv.org/abs/2407.18003

Presented in Seminar 2024/1 at Chulalongkorn University.

Kamolphan Liwprasert

August 19, 2024

Transcript

  1. Keep the Cost Down: A Review on Methods to Optimize LLM's KV-Cache Consumption
     Presented by Kamolphan Liwprasert, 2024-08-19
     Shi, L., Zhang, H., Yao, Y., Li, Z., & Zhao, H. (2024). Keep the Cost Down: A Review on Methods to Optimize LLM's KV-Cache Consumption (arXiv:2407.18003). arXiv. https://doi.org/10.48550/arXiv.2407.18003
  2. Background
     • LLMs are now widely used, but their efficiency is challenged by the Transformer architecture's difficulty with long texts.
     • KV-Cache has emerged as a pivotal solution to this issue.
       ✅ Reduces the time complexity of token generation from quadratic to linear.
       ❌ Adds GPU memory overhead that grows with conversation length.
  3. Goals
     1. Optimizing the KV-Cache space usage of LLMs across the pre-training, deployment, and inference phases.
     2. Reviewing the landscape of LLM optimization.
     3. Metrics for evaluating the long-text capabilities of large language models, from both efficiency and capability perspectives.
  4. Challenges of LLM
     🔥 The decoder-only Transformer architecture has quadratic time complexity when processing text sequences.
     🔥 During inference, the auto-regressive decoding mechanism amplifies this issue, because the process is repeated for every generated token.
  5. What is KV-Cache?
     By storing the key and value tensors that past tokens produce in the attention module, KV-Cache reduces the time complexity of generating each token to linear, greatly improving inference efficiency. KV-Cache is a mechanism that leverages the causal masking property of MHA (Multi-Head Attention) to store and reuse intermediate computations, thereby optimizing the efficiency of LLMs, especially on long sequences (a minimal sketch follows).
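     The following is a minimal, hypothetical sketch of a single-head causal attention decode step that reuses a KV-Cache (PyTorch; shapes and names are illustrative, not the paper's code):

        import torch

        d_model = 64
        W_q = torch.randn(d_model, d_model)
        W_k = torch.randn(d_model, d_model)
        W_v = torch.randn(d_model, d_model)

        def decode_step(x_t, cache):
            # x_t: (1, d_model) hidden state of the newest token.
            q = x_t @ W_q                                  # query for the new token only
            # Append this token's K/V instead of recomputing all past K/V.
            cache["k"] = torch.cat([cache["k"], x_t @ W_k], dim=0)
            cache["v"] = torch.cat([cache["v"], x_t @ W_v], dim=0)
            # Causal masking is implicit: the cache holds only past and current tokens.
            attn = torch.softmax(q @ cache["k"].T / d_model ** 0.5, dim=-1)
            return attn @ cache["v"], cache

        cache = {"k": torch.empty(0, d_model), "v": torch.empty(0, d_model)}
        for _ in range(5):                                 # generate 5 tokens
            out, cache = decode_step(torch.randn(1, d_model), cache)
        print(cache["k"].shape)                            # torch.Size([5, 64]): grows with length

     Only one query is computed per step, while the cached K/V of all previous tokens are reused, which is what brings the per-token cost down to linear.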
  6. How does KV-Cache work?
     1. Token production: each token produces intermediate K and V tensors.
     2. Token generation: when generating subsequent tokens, the KV tensors of preceding tokens are required to compute self-attention.
     3. KV caching: these K and V tensors are cached on the GPU, which is known as the KV cache (see the decoding-loop sketch below).
     Gao, B., He, Z., Sharma, P., Kang, Q., Jevdjic, D., Deng, J., ... & Zuo, P. (2024). AttentionStore: Cost-effective Attention Reuse across Multi-turn Conversations in Large Language Model Serving. arXiv preprint arXiv:2403.19708.
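     To make steps 1-3 concrete, here is a hedged sketch of a manual greedy decoding loop that reuses past_key_values, assuming the Hugging Face Transformers API for a causal LM; the model choice (gpt2) and the prompt are only illustrative:

        import torch
        from transformers import AutoModelForCausalLM, AutoTokenizer

        tok = AutoTokenizer.from_pretrained("gpt2")
        model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

        input_ids = tok("The KV cache stores", return_tensors="pt").input_ids
        past, generated = None, []
        with torch.no_grad():
            for _ in range(20):
                # Step 2: past K/V come from the cache, so only the newest token is fed in.
                out = model(input_ids=input_ids, past_key_values=past, use_cache=True)
                past = out.past_key_values             # step 3: cached K/V, reused next step
                next_id = out.logits[:, -1].argmax(dim=-1, keepdim=True)
                generated.append(next_id)
                input_ids = next_id
        print(tok.decode(torch.cat(generated, dim=-1)[0]))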
  7. Challenges of KV-Cache
     🔥 The KV-Cache grows linearly with sequence length, so the memory required keeps increasing, especially for giant models like GPT-3 (a back-of-the-envelope calculation follows).
     🔥 It is also hard to reuse the cache across identical dialogues, and the cache becomes a bottleneck because GPU memory bandwidth is low compared to GPU compute.
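     A rough calculation of the linear growth, assuming GPT-3 175B dimensions (96 layers, hidden size 12288), fp16 storage, and full multi-head attention with no KV-head sharing:

        # Back-of-the-envelope KV-Cache size versus sequence length.
        n_layers, d_model, bytes_per_elem = 96, 12288, 2
        per_token = 2 * n_layers * d_model * bytes_per_elem       # K and V at every layer
        print(round(per_token / 2**20, 2), "MiB per token")        # ~4.5 MiB
        for seq_len in (1_024, 4_096, 16_384):
            print(f"{seq_len:6d} tokens -> {per_token * seq_len / 2**30:5.1f} GiB")

     Under these assumptions a single 16k-token conversation holds roughly 72 GiB of cache, which is why KV-Cache optimization matters at this scale.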
  8. Benefits of optimizing KV-Cache
     ✅ Reducing memory usage, which leads to cost reduction and lower energy consumption
     ✅ Improving LLM serving efficiency
     ✅ Enhancing LLMs' performance with longer contexts
  9. Optimization: 3 Main Stages
     Training Stage
     • Most effective
     • Architecture changes
     • Not suitable for modifying existing models
     • Not suitable for low computational power
     Deployment Stage
     • Optimizing KV-Cache
     Post-Training Stage
     • Eviction
     • Quantization (a minimal sketch follows)
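     A hedged sketch of the basic idea behind post-training KV-Cache quantization: store K/V in int8 with a per-token scale and dequantize on read. This is illustrative only; published methods use finer-grained per-channel or per-group schemes.

        import torch

        def quantize_kv(x):                        # x: (seq_len, head_dim) K or V tensor
            scale = x.abs().amax(dim=-1, keepdim=True) / 127.0 + 1e-8
            q = torch.clamp((x / scale).round(), -127, 127).to(torch.int8)
            return q, scale                        # int8 cache plus one fp scale per token

        def dequantize_kv(q, scale):
            return q.float() * scale

        k = torch.randn(1024, 128)                 # K cache for 1024 tokens, head_dim 128
        q_k, s = quantize_kv(k)
        ratio = (q_k.element_size() * q_k.nelement()) / (k.element_size() * k.nelement())
        print(ratio)                               # 0.25: int8 vs fp32, ignoring the small scales
        print((dequantize_kv(q_k, s) - k).abs().max())   # worst-case reconstruction error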
  10. Training Stage
      • The most effective class of KV-Cache compression methods.
      • Related to LLM architecture changes.
      • Cannot be applied to existing, already-trained models.
      MHA (Multi-Head Attention), MQA (Multi-Query Attention), GQA (Grouped-Query Attention)
  11. Comparison between MHA, MQA and GQA
      MHA (Multi-Head Attention): every query head has its own K/V head.
      MQA (Multi-Query Attention): all query heads share a single K/V head.
      GQA (Grouped-Query Attention): query heads are split into groups, each sharing one K/V head; a middle ground between MHA and MQA (cache-size comparison below).
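      The per-token cache size scales directly with the number of KV heads. A small comparison, assuming LLaMA-2-7B-like dimensions (32 layers, 32 query heads, head_dim 128, fp16); the GQA group count of 8 is only an example:

         def kv_bytes_per_token(n_kv_heads, n_layers=32, head_dim=128, bytes_per_elem=2):
             return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem   # K and V per layer

         for name, n_kv in [("MHA (32 KV heads)", 32), ("GQA (8 KV heads)", 8), ("MQA (1 KV head)", 1)]:
             print(f"{name:18s} {kv_bytes_per_token(n_kv) / 2**10:6.0f} KiB per token")
         # MHA 512 KiB, GQA 128 KiB, MQA 16 KiB: sharing KV heads shrinks the cache proportionally.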
  12. Deployment Stage
      • PagedAttention (Kwon et al., 2023): the vLLM framework for high-performance LLM serving (a simplified paged-cache sketch follows).
      • DistAttention / DistKV-LLM (Lin et al., 2024): enables distributed deployment of the KV-Cache across multiple servers, significantly improving the efficiency of serving LLMs from large-scale cloud servers.
      • ChunkAttention (Ye et al., 2024): avoids repeated computation over shared prefix tokens in the pre-fill stage, speeding up the response of the serving system.
      • InfLLM (Jin et al., 2024): allows large models to reach near-infinite context without additional training, using very little additional KV-Cache.
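      A much-simplified sketch of the paged-cache idea behind PagedAttention/vLLM: the cache is split into fixed-size blocks, and a per-sequence block table maps logical token positions to physical blocks, so memory is allocated on demand and freed at block granularity. Class and method names are hypothetical:

         BLOCK_SIZE = 16                                      # tokens per physical block

         class PagedKVCache:
             def __init__(self, num_blocks):
                 self.free_blocks = list(range(num_blocks))   # pool of physical block ids
                 self.block_tables = {}                       # seq_id -> [physical block ids]
                 self.seq_lens = {}                           # seq_id -> tokens written so far

             def append_token(self, seq_id):
                 """Reserve cache space for one more token of sequence `seq_id`."""
                 table = self.block_tables.setdefault(seq_id, [])
                 n = self.seq_lens.get(seq_id, 0)
                 if n % BLOCK_SIZE == 0:                      # current block full: grab a new one
                     table.append(self.free_blocks.pop())
                 self.seq_lens[seq_id] = n + 1
                 return table[n // BLOCK_SIZE], n % BLOCK_SIZE   # where this token's K/V live

             def free_sequence(self, seq_id):
                 self.free_blocks.extend(self.block_tables.pop(seq_id, []))
                 self.seq_lens.pop(seq_id, None)

         cache = PagedKVCache(num_blocks=8)
         for _ in range(20):                                  # a 20-token sequence needs 2 blocks
             cache.append_token("seq-0")
         print(cache.block_tables["seq-0"])                   # e.g. [7, 6]: blocks need not be contiguous
         cache.free_sequence("seq-0")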
  13. Post-Training Stage
      Eviction methods are policies for discarding unnecessary tokens. Two lines of approaches exist: static policies, which are designed before inference and remain consistent across every inference request, and dynamic policies, which use information generated during inference to identify important tokens (a static-policy sketch follows).
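      A hedged sketch of a simple static eviction policy in the spirit of "attention sink + recent window" approaches: always keep the first few tokens and the most recent ones, and evict the middle. The budget sizes are arbitrary; dynamic policies would instead rank tokens using attention statistics gathered during inference.

         import torch

         def evict(kv, n_sink=4, n_recent=252):
             """kv: (seq_len, head_dim) cached K or V tensor. Returns the compressed cache."""
             if kv.shape[0] <= n_sink + n_recent:
                 return kv                                   # under budget: keep everything
             return torch.cat([kv[:n_sink], kv[-n_recent:]], dim=0)

         k_cache = torch.randn(4096, 128)                    # 4096 cached tokens, head_dim 128
         print(evict(k_cache).shape)                         # torch.Size([256, 128])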
  14. Datasets
      • LongBench (Bai et al., 2023): the first bilingual (English and Chinese) multitask benchmark for long-context understanding. It consists of 21 datasets across 6 task categories: single-document QA, multi-document QA, summarization, few-shot learning, synthetic tasks, and code completion.
      • Passkey retrieval (Mohtashami & Jaggi, 2023): models are required to retrieve a random passkey hidden in a long document.
      • Needle in a Haystack (Kuratov et al., 2024): they proposed BABILong, a benchmark designed to assess model capabilities in extracting and processing distributed facts within extensive texts. BABILong hides algorithmically generated question-answering and reasoning problems inside a corpus of book texts and consists of 20 tasks designed to evaluate basic aspects of reasoning.
      • Few-shot testing (Brown et al., 2020): long inputs can be built in a few-shot format or by simulating multi-turn dialogues, in order to test the model's capabilities with long texts. Furthermore, for some reasoning-type tests, the Chain-of-Thought (CoT) strategy proposed in Wei et al. (2022) can be adopted to further increase the length of the few-shot texts.
  15. Evaluation Metrics
      • Per-token GPU memory usage: for the KV-Cache, the most intuitive optimization indicator is the memory space occupied by each token. The LLaMA2-7B model, as a typical example, theoretically occupies 0.5 MB of KV-Cache memory per token (worked check below).
      • Throughput and latency: throughput, usually measured in tokens per second (token/s), is how many new tokens the model can generate per second. In the decoding phase, latency is usually the time required to generate each new token, typically measured in milliseconds.
      • Perplexity (PPL): for each token, the model takes the natural log-likelihood of that token under the distribution predicted from its preceding tokens; these values are averaged (ANLL, the average natural log-likelihood) and PPL = exp(-ANLL). PPL provides a rough reference for changes in model performance: if PPL rises sharply, it usually means the model's ability has significantly degraded, for example losing language ability entirely.
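      A worked check of the 0.5 MB figure and the PPL formula (LLaMA2-7B dimensions of 32 layers, 32 KV heads, head_dim 128 and fp16 storage are assumed; the token probabilities are made up for illustration):

         import math

         per_token = 2 * 32 * 32 * 128 * 2                 # K and V, all layers, fp16
         print(per_token / 2**20, "MiB per token;", per_token * 4096 / 2**30, "GiB at 4096 tokens")

         # PPL = exp(-ANLL), where ANLL is the mean natural log-likelihood per token.
         token_probs = [0.25, 0.10, 0.40, 0.05]
         anll = sum(math.log(p) for p in token_probs) / len(token_probs)
         print(round(math.exp(-anll), 2))                  # 6.69; lower is better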
  16. Key Takeaways
      • Principles of KV-Cache optimization: the main goal is to reduce memory usage by compressing the Keys and Values in the KV pairs.
      • Trade-offs in deletion vs. compression: there is a trade-off between deleting less important KV pairs to save memory and compressing the entire KV-Cache without deletion. The former may impact model performance, while the latter focuses on retaining information.
      • Extremes in KV-Cache management: a potential future direction is to store the KV-Cache externally, turning KV-Cache management into a retrieval task.
      • Future directions in storage and retrieval technologies: the future of LLMs will likely see storage and retrieval technologies becoming as important as the computational models themselves, opening new possibilities for LLM efficiency and versatility.
  17. Reference
      Shi, L., Zhang, H., Yao, Y., Li, Z., & Zhao, H. (2024). Keep the Cost Down: A Review on Methods to Optimize LLM's KV-Cache Consumption (arXiv:2407.18003). arXiv. https://doi.org/10.48550/arXiv.2407.18003