
Keep the Cost Down: A Review on Methods to Optimize LLM’s KV-Cache Consumption

Original paper:
Shi, L., Zhang, H., Yao, Y., Li, Z., & Zhao, H. (2024). Keep the Cost Down: A Review on Methods to Optimize LLM's KV-Cache Consumption. arXiv preprint arXiv:2407.18003.
https://arxiv.org/abs/2407.18003

Presented in Seminar 2024/1 at Chulalongkorn University.

Kamolphan Liwprasert

August 19, 2024

Transcript

  1. Keep the Cost Down: A Review on Methods to Optimize LLM's KV-Cache Consumption
     Presented by Kamolphan Liwprasert, 2024-08-19
     Shi, L., Zhang, H., Yao, Y., Li, Z., & Zhao, H. (2024). Keep the Cost Down: A Review on Methods to Optimize LLM's KV-Cache Consumption (arXiv:2407.18003). arXiv. https://doi.org/10.48550/arXiv.2407.18003
  2. Background
     • LLMs are now widely used, but their efficiency is challenged by the Transformer architecture's difficulty with long texts.
     • KV-Cache has emerged as a pivotal solution to this issue.
       ✅ Reduces the time complexity of token generation from quadratic to linear.
       ❌ Adds GPU memory overhead that grows with conversation length.
  3. Goals
     1. Optimizing the KV-Cache space usage of LLMs across the pre-training, deployment, and inference phases.
     2. Reviewing the landscape of LLM optimization.
     3. Metrics for evaluating the long-text capabilities of large language models, from both efficiency and capability perspectives.
  4. Challenges of LLM
     🔥 The decoder-only Transformer architecture has quadratic time complexity when processing text sequences.
     🔥 During inference, the auto-regressive decoding mechanism amplifies this issue, because the process is repeated for every generated token.
  5. What is KV-Cache?
     By storing the key and value tensors that past tokens produce in the attention module, KV-Cache reduces the time complexity of generating each token to linear, greatly improving inference efficiency. KV-Cache is a mechanism that leverages the causal masking property of MHA (Multi-Head Attention) to store and reuse intermediate computations, thereby optimizing the efficiency of LLMs, especially on long sequences (a minimal sketch follows).
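     The following is a minimal, hypothetical sketch of a single-head causal attention decode step that reuses a KV-Cache (PyTorch; shapes and names are illustrative, not the paper's code):

        import torch

        d_model = 64
        W_q = torch.randn(d_model, d_model)
        W_k = torch.randn(d_model, d_model)
        W_v = torch.randn(d_model, d_model)

        def decode_step(x_t, cache):
            # x_t: (1, d_model) hidden state of the newest token.
            q = x_t @ W_q                                  # query for the new token only
            # Append this token's K/V instead of recomputing all past K/V.
            cache["k"] = torch.cat([cache["k"], x_t @ W_k], dim=0)
            cache["v"] = torch.cat([cache["v"], x_t @ W_v], dim=0)
            # Causal masking is implicit: the cache holds only past and current tokens.
            attn = torch.softmax(q @ cache["k"].T / d_model ** 0.5, dim=-1)
            return attn @ cache["v"], cache

        cache = {"k": torch.empty(0, d_model), "v": torch.empty(0, d_model)}
        for _ in range(5):                                 # generate 5 tokens
            out, cache = decode_step(torch.randn(1, d_model), cache)
        print(cache["k"].shape)                            # torch.Size([5, 64]): grows with length

     Only one query is computed per step, while the cached K/V of all previous tokens are reused, which is what brings the per-token cost down to linear.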
  6. How does KV-Cache work?
     1. Token production: each token produces intermediate K and V tensors.
     2. Token generation: when generating subsequent tokens, the KV tensors of preceding tokens are required to compute self-attention.
     3. KV caching: these K and V tensors are cached on the GPU, which is known as the KV cache (see the decoding-loop sketch below).
     Gao, B., He, Z., Sharma, P., Kang, Q., Jevdjic, D., Deng, J., ... & Zuo, P. (2024). AttentionStore: Cost-effective Attention Reuse across Multi-turn Conversations in Large Language Model Serving. arXiv preprint arXiv:2403.19708.
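     To make steps 1-3 concrete, here is a hedged sketch of a manual greedy decoding loop that reuses past_key_values, assuming the Hugging Face Transformers API for a causal LM; the model choice (gpt2) and the prompt are only illustrative:

        import torch
        from transformers import AutoModelForCausalLM, AutoTokenizer

        tok = AutoTokenizer.from_pretrained("gpt2")
        model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

        input_ids = tok("The KV cache stores", return_tensors="pt").input_ids
        past, generated = None, []
        with torch.no_grad():
            for _ in range(20):
                # Step 2: past K/V come from the cache, so only the newest token is fed in.
                out = model(input_ids=input_ids, past_key_values=past, use_cache=True)
                past = out.past_key_values             # step 3: cached K/V, reused next step
                next_id = out.logits[:, -1].argmax(dim=-1, keepdim=True)
                generated.append(next_id)
                input_ids = next_id
        print(tok.decode(torch.cat(generated, dim=-1)[0]))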
  7. Challenges of KV-Cache
     🔥 The KV-Cache grows linearly with sequence length, so the memory required keeps increasing, especially for giant models like GPT-3 (a back-of-the-envelope calculation follows).
     🔥 It is also hard to reuse the cache across identical dialogues, and the cache becomes a bottleneck because GPU memory bandwidth is low compared to GPU compute.
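     A rough calculation of the linear growth, assuming GPT-3 175B dimensions (96 layers, hidden size 12288), fp16 storage, and full multi-head attention with no KV-head sharing:

        # Back-of-the-envelope KV-Cache size versus sequence length.
        n_layers, d_model, bytes_per_elem = 96, 12288, 2
        per_token = 2 * n_layers * d_model * bytes_per_elem       # K and V at every layer
        print(round(per_token / 2**20, 2), "MiB per token")        # ~4.5 MiB
        for seq_len in (1_024, 4_096, 16_384):
            print(f"{seq_len:6d} tokens -> {per_token * seq_len / 2**30:5.1f} GiB")

     Under these assumptions a single 16k-token conversation holds roughly 72 GiB of cache, which is why KV-Cache optimization matters at this scale.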
  8. Benefits of optimizing KV-Cache
     ✅ Reducing memory usage, which leads to cost reduction and lower energy consumption
     ✅ Improving LLM serving efficiency
     ✅ Enhancing LLMs' performance with longer contexts
  9. Optimization: 3 Main Stages
     Training Stage
     • Most effective
     • Architecture changes
     • Not suitable for modifying existing models
     • Not suitable for low computational power
     Deployment Stage
     • Optimizing KV-Cache
     Post-Training Stage
     • Eviction
     • Quantization (a minimal sketch follows)
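     A hedged sketch of the basic idea behind post-training KV-Cache quantization: store K/V in int8 with a per-token scale and dequantize on read. This is illustrative only; published methods use finer-grained per-channel or per-group schemes.

        import torch

        def quantize_kv(x):                        # x: (seq_len, head_dim) K or V tensor
            scale = x.abs().amax(dim=-1, keepdim=True) / 127.0 + 1e-8
            q = torch.clamp((x / scale).round(), -127, 127).to(torch.int8)
            return q, scale                        # int8 cache plus one fp scale per token

        def dequantize_kv(q, scale):
            return q.float() * scale

        k = torch.randn(1024, 128)                 # K cache for 1024 tokens, head_dim 128
        q_k, s = quantize_kv(k)
        ratio = (q_k.element_size() * q_k.nelement()) / (k.element_size() * k.nelement())
        print(ratio)                               # 0.25: int8 vs fp32, ignoring the small scales
        print((dequantize_kv(q_k, s) - k).abs().max())   # worst-case reconstruction error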
  10. Training Stage
      • The most effective class of KV-Cache compression methods.
      • Related to LLM architecture changes.
      • Cannot be applied to existing, already-trained models.
      MHA (Multi-Head Attention), MQA (Multi-Query Attention), GQA (Grouped-Query Attention)
  11. Comparison between MHA, MQA and GQA
      MHA (Multi-Head Attention): every query head has its own K/V head.
      MQA (Multi-Query Attention): all query heads share a single K/V head.
      GQA (Grouped-Query Attention): query heads are split into groups, each sharing one K/V head; a middle ground between MHA and MQA (cache-size comparison below).
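      The per-token cache size scales directly with the number of KV heads. A small comparison, assuming LLaMA-2-7B-like dimensions (32 layers, 32 query heads, head_dim 128, fp16); the GQA group count of 8 is only an example:

         def kv_bytes_per_token(n_kv_heads, n_layers=32, head_dim=128, bytes_per_elem=2):
             return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem   # K and V per layer

         for name, n_kv in [("MHA (32 KV heads)", 32), ("GQA (8 KV heads)", 8), ("MQA (1 KV head)", 1)]:
             print(f"{name:18s} {kv_bytes_per_token(n_kv) / 2**10:6.0f} KiB per token")
         # MHA 512 KiB, GQA 128 KiB, MQA 16 KiB: sharing KV heads shrinks the cache proportionally.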
  12. Deployment Stage
      • PagedAttention (Kwon et al., 2023): the vLLM framework for high-performance LLM serving (a simplified paged-cache sketch follows).
      • DistAttention / DistKV-LLM (Lin et al., 2024): enables distributed deployment of the KV-Cache across multiple servers, significantly improving the efficiency of serving LLMs from large-scale cloud servers.
      • ChunkAttention (Ye et al., 2024): avoids repeated computation over shared prefix tokens in the pre-fill stage, speeding up the response of the serving system.
      • InfLLM (Jin et al., 2024): allows large models to reach near-infinite context without additional training, using very little additional KV-Cache.
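      A much-simplified sketch of the paged-cache idea behind PagedAttention/vLLM: the cache is split into fixed-size blocks, and a per-sequence block table maps logical token positions to physical blocks, so memory is allocated on demand and freed at block granularity. Class and method names are hypothetical:

         BLOCK_SIZE = 16                                      # tokens per physical block

         class PagedKVCache:
             def __init__(self, num_blocks):
                 self.free_blocks = list(range(num_blocks))   # pool of physical block ids
                 self.block_tables = {}                       # seq_id -> [physical block ids]
                 self.seq_lens = {}                           # seq_id -> tokens written so far

             def append_token(self, seq_id):
                 """Reserve cache space for one more token of sequence `seq_id`."""
                 table = self.block_tables.setdefault(seq_id, [])
                 n = self.seq_lens.get(seq_id, 0)
                 if n % BLOCK_SIZE == 0:                      # current block full: grab a new one
                     table.append(self.free_blocks.pop())
                 self.seq_lens[seq_id] = n + 1
                 return table[n // BLOCK_SIZE], n % BLOCK_SIZE   # where this token's K/V live

             def free_sequence(self, seq_id):
                 self.free_blocks.extend(self.block_tables.pop(seq_id, []))
                 self.seq_lens.pop(seq_id, None)

         cache = PagedKVCache(num_blocks=8)
         for _ in range(20):                                  # a 20-token sequence needs 2 blocks
             cache.append_token("seq-0")
         print(cache.block_tables["seq-0"])                   # e.g. [7, 6]: blocks need not be contiguous
         cache.free_sequence("seq-0")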
  13. Post-Training Stage
      Eviction methods are policies for discarding unnecessary tokens. Two lines of approaches exist: static policies, which are designed before inference and remain consistent across every inference request, and dynamic policies, which use information generated during inference to identify important tokens (a static-policy sketch follows).
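      A hedged sketch of a simple static eviction policy in the spirit of "attention sink + recent window" approaches: always keep the first few tokens and the most recent ones, and evict the middle. The budget sizes are arbitrary; dynamic policies would instead rank tokens using attention statistics gathered during inference.

         import torch

         def evict(kv, n_sink=4, n_recent=252):
             """kv: (seq_len, head_dim) cached K or V tensor. Returns the compressed cache."""
             if kv.shape[0] <= n_sink + n_recent:
                 return kv                                   # under budget: keep everything
             return torch.cat([kv[:n_sink], kv[-n_recent:]], dim=0)

         k_cache = torch.randn(4096, 128)                    # 4096 cached tokens, head_dim 128
         print(evict(k_cache).shape)                         # torch.Size([256, 128])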
  14. Datasets
      • LongBench (Bai et al., 2023): the first bilingual (English and Chinese) multitask benchmark for long-context understanding. It consists of 21 datasets across 6 task categories: single-document QA, multi-document QA, summarization, few-shot learning, synthetic tasks, and code completion.
      • Passkey retrieval (Mohtashami & Jaggi, 2023): models are required to retrieve a random passkey hidden in a long document.
      • Needle in a Haystack (Kuratov et al., 2024): they proposed BABILong, a benchmark designed to assess model capabilities in extracting and processing distributed facts within extensive texts. BABILong hides algorithmically generated question-answering and reasoning problems inside a corpus of book texts and consists of 20 tasks designed to evaluate basic aspects of reasoning.
      • Few-shot testing (Brown et al., 2020): long inputs can be built in a few-shot format or by simulating multi-turn dialogues, in order to test the model's capabilities with long texts. Furthermore, for some reasoning-type tests, the Chain-of-Thought (CoT) strategy proposed in Wei et al. (2022) can be adopted to further increase the length of the few-shot texts.
  15. Evaluation Metrics
      • Per-token GPU memory usage: for the KV-Cache, the most intuitive optimization indicator is the memory space occupied by each token. The LLaMA2-7B model, as a typical example, theoretically occupies 0.5 MB of KV-Cache memory per token (worked check below).
      • Throughput and latency: throughput, usually measured in tokens per second (token/s), is how many new tokens the model can generate per second. In the decoding phase, latency is usually the time required to generate each new token, typically measured in milliseconds.
      • Perplexity (PPL): for each token, the model takes the natural log-likelihood of that token under the distribution predicted from its preceding tokens; these values are averaged (ANLL, the average natural log-likelihood) and PPL = exp(-ANLL). PPL provides a rough reference for changes in model performance: if PPL rises sharply, it usually means the model's ability has significantly degraded, for example losing language ability entirely.
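      A worked check of the 0.5 MB figure and the PPL formula (LLaMA2-7B dimensions of 32 layers, 32 KV heads, head_dim 128 and fp16 storage are assumed; the token probabilities are made up for illustration):

         import math

         per_token = 2 * 32 * 32 * 128 * 2                 # K and V, all layers, fp16
         print(per_token / 2**20, "MiB per token;", per_token * 4096 / 2**30, "GiB at 4096 tokens")

         # PPL = exp(-ANLL), where ANLL is the mean natural log-likelihood per token.
         token_probs = [0.25, 0.10, 0.40, 0.05]
         anll = sum(math.log(p) for p in token_probs) / len(token_probs)
         print(round(math.exp(-anll), 2))                  # 6.69; lower is better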
  16. Key Takeaways
      • Principles of KV-Cache optimization: the main goal is to reduce memory usage by compressing the Keys and Values in the KV pairs.
      • Trade-offs in deletion vs. compression: there is a trade-off between deleting less important KV pairs to save memory and compressing the entire KV-Cache without deletion. The former may impact model performance, while the latter focuses on retaining information.
      • Extremes in KV-Cache management: a potential future direction is to store the KV-Cache externally, turning KV-Cache management into a retrieval task.
      • Future directions in storage and retrieval technologies: the future of LLMs will likely see storage and retrieval technologies becoming as important as the computational models themselves, opening new possibilities for LLM efficiency and versatility.
  17. Reference
      Shi, L., Zhang, H., Yao, Y., Li, Z., & Zhao, H. (2024). Keep the Cost Down: A Review on Methods to Optimize LLM's KV-Cache Consumption (arXiv:2407.18003). arXiv. https://doi.org/10.48550/arXiv.2407.18003