

Jan Hauffa - A Case Study on Retrieval-Augmented Generation for Document Q&A: Experiences and Future Perspectives



Neural language models based on the Transformer architecture have been successfully applied to a wide range of Natural Language Generation tasks, but are held back by their limited context length, that is, their inability to simultaneously “pay attention” to all parts of a long document. Retrieval-Augmented Generation (RAG) is currently the most promising approach to overcome this limitation. By means of semantic similarity search, one can identify the parts of a document that are most relevant to the task at hand, and use only those parts as input to the language model.
In this talk, I demonstrate how RAG can be used to build a system for answering arbitrary questions, posed in natural language, about the content of documents (“document Q&A”). I discuss the challenges we faced when implementing document Q&A at NorCom, how to improve the performance of a document Q&A system, and how to reliably measure the performance in the first place.

MunichDataGeeks

October 31, 2023

Transcript

  1. Use Case: Document Q&A
     Ask questions about the content of a document in natural language, receive answers in natural language.
     ➔ How can we "teach" an LLM new factual knowledge?
     Example: given the question "Who invented the Transformer architecture?", the LLM should answer "Ashish Vaswani et al."

  2. AGENDA
     01 Teaching Factual Knowledge to an LLM
     02 Retrieval-Augmented Generation
     03 Measure, Debug, Improve

  3. First Attempt: Fine Tuning
     • Taking a similar approach to "Databricks Dolly-2" (implementation: Zaid Ur Rehman)
     • Base model Pythia-12b, 8-bit quantization, LoRA
     • Mix of supervised and self-supervised training:
       • German (machine) translation of the dolly-15k instruction dataset: https://huggingface.co/datasets/argilla/databricks-dolly-15k-curated-multilingual
       • German legal texts ("Bürgerliches Gesetzbuch", BGB – the German Civil Code)
     • 1x A100 (80 GB VRAM) on Azure Databricks, overall training time ~48h
     • Results:
       • Quality of generated German text substantially lower than that of English text
       • Unable to answer even simple questions correctly, e.g. "Was ist ein Unternehmer?" ("What is an entrepreneur?") yields "Unternehmer sind Personen, die eine wirtschaftliche Tätigkeit durchführen, die nicht der Erwerbsgeschäftsfelder entspricht, also etwa durch staatliche Beihilfen oder staatliche Subventionen profitieren." – an incoherent answer, roughly: "Entrepreneurs are persons who carry out an economic activity that does not correspond to the commercial business fields, i.e. profit for instance from state aid or state subsidies."

  4. Lessons Learned
     • Need a base model that has been trained on a sufficient amount of German-language text.
     • Fine tuning: what is it good for?
       • Suitable for controlling format and style of the expected output ("Instruction Fine Tuning", Chung et al., 2022).
       • Not suitable for the acquisition of factual knowledge from arbitrary, "natural" text.
     • Hypothesis: knowledge acquisition during training (and fine tuning) requires repeated exposure to the new facts in different contexts, phrased differently.
       • "Textbooks Are All You Need" (Gunasekar et al., 2023): teach a model to write Python code using a training dataset of synthetic textbooks.
       • "Reversal curse": "[an LLM] does not increase the probability P(b = a) after training on a = b" – data augmentation by paraphrasing helps (Berglund et al., 2023).

  5. Zero Shot, Few Shot, and In-Context Learning
     • Zero-shot learning: the prompt contains only the task description:
       "Translate English to German. English: How are you? German:"
     • Few-shot learning: the prompt contains the task description and a small number of examples:
       "Translate English to German. English: Good morning! German: Guten Morgen! English: How are you? German:"
     • Why does this work? "During unsupervised pre-training, a language model develops a broad set of skills and pattern recognition abilities. It then uses these abilities at inference time to rapidly adapt to or recognize the desired task." (Brown et al., 2020) → In-Context Learning (a small prompt-construction sketch follows below)

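     A minimal sketch of how such prompts could be assembled in code; the helper name and the example pairs are illustrative, not from the talk:

        # Minimal sketch: assembling zero-shot and few-shot prompts for translation.
        def build_translation_prompt(query, examples=()):
            parts = ["Translate English to German."]
            for english, german in examples:           # few-shot examples, if any
                parts.append(f"English: {english} German: {german}")
            parts.append(f"English: {query} German:")  # the instance the LLM should complete
            return " ".join(parts)

        # Zero-shot: task description only.
        print(build_translation_prompt("How are you?"))
        # Few-shot: task description plus one worked example.
        print(build_translation_prompt("How are you?", [("Good morning!", "Guten Morgen!")]))
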
  6. Document Q&A as Zero-Shot Learning?
     • Idea: construct a prompt from the content of the document and the question to be answered.
     • Advantages:
       • The model no longer has to provide factual knowledge, just language skills and basic reasoning.
       • No expensive fine tuning for each new document.
     • One important caveat: the limited context length of Transformer models → this only works for short documents! (see the sketch below)
       • Typical "open" LLMs: 2K tokens, i.e. 1-4 pages of text
       • The attention mechanism has quadratic complexity (time and space!) in the number of tokens.
       • Diverse attempts to "fix" attention: efficient attention variants (Tay et al., 2020), RoPE (Su et al., 2021), YaRN (Peng et al., 2023) – but no definitive solution yet.
     ➔ What about Q&A with 2 documents? 10 documents? An entire library?

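     A sketch of the naive "whole document in the prompt" approach and its context-length caveat; the prompt wording is an assumption, and any tokenizer with an encode() method (e.g. a Hugging Face tokenizer) can be plugged in:

        # Naive zero-shot document Q&A: put the entire document into the prompt.
        # Breaks down as soon as the document exceeds the model's context window.
        def build_doc_qa_prompt(document, question, tokenizer, max_context_tokens=2048):
            prompt = (
                "Read the following document:\n"
                f"{document}\n\n"
                f"Now answer this question about the document: {question}"
            )
            n_tokens = len(tokenizer.encode(prompt))
            if n_tokens > max_context_tokens:
                # Roughly 2K tokens correspond to 1-4 pages of text; longer documents need retrieval.
                raise ValueError(f"Prompt is {n_tokens} tokens, context window is {max_context_tokens}.")
            return prompt
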
  7. Retrieval-Augmented Generation for Document Q&A
     Two steps: Indexing and Question Answering (a code sketch of both follows below).
     • Indexing:
       1. Split an arbitrary number of documents into small chunks.
       2. For each chunk, compute a semantic embedding.
       3. Store the chunks and their embeddings in a vector database.
     • Question Answering:
       1. Compute an embedding of the given question. Use asymmetric embeddings to ensure that the question is close to potential answers in the embedding space (Wang et al., 2022).
       2. Perform an approximate nearest-neighbor search (ANN, e.g. Malkov and Yashunin, 2016) with the embedding of the question as the query.
       3. Construct a prompt for an LLM from the top-k result chunks (the "context") and the question.

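     A compact sketch of both steps, using the multilingual E5 embedding model mentioned later in the talk; brute-force cosine similarity stands in for the ANN search of a real vector database (e.g. Milvus), and the prompt template is taken from the pipeline diagram on the next slide:

        # Minimal RAG sketch: indexing + question answering with asymmetric E5 embeddings.
        import numpy as np
        from sentence_transformers import SentenceTransformer

        embedder = SentenceTransformer("intfloat/multilingual-e5-large")

        def index_chunks(chunks):
            # E5 expects the "passage: " prefix for documents (asymmetric embeddings) ...
            return embedder.encode([f"passage: {c}" for c in chunks], normalize_embeddings=True)

        def retrieve(question, chunks, chunk_embs, k=3):
            # ... and the "query: " prefix for questions.
            q = embedder.encode(f"query: {question}", normalize_embeddings=True)
            scores = chunk_embs @ q                    # cosine similarity (embeddings are normalized)
            return [chunks[i] for i in np.argsort(-scores)[:k]]

        def build_prompt(question, context):
            return ("Read the following information: " + " ".join(context) +
                    " Now use that information to answer the question: " + question)

        chunks = ["Dog goes 'woof'.", "Cat goes 'meow'.", "And the seal goes 'ow ow ow'."]
        embs = index_chunks(chunks)
        question = "What sound does a cat make?"
        print(build_prompt(question, retrieve(question, chunks, embs)))
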
  8. The Document Q&A Pipeline (diagram)
     • Indexing: documents → overlapping chunks ("The quick brown fox jumps", "brown fox jumps over", "jumps over the lazy dog.") → embeddings → vector database.
     • Question Answering: question ("What sound does a cat make?") → query embedding → vector database query → top-k most similar chunks ("Dog goes 'woof'.", "Cat goes 'meow'.", "And the seal goes 'ow ow ow'.") → prompt ("Read the following information: Dog goes 'woof'. Cat goes 'meow'. And the seal goes 'ow ow ow'. Now use that information to answer the question: What sound does a cat make?") → LLM → answer ("Cat goes 'meow'.").

  9. Demo – Technical Specs
     • Models
       • Embedding: Multilingual E5 large (Wang et al., 2022) – currently holds rank 6 on the MTEB leaderboard (Retrieval): https://huggingface.co/spaces/mteb/leaderboard
       • LLM: Llama-2 13b (https://ai.meta.com/llama/)
     • Hardware
       • Runs on an NVIDIA Titan X (12 GB VRAM) with 4-bit quantization.
       • CPU-only and mixed CPU/GPU inference: llama.cpp (https://github.com/ggerganov/llama.cpp)
     • Software
       • Vector database: Milvus
       • Document management: DaSense (text extraction, OCR, ACLs, browsing/tagging/searching)
     ➔ Experimentation is possible, even with consumer hardware! (But you'll want more VRAM for larger context, less quantization, larger models, …)

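     For illustration, a hedged sketch of the generation side on consumer hardware using the llama-cpp-python bindings; the GGUF file name, the number of offloaded layers, and the sampling settings are assumptions, not the exact demo configuration:

        # Sketch: mixed CPU/GPU inference of a 4-bit quantized Llama-2 13b via llama-cpp-python.
        from llama_cpp import Llama

        llm = Llama(
            model_path="llama-2-13b.Q4_K_M.gguf",  # hypothetical 4-bit quantized model file
            n_ctx=4096,                            # context window in tokens
            n_gpu_layers=30,                       # offload part of the model to the 12 GB GPU
        )

        prompt = ("Read the following information: Cat goes 'meow'. "
                  "Now use that information to answer the question: What sound does a cat make?")
        out = llm(prompt, max_tokens=64, temperature=0.1)
        print(out["choices"][0]["text"])
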
  10. Retrieval-Augmented Generation for Everything Else (1/2)
      The same recipe generalizes beyond document Q&A once a few words are swapped (shown as "old → new"):
      • Indexing:
        1. Split an arbitrary number of documents → amount of data into small chunks.
        2. For each chunk, compute a semantic embedding.
        3. Store the chunks and their embeddings in a vector database.
      • Question Answering → Text Generation:
        1. Compute an embedding of the given question → a query embedding, somehow.
        2. Perform an approximate nearest-neighbor search (ANN) with the query embedding.
        3. Construct a prompt for an LLM from the top-k result chunks and the question → task description.

  11. Retrieval-Augmented Generation for Everything Else (2/2)
      • Chat: store individual lines of chat history, use symmetric embeddings → search for the most similar parts of past conversations.
      • Translation: store bilingual sentence pairs, compute embeddings for sentences in the source language only → search for the most relevant examples for few-shot learning.
      • Summarization: perform clustering on the chunk embeddings, find the chunks most similar to each centroid → topical summary (sketched below).
      • etc.

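      A sketch of the summarization variant, assuming chunk embeddings have already been computed (e.g. as in the earlier E5 sketch); the cluster count and the use of scikit-learn's KMeans are my own choices:

         # Sketch: topical summarization by clustering chunk embeddings.
         # The chunk closest to each cluster centroid represents one topic of the document.
         import numpy as np
         from sklearn.cluster import KMeans

         def representative_chunks(chunks, embeddings, n_topics=5):
             km = KMeans(n_clusters=n_topics, n_init=10).fit(embeddings)
             picks = []
             for centroid in km.cluster_centers_:
                 idx = int(np.argmin(np.linalg.norm(embeddings - centroid, axis=1)))
                 picks.append(chunks[idx])
             return picks

         # The selected chunks are then put into a prompt such as
         # "Summarize the following passages: ..." and passed to the LLM.
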
  12. Evaluation
      Given a "ground truth" dataset of source documents and question/answer pairs, how do we evaluate a document Q&A system?
      • Human evaluation:
        • Absolute rating: "Is the content of the generated text correct / equivalent to the reference answer?" → accuracy
        • Relative rating: "Which of the two answers is more accurate / more similar to the reference?" → ranking, Elo
        • Issues: does not scale, annotators need domain knowledge, inter-annotator agreement, …
      • Evaluation by LLM (an example judge prompt is sketched below):
        • Same questions as before, but an LLM replaces the human.
        • A stronger model assesses the weaker model; GPT-4 is generally acknowledged as the "strongest".
        • Issues: unknown bias of GPT-4, token costs per test run, LLM drift (Chen et al., 2023)

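      A minimal sketch of an absolute-rating judge prompt; the wording and the CORRECT/INCORRECT convention are assumptions, not the prompt used at NorCom:

         # Sketch: "LLM as judge" prompt for absolute rating of a generated answer.
         def build_judge_prompt(question, reference_answer, generated_answer):
             return (
                 "You are grading a question answering system.\n"
                 f"Question: {question}\n"
                 f"Reference answer: {reference_answer}\n"
                 f"Generated answer: {generated_answer}\n"
                 "Is the generated answer factually correct and equivalent to the reference answer? "
                 "Reply with exactly one word: CORRECT or INCORRECT."
             )

         # Accuracy is the fraction of test questions the judge model marks as CORRECT.
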
  13. Can an LLM Evaluate Itself?
      Decompose the evaluation into multiple related tasks:
      • Faithfulness: can all claims made in the answer be inferred from the context?
      • Answer Relevance: does the answer directly and appropriately address the question?
      • Context Recall: proportion of ground-truth sentences that are consistent with the context (here, context == the top-k chunks!).
      • Context Precision: proportion of the context that is relevant for answering the question.
      Plus a direct assessment of correctness via an LLM (prompting) or an embedding model (semantic similarity).
      Implemented in ragas: https://github.com/explodinggradients/ragas (usage sketched below)

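      A hedged usage sketch of the ragas library as of late 2023; metric names and dataset column names may differ in other versions, and the example values are made up:

         # Sketch: evaluating RAG output with ragas (each metric is itself computed by an LLM).
         from datasets import Dataset
         from ragas import evaluate
         from ragas.metrics import answer_relevancy, context_precision, context_recall, faithfulness

         eval_data = Dataset.from_dict({
             "question": ["What sound does a cat make?"],
             "answer": ["Cat goes 'meow'."],                          # answer produced by the RAG pipeline
             "contexts": [["Dog goes 'woof'.", "Cat goes 'meow'."]],  # the top-k retrieved chunks
             "ground_truths": [["A cat meows."]],                     # reference answer(s)
         })
         scores = evaluate(eval_data, metrics=[faithfulness, answer_relevancy, context_precision, context_recall])
         print(scores)
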
  14. Debugging
      (Same pipeline diagram as slide 8: documents → overlapping chunks → embeddings → vector database; question → query embedding → top-k most similar chunks → prompt → LLM → answer.)
      This is not a black box!

  15. Debugging
      • Chunking issues?
        • Better heuristics for choosing chunk boundaries, e.g. via document segmentation (Li et al., 2022) → logical units of text
      • Retrieval issues?
        • More/different embeddings: sparse embeddings (SPLADE; Formal et al., 2021), Hypothetical Document Embeddings (HyDE; Gao et al., 2022)
        • Combine ANN and traditional retrieval such as BM25 (sketched below)
        • Re-ranking (Glass et al., 2022)
        • ANN parameter tuning
      • Generation issues?
        • Larger model, less quantization
        • Sampling algorithm and parameters (temperature, repetition penalty, …)

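      As one example from the retrieval list, a sketch of hybrid retrieval that mixes dense (embedding) scores with lexical BM25 scores; the rank_bm25 package, the min-max normalization, and the equal weighting are my own choices (reciprocal rank fusion is a common alternative):

         # Sketch: hybrid retrieval combining dense similarity scores with BM25 scores.
         import numpy as np
         from rank_bm25 import BM25Okapi

         def hybrid_scores(question, chunks, dense_scores, alpha=0.5):
             bm25 = BM25Okapi([c.lower().split() for c in chunks])
             lexical = np.array(bm25.get_scores(question.lower().split()))

             def minmax(x):
                 return (x - x.min()) / (x.max() - x.min() + 1e-9)

             # alpha balances dense vs. lexical retrieval; rank chunks by the combined score.
             return alpha * minmax(dense_scores) + (1 - alpha) * minmax(lexical)
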
  16. What Kind of Questions can RAG (not) Answer? (1/2)
      • Factual questions about the content of the document: "What's a transformer model?"
      • Questions that reference earlier questions: "Who invented it?"
        → Store past questions/answers in the prompt or in the vector DB.
      • Questions about meta-data: "Which section contains the definition of the Transformer architecture?", "How long is the paper?"
        → Store meta-data in the vector DB, per document or inside the chunks.
      • Questions that require background knowledge: "How was the Transformer architecture received by the scientific community?"
        → Store supplementary documents in the vector DB.

  17. What Kind of Questions can RAG (not) Answer? (2/2)
      • Questions about tables, photos, illustrations: "Which model achieves the best average BLEU score?", "What happens to the product of Q and K?"
        → Would need a multimodal LLM like LLaVA (Liu et al., 2023), but also multimodal embeddings!
        → Hack: generate a textual description and embed that.
      • Questions that cannot be answered based on single chunks: "Give me a short summary of the paper!", "Can you translate the text into French?"
      • Asking for opinions, unrelated questions, insults, random nonsense, …
      ➔ What kind of questions are my users asking?

  18. Thank you!
      Contact us: Jan Hauffa (jah), Zaid Ur Rehman (zur), Hoai-Nam Tran (hnt) – <initials>@norcom.de
      NorCom Information Technology GmbH & Co. KGaA
      Gabelsbergerstraße 4, 80333 München
      T +49 (0)89 939 48 0
      F +49 (0)89 939 48 111
      E [email protected]

  19. References (1/2)
      • Gunasekar et al., 2023: Textbooks Are All You Need. https://arxiv.org/abs/2306.11644
      • Berglund et al., 2023: Taken out of context: On measuring situational awareness in LLMs. https://arxiv.org/abs/2309.00667
      • Chung et al., 2022: Scaling Instruction-Finetuned Language Models. https://arxiv.org/abs/2210.11416
      • Brown et al., 2020: Language Models are Few-Shot Learners. https://arxiv.org/abs/2005.14165
      • Tay et al., 2020: Efficient Transformers: A Survey. https://arxiv.org/abs/2009.06732
      • Su et al., 2021: RoFormer: Enhanced Transformer with Rotary Position Embedding. https://arxiv.org/abs/2104.09864
      • Peng et al., 2023: YaRN: Efficient Context Window Extension of Large Language Models. https://arxiv.org/abs/2309.00071
      • Wang et al., 2022: Text Embeddings by Weakly-Supervised Contrastive Pre-training. https://arxiv.org/abs/2212.03533

  20. References (2/2)
      • Malkov and Yashunin, 2016: Efficient and robust approximate nearest neighbor search using Hierarchical Navigable Small World graphs. https://arxiv.org/abs/1603.09320
      • Chen et al., 2023: How is ChatGPT's behavior changing over time? https://arxiv.org/abs/2307.09009
      • Li et al., 2022: DiT: Self-supervised Pre-training for Document Image Transformer. https://arxiv.org/abs/2203.02378
      • Formal et al., 2021: SPLADE: Sparse Lexical and Expansion Model for First Stage Ranking. https://arxiv.org/abs/2107.05720
      • Gao et al., 2022: Precise Zero-Shot Dense Retrieval without Relevance Labels. https://arxiv.org/abs/2212.10496
      • Glass et al., 2022: Re2G: Retrieve, Rerank, Generate. Proceedings of NAACL-HLT. https://aclanthology.org/2022.naacl-main.194
      • Liu et al., 2023: Visual Instruction Tuning. https://arxiv.org/abs/2304.08485