Jan Hauffa - A Case Study on Retrieval-Augmented Generation for Document Q&A: Experiences and Future Perspectives 


Use Case: Document Q&A
Ask questions about the content of a document in natural language, receive answers in natural language.
Question: “Who invented the Transformer architecture?” → LLM → Answer: “Ashish Vaswani et al.”
➔ How to “teach” an LLM new factual knowledge?

AGENDA
01 Teaching Factual Knowledge to an LLM
02 Retrieval-Augmented Generation
03 Measure, Debug, Improve

01 Teaching Factual Knowledge to an LLM

First Attempt: Fine Tuning
• Taking a similar approach to “Databricks Dolly-2” (Implementation: Zaid Ur Rehman)
• Base model Pythia-12b, 8-bit quantization, LoRA
• Mix of supervised and self-supervised training:
  • German (machine) translation of the dolly-15k instruction dataset: https://huggingface.co/datasets/argilla/databricks-dolly-15k-curated-multilingual
  • German legal texts (“Bürgerliches Gesetzbuch”, BGB, the German Civil Code)
• 1x A100 (80 GB VRAM) on Azure Databricks, overall training time ~48h
• Results
  • Quality of generated German text substantially lower than English text
  • Unable to answer even simple questions correctly:
    • Question: “Was ist ein Unternehmer?” (“What is an entrepreneur?”)
    • Answer: “Unternehmer sind Personen, die eine wirtschaftliche Tätigkeit durchführen, die nicht der Erwerbsgeschäftsfelder entspricht, also etwa durch staatliche Beihilfen oder staatliche Subventionen profitieren.” (roughly: “Entrepreneurs are persons who carry out an economic activity that does not correspond to the fields of commercial business, that is, who profit for instance from state aid or state subsidies.”)

Lessons Learned
• Need a base model that has been trained on a sufficient amount of German language text.
• Fine tuning: What is it good for?
  • Suitable for controlling format and style of the expected output (“Instruction Fine Tuning”, Chung et al., 2022).
  • Not suitable for the acquisition of factual knowledge from arbitrary, “natural” text.
• Hypothesis: Knowledge acquisition during training (and fine tuning) requires repeated exposure to the new facts in different contexts, phrased differently.
  • “Textbooks Are All You Need” (Gunasekar et al., 2023): Teach a model to write Python code using a training dataset of synthetic textbooks.
  • “Curse of reversal”: “[an LLM] does not increase the probability P(b = a) after training on a = b”; data augmentation by paraphrasing helps (Berglund et al., 2023).

Zero Shot, Few Shot, and In-Context Learning
• Zero Shot Learning: Prompt contains the task description
  • Translate English to German. English: How are you? German:
• Few Shot Learning: Prompt contains task description and a small number of examples
  • Translate English to German. English: Good morning! German: Guten Morgen! English: How are you? German:
• Why does this work?
  • “During unsupervised pre-training, a language model develops a broad set of skills and pattern recognition abilities. It then uses these abilities at inference time to rapidly adapt to or recognize the desired task.” (Brown et al., 2020) → In-Context Learning
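The two prompt styles above differ only in whether worked examples precede the query. A minimal sketch of that difference follows; the function name `build_translation_prompt` and the line-per-field layout are illustrative assumptions, not something prescribed by the talk.

```python
def build_translation_prompt(task, examples, query, src="English", tgt="German"):
    """Build a zero-shot (no examples) or few-shot (some examples) prompt.

    Hypothetical helper for illustration; the layout is an assumption.
    """
    lines = [task]
    for source_text, target_text in examples:
        lines.append(f"{src}: {source_text}")
        lines.append(f"{tgt}: {target_text}")
    # The last target line is left open so that the model completes it.
    lines.append(f"{src}: {query}")
    lines.append(f"{tgt}:")
    return "\n".join(lines)


task = "Translate English to German."
zero_shot = build_translation_prompt(task, [], "How are you?")
few_shot = build_translation_prompt(
    task, [("Good morning!", "Guten Morgen!")], "How are you?"
)
print(few_shot)
```

Both prompts end with an open "German:" line; the few-shot variant simply gives the model one completed pair to imitate first.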

Document Q&A as Zero Shot Learning?
• Idea: Construct a prompt from the content of the document and the question to be answered.
• Advantages:
  • Model no longer has to provide factual knowledge, just language skills and basic reasoning.
  • No expensive fine tuning for each new document.
• One important caveat:
  • Limited context length of Transformer models → This only works for short documents!
  • Typical “open” LLMs: 2K tokens, 1-4 pages of text
  • Attention mechanism has quadratic complexity (time and space!) in the number of tokens.
  • Diverse attempts to “fix” attention: Efficient attention (Tay et al., 2020), RoPE (Su et al., 2021), YaRN (Peng et al., 2023), but no definitive solution yet.
➔ What about Q&A with 2 documents? 10 documents? An entire library?
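To make the caveat concrete, the sketch below checks whether a document plus question fits a 2K-token context and shows how the number of attention scores grows with sequence length. The token estimate (about 1.3 tokens per whitespace-separated word) and the 256 tokens reserved for the answer are rough assumptions for illustration; a real system would count tokens with the model's own tokenizer.

```python
def approx_tokens(text, tokens_per_word=1.3):
    """Very rough token estimate (assumption); use the real tokenizer in practice."""
    return int(len(text.split()) * tokens_per_word)


def fits_in_context(document, question, context_length=2048, reserved_for_answer=256):
    """Check whether document + question + room for the answer fit the context."""
    needed = approx_tokens(document) + approx_tokens(question) + reserved_for_answer
    return needed <= context_length


def attention_scores(num_tokens):
    """Self-attention compares every token with every token: n * n scores per head."""
    return num_tokens * num_tokens


short_doc = "word " * 100
long_doc = "word " * 5000
print(fits_in_context(short_doc, "What is this?"))  # short document fits
print(fits_in_context(long_doc, "What is this?"))   # long document does not
print(attention_scores(4096) // attention_scores(2048))  # doubling n quadruples the work
```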

02 Retrieval-Augmented Generation

Retrieval-Augmented Generation for Document Q&A
• Two steps: Indexing, Question Answering
• Indexing:
  1. Split an arbitrary number of documents into small chunks.
  2. For each chunk, compute a semantic embedding.
  3. Store the chunks and their embeddings in a vector database.
• Question Answering:
  1. Compute an embedding of the given question. Use asymmetric embeddings to ensure that the question is close to potential answers in the embedding space (Wang et al., 2022).
  2. Perform an approximate nearest-neighbor search (ANN, e.g. Malkov and Yashunin, 2016) with the embedding of the question as the query.
  3. Construct a prompt for an LLM from the top-k result chunks (“context”) and the question.
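The indexing and retrieval steps can be sketched end to end in a few lines. This is a toy illustration under loud assumptions: `embed` is a bag-of-words count vector standing in for a real semantic embedding model, and `ToyVectorStore` performs an exact brute-force search rather than the approximate nearest-neighbor search a vector database would use. Both names are hypothetical.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy stand-in for a semantic embedding (assumption): word counts."""
    return Counter(re.findall(r"\w+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class ToyVectorStore:
    """In-memory store with exact search; a real system would use ANN."""

    def __init__(self):
        self.entries = []  # list of (chunk, embedding) pairs

    def add(self, chunk):
        self.entries.append((chunk, embed(chunk)))

    def search(self, question, k=2):
        query = embed(question)
        ranked = sorted(self.entries, key=lambda e: cosine(query, e[1]), reverse=True)
        return [chunk for chunk, _ in ranked[:k]]


store = ToyVectorStore()
for chunk in ["Dog goes 'woof'.", "Cat goes 'meow'.", "And the seal goes 'ow ow ow'."]:
    store.add(chunk)

top = store.search("What sound does a cat make?", k=1)
print(top)
```

With word counts, the question only matches chunks that share its vocabulary. A real embedding model is what allows a question to land near an answer that is phrased differently.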

Semantic Similarity Search using Embeddings

The Document Q&A Pipeline
• Indexing: Documents → Overlapping chunks (“The quick brown fox jumps”, “brown fox jumps over”, “jumps over the lazy dog.”) → Embeddings → Vector database
• Question Answering: Question (“What sound does a cat make?”) → Query embedding → Vector database query → Top-k most similar chunks (“Dog goes ‘woof’.”, “Cat goes ‘meow’.”, “And the seal goes ‘ow ow ow’.”) → Prompt → LLM → Answer (“Cat goes ‘meow’.”)
• Prompt: “Read the following information: Dog goes ‘woof’. Cat goes ‘meow’. And the seal goes ‘ow ow ow’. Now use that information to answer the question: What sound does a cat make?”
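Two pieces of this pipeline are easy to show in isolation: overlapping chunking and prompt construction. The sketch below is an assumption-laden illustration. It chunks by whitespace-separated words with a fixed window and stride, which yields chunks similar to, but not identical with, the ones on the slide, and it fills the prompt template quoted above. The function names are hypothetical.

```python
def chunk_words(text, size=5, stride=2):
    """Split text into overlapping word windows (window and stride are assumptions)."""
    words = text.split()
    chunks = []
    for start in range(0, len(words), stride):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # the last window already reaches the end of the text
    return chunks


def build_rag_prompt(context_chunks, question):
    """Fill the prompt template from the slide with retrieved chunks and the question."""
    context = " ".join(context_chunks)
    return (
        f"Read the following information: {context} "
        f"Now use that information to answer the question: {question}"
    )


chunks = chunk_words("The quick brown fox jumps over the lazy dog.")
prompt = build_rag_prompt(
    ["Dog goes 'woof'.", "Cat goes 'meow'.", "And the seal goes 'ow ow ow'."],
    "What sound does a cat make?",
)
print(chunks)
print(prompt)
```

The overlap means a sentence cut by one chunk boundary is likely to appear whole in a neighboring chunk, at the cost of storing some text twice.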

Demo
Implementation: Hoai-Nam Tran

Demo: Technical Specs
• Models
  • Embedding: Multilingual E5 large (Wang et al., 2022). Currently holds rank 6 on the MTEB leaderboard (Retrieval): https://huggingface.co/spaces/mteb/leaderboard
  • LLM: Llama-2 13b (https://ai.meta.com/llama/)
• Hardware
  • Runs on an NVIDIA Titan X (12 GB VRAM) with 4-bit quantization.
  • CPU-only and mixed CPU/GPU inference: llama.cpp (https://github.com/ggerganov/llama.cpp)
• Software
  • Vector database: Milvus
  • Document management: DaSense (Text extraction, OCR, ACLs, browsing/tagging/searching)
➔ Experimentation is possible, even with consumer hardware! (But you’ll want more VRAM for larger context, less quantization, larger models, …)
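A back-of-the-envelope calculation shows why 4-bit quantization matters for a 12 GB card. The sketch below counts the memory of the weights alone, which is a simplifying assumption: it ignores the KV cache, activations, and the extra scaling data that real quantization formats store, so actual usage is higher.

```python
def weight_memory_gb(num_params, bits_per_weight):
    """Memory needed for the weights alone, in GB (10^9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9


params_13b = 13e9  # nominal parameter count of a 13b model
fp16_gb = weight_memory_gb(params_13b, 16)
int4_gb = weight_memory_gb(params_13b, 4)
print(f"16-bit: {fp16_gb:.1f} GB, 4-bit: {int4_gb:.1f} GB")
```

Under these assumptions the 16-bit weights (about 26 GB) exceed 12 GB of VRAM, while the 4-bit weights (about 6.5 GB) leave room for context.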

Retrieval-Augmented Generation for Everything Else (1/2)
• Indexing:
  1. Split an arbitrary amount of data into small chunks.
  2. For each chunk, compute a semantic embedding.
  3. Store the chunks and their embeddings in a vector database.
• Text Generation:
  1. Compute a query embedding somehow.
  2. Perform an approximate nearest-neighbor search (ANN) with the query embedding.
  3. Construct a prompt for an LLM from the top-k result chunks and the task description.

Retrieval-Augmented Generation for Everything Else (2/2)
• Chat:
  • Store individual lines of chat history, use symmetric embeddings: → Search for most similar parts of past conversation.
• Translation:
  • Store bilingual sentence pairs, compute embeddings for sentences in source language only: → Search for most relevant examples for few-shot learning.
• Summarization:
  • Perform clustering on chunk embeddings, find most similar chunks to a centroid: → Topical summary.
• etc.
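The summarization idea can be illustrated with the step that follows clustering: given the embeddings of one cluster, pick the chunk closest to the centroid as its representative. The sketch assumes the clustering has already been done and uses hypothetical two-dimensional embeddings. The function names are illustrative.

```python
import math


def centroid(vectors):
    """Component-wise mean of equally sized vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def cosine(a, b):
    """Cosine similarity between two dense vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def most_central_chunk(chunks, embeddings):
    """Return the chunk whose embedding is most similar to the cluster centroid."""
    center = centroid(embeddings)
    best = max(range(len(chunks)), key=lambda i: cosine(embeddings[i], center))
    return chunks[best]


# Hypothetical chunks of one cluster with made-up 2-d embeddings.
cluster_chunks = ["chunk about topic A", "chunk covering A and B", "chunk about topic B"]
cluster_embeddings = [[1.0, 0.0], [0.7, 0.7], [0.0, 1.0]]
representative = most_central_chunk(cluster_chunks, cluster_embeddings)
print(representative)
```

Repeating this for every cluster gives one representative chunk per topic, which can then be passed to the LLM as context for a summary.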

03 Measure, Debug, Improve

Evaluation
• Given a “ground truth” dataset of source documents and question/answer pairs, how to evaluate a document Q&A system?
• Human evaluation:
  • Absolute rating: “Is the content of the generated text correct / equivalent to the reference answer?” → Accuracy
  • Relative rating: “Which of the two answers is more accurate / more similar to the reference?” → Ranking, Elo
  • Issues: Does not scale, annotators need domain knowledge, agreement, …
• Evaluation by LLM:
  • Same questions as before, but LLM replaces human.
  • Stronger model assesses weaker model; GPT-4 generally acknowledged as “strongest”.
  • Issues: Unknown bias of GPT-4, token costs per test run, LLM drift (Chen et al., 2023)
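Relative ratings can be turned into a ranking with the standard Elo update: each pairwise judgment shifts both ratings by the difference between the actual and the expected outcome. The sketch below is a minimal illustration. The starting rating of 1000, the K-factor of 32, and the system names are assumptions, not values from the talk.

```python
def expected_score(rating_a, rating_b):
    """Expected score of A against B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))


def update_elo(rating_a, rating_b, score_a, k=32):
    """Update both ratings; score_a is 1.0 if A's answer was preferred, 0.0 if B's, 0.5 for a tie."""
    exp_a = expected_score(rating_a, rating_b)
    new_a = rating_a + k * (score_a - exp_a)
    new_b = rating_b + k * (exp_a - score_a)
    return new_a, new_b


# Hypothetical systems and one pairwise judgment: system_a's answer was preferred.
ratings = {"system_a": 1000.0, "system_b": 1000.0}
judgments = [("system_a", "system_b", 1.0)]
for a, b, score_a in judgments:
    ratings[a], ratings[b] = update_elo(ratings[a], ratings[b], score_a)
print(ratings)
```

The same update works whether the judgment comes from a human annotator or from an LLM acting as the judge.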

Can an LLM Evaluate Itself?
Decompose the evaluation into multiple related tasks:
• Faithfulness: Can all claims that are made in the answer be inferred from the context?
• Answer Relevance: Does the answer directly and appropriately address the question?
• Context Recall: Proportion of ground truth sentences that are consistent with context. (here: context == top-k chunks!)
• Context Precision: Proportion of context that is relevant for answering the question.
+ Direct assessment of correctness via LLM (prompting) or embedding model (semantic similarity).
https://github.com/explodinggradients/ragas
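The two context metrics reduce to proportions once a judge has produced a yes/no verdict per item. The sketch below follows the definitions as worded on the slide and is not the ragas API; the library's actual computation may differ. The judge functions are toy string checks standing in for LLM calls, and all names are hypothetical.

```python
def proportion(verdicts):
    """Share of True verdicts; 0.0 for an empty list."""
    return sum(verdicts) / len(verdicts) if verdicts else 0.0


def context_recall(ground_truth_sentences, context, is_supported):
    """Proportion of ground truth sentences that the judge finds consistent with the context."""
    return proportion([is_supported(s, context) for s in ground_truth_sentences])


def context_precision(chunks, question, is_relevant):
    """Proportion of retrieved chunks that the judge finds relevant to the question."""
    return proportion([is_relevant(c, question) for c in chunks])


# Toy judges (assumption): in practice each verdict would come from an LLM prompt.
def toy_supported(sentence, context):
    return sentence.lower() in context.lower()


def toy_relevant(chunk, question):
    return "cat" in chunk.lower()


top_k = ["Dog goes 'woof'.", "Cat goes 'meow'."]
recall = context_recall(
    ["Cat goes 'meow'.", "Cows go 'moo'."], " ".join(top_k), toy_supported
)
precision = context_precision(top_k, "What sound does a cat make?", toy_relevant)
print(recall, precision)
```

Low recall points at retrieval missing needed chunks, while low precision points at retrieval returning noise that the LLM then has to ignore.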

Debugging
• Indexing: Documents → Overlapping chunks → Embeddings → Vector database
• Question Answering: Question → Query embedding → Vector database query → Top-k most similar chunks → Prompt → LLM → Answer
This is not a black box!

Debugging
• Chunking issues?
  • Better heuristics for choosing chunk boundaries, e.g. via document segmentation (Li et al., 2022) → logical units of text
• Retrieval issues?
  • More/different embeddings: Sparse Embeddings (SPLADE; Formal et al., 2021), Hypothetical Document Embeddings (HyDE; Gao et al., 2022)
  • Combine ANN and traditional retrieval (e.g. BM25)
  • Re-ranking (Glass et al., 2022)
  • ANN parameter tuning
• Generation issues?
  • Larger model, less quantization
  • Sampling algorithm and parameters (temperature, repetition penalty, …)
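One common way to combine an ANN result list with a BM25 result list is reciprocal rank fusion (RRF), which needs only the ranks, not comparable scores. The talk does not say which fusion method was used, so RRF is a swapped-in technique here, and the chunk ids and the constant k=60 are illustrative assumptions.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of ids: each list adds 1 / (k + rank) to an id's score."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical result lists from the two retrievers, best hit first.
ann_hits = ["c2", "c1", "c3"]
bm25_hits = ["c1", "c4", "c2"]
fused = reciprocal_rank_fusion([ann_hits, bm25_hits])
print(fused)
```

Chunks that appear near the top of both lists rise to the front, while chunks found by only one retriever are kept but ranked lower.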

What Kind of Questions can RAG (not) Answer? (1/2)
• Factual questions about the content of the document
  • “What’s a transformer model?”
• Questions that reference earlier questions
  • “Who invented it?”
  • → Store past questions/answers in prompt or vector DB.
• Questions about meta-data
  • “Which section contains the definition of the Transformer architecture?”
  • “How long is the paper?”
  • → Store meta-data in vector DB, per document or inside the chunks.
• Questions that require background knowledge
  • “How was the Transformer architecture received by the scientific community?”
  • → Store supplementary documents in vector DB.
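Two of the remedies above, keeping past questions and answers in the prompt and storing meta-data inside the chunks, can be sketched together. The `Chunk` record, the bracketed meta-data prefix, and the prompt layout are all assumptions made for illustration. The example answer and section value are made up.

```python
from dataclasses import dataclass, field


@dataclass
class Chunk:
    """A retrieved chunk with optional meta-data (hypothetical record layout)."""
    text: str
    metadata: dict = field(default_factory=dict)


def render_chunk(chunk):
    """Prefix the chunk text with its meta-data so the LLM can refer to it."""
    meta = ", ".join(f"{key}: {value}" for key, value in chunk.metadata.items())
    return f"[{meta}] {chunk.text}" if meta else chunk.text


def build_followup_prompt(chunks, history, question):
    """Build a prompt from chunks, past question/answer pairs, and the new question."""
    lines = ["Read the following information:"]
    lines += [render_chunk(c) for c in chunks]
    if history:
        lines.append("Previous conversation:")
        for past_question, past_answer in history:
            lines.append(f"Q: {past_question}")
            lines.append(f"A: {past_answer}")
    lines.append(f"Now use that information to answer the question: {question}")
    return "\n".join(lines)


chunks = [Chunk("The Transformer is based solely on attention mechanisms.",
                {"section": "Abstract"})]
history = [("What's a transformer model?", "A model based on attention.")]
followup_prompt = build_followup_prompt(chunks, history, "Who invented it?")
print(followup_prompt)
```

With the earlier exchange in the prompt, the pronoun in “Who invented it?” has something to refer to. Retrieval for such a follow-up may still need the earlier question folded into the query.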

What Kind of Questions can RAG (not) Answer? (2/2)
• Questions about tables, photos, illustrations
  • “Which model achieves the best average BLEU score?”
  • “What happens to the product of Q and K?”
  • → Would need a multimodal LLM like LLaVA (Liu et al., 2023), but also multimodal embeddings!
  • Hack: Generate a textual description and embed that.
• Questions that cannot be answered based on single chunks
  • “Give me a short summary of the paper!”
  • “Can you translate the text into French?”
• Asking for opinions, unrelated questions, insults, random nonsense, …
➔ What kind of questions are my users asking?

Thank you!
Contact us: Jan Hauffa (jah), Zaid Ur Rehman (zur), Hoai-Nam Tran (hnt): <initials>@norcom.de
NorCom Information Technology GmbH & Co. KGaA
Gabelsbergerstraße 4, 80333 München
T +49 (0) 89 939 48 0
F +49 (0) 89 939 48 111

References (1/2)
• Gunasekar et al., 2023: Textbooks Are All You Need. https://arxiv.org/abs/2306.11644
• Berglund et al., 2023: Taken out of context: On measuring situational awareness in LLMs. https://arxiv.org/abs/2309.00667
• Chung et al., 2022: Scaling Instruction-Finetuned Language Models. https://arxiv.org/abs/2210.11416
• Brown et al., 2020: Language Models are Few-Shot Learners. https://arxiv.org/abs/2005.14165
• Tay et al., 2020: Efficient Transformers: A Survey. https://arxiv.org/abs/2009.06732
• Su et al., 2021: RoFormer: Enhanced Transformer with Rotary Position Embedding. https://arxiv.org/abs/2104.09864
• Peng et al., 2023: YaRN: Efficient Context Window Extension of Large Language Models. https://arxiv.org/abs/2309.00071
• Wang et al., 2022: Text Embeddings by Weakly-Supervised Contrastive Pre-training. https://arxiv.org/abs/2212.03533

References (2/2)
• Malkov and Yashunin, 2016: Efficient and robust approximate nearest neighbor search using Hierarchical Navigable Small World graphs. https://arxiv.org/abs/1603.09320
• Chen et al., 2023: How is ChatGPT's behavior changing over time? https://arxiv.org/abs/2307.09009
• Li et al., 2022: DiT: Self-supervised Pre-training for Document Image Transformer. https://arxiv.org/abs/2203.02378
• Formal et al., 2021: SPLADE: Sparse Lexical and Expansion Model for First Stage Ranking. https://arxiv.org/abs/2107.05720
• Gao et al., 2022: Precise Zero-Shot Dense Retrieval without Relevance Labels. https://arxiv.org/abs/2212.10496
• Glass et al., 2022: Re2G: Retrieve, Rerank, Generate. Proceedings of NAACL-HLT. https://aclanthology.org/2022.naacl-main.194
• Liu et al., 2023: Visual Instruction Tuning. https://arxiv.org/abs/2304.08485
