DWX 2025 - Talk to your data

Slides for my talk at DWX 2025

Sebastian Gingter

July 03, 2025

Transcript

  1. "Talk to your data": Signifikant bessere LLM-RAG-Lösungen durch Real-World Tipps (Significantly better LLM-RAG solutions through real-world tips)

    ▪ What to expect
      ▪ Background knowledge and theory on RAG
      ▪ Overview of semantic search
      ▪ Problems that can occur
      ▪ Pragmatic methods for using your own data in RAG
      ▪ Demos (Python)
    ▪ What not to expect
      ▪ ChatGPT, CoPilot(s)
      ▪ ML fundamentals
      ▪ Deep dives into LLMs, vector databases, LangChain
  2. Sebastian Gingter, Developer Consultant @ Thinktecture AG

    ▪ Generative AI in business settings
    ▪ Flexible and scalable backends
    ▪ All things .NET
    ▪ Pragmatic end-to-end architectures
    ▪ Developer productivity
    ▪ Software quality
    [email protected] @phoenixhawk https://www.thinktecture.com
  3. Agenda

    ▪ Short introduction to RAG
    ▪ Embeddings (and a bit of theory)
    ▪ Indexing
    ▪ Retrieval
    ▪ Not good enough? – Indexing II
    ▪ HyDE & alternative indexing methods
    ▪ Conclusion
  4. Use case: Retrieval-augmented generation (RAG)

    [Diagram] Indexing / Embedding: Text, Cleanup & Split, Embedding model, Embedding, Save, Vector DB
    [Diagram] QA: Question, Embedding model, Embedding, Query, Vector DB, Relevant Text, Question, LLM
  5. Alternative: Agentic RAG

    ▪ Search is provided as a tool to the LLM
    ▪ The LLM can then decide on its own to call the search tool
    ▪ The LLM also decides the search term (could be problematic)
  6. Semantic Search

    ▪ Classic search: lexical
      ▪ Compares words, parts of words and variants
      ▪ Classic SQL: WHERE content LIKE '%searchterm%'
      ▪ We can only find things when we know the search term appears somewhere in the text
    ▪ New: Semantic search
      ▪ Compares for the same contextual meaning
      ▪ “Das Rudel rollt das runde Gerät auf dem Rasen herum”
        (“The pack enjoys rolling a round thing on the green grass”)
      ▪ “Die Hunde spielen auf der Wiese mit dem Ball”
        (“The dogs play with the ball on the meadow”)
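To make the limitation of lexical search concrete, here is a minimal sketch of a substring search that behaves like the `LIKE '%searchterm%'` clause on the slide. The function name `lexical_search` and the two-document list are assumptions for illustration, reusing the English example sentences from the slide.

```python
def lexical_search(documents, term):
    """Return the documents that contain `term` as a substring (case-insensitive)."""
    needle = term.lower()
    return [doc for doc in documents if needle in doc.lower()]


documents = [
    "The pack enjoys rolling a round thing on the green grass",
    "The dogs play with the ball on the meadow",
]

# Only the sentence that literally contains the word is found,
# although both sentences describe the same scene.
hits = lexical_search(documents, "ball")
print(hits)
```

The first sentence is never returned for "ball" or "dogs", which is the gap that semantic search is meant to close.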
  7. Semantic Search

    ▪ How to grasp “semantics”?
    ▪ Computers only calculate on numbers
    ▪ Computing is “applied mathematics”
    ▪ AI also only calculates on numbers
  8. Semantic Search

    ▪ We need a numeric representation of text
      ▪ Tokens
    ▪ We need a numeric representation of meaning
      ▪ Embeddings
  9. Tokens

    ▪ Similar to char tables (e.g. ASCII), just with larger elements
    ▪ Tokens are parts of text
      ▪ Words
      ▪ Syllables
      ▪ Punctuation
      ▪ …
    ▪ Tokens are translated to token IDs
    ▪ Example: https://platform.openai.com/tokenizer
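The idea of translating tokens to token IDs can be sketched with a toy tokenizer. This is an illustration only: it splits on whole words and punctuation, whereas production tokenizers typically work on learned subword units. The names `tokenize` and `build_vocabulary` are assumptions for this sketch.

```python
import re


def tokenize(text):
    """Split text into word tokens and single punctuation tokens."""
    return re.findall(r"\w+|[^\w\s]", text)


def build_vocabulary(tokens):
    """Assign each distinct token the next free integer ID, in order of first appearance."""
    vocabulary = {}
    for token in tokens:
        vocabulary.setdefault(token, len(vocabulary))
    return vocabulary


tokens = tokenize("The dogs play. The dogs rest.")
vocabulary = build_vocabulary(tokens)
token_ids = [vocabulary[token] for token in tokens]

print(tokens)     # ['The', 'dogs', 'play', '.', 'The', 'dogs', 'rest', '.']
print(token_ids)  # [0, 1, 2, 3, 0, 1, 4, 3]
```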
  10. Embedding (math.)

    ▪ Topological: a value of a high-dimensional space is “embedded” into a lower-dimensional space
    ▪ Natural / human language is very complex (high-dimensional)
    ▪ Task: Map high complexity to lower complexity / dimensions
    ▪ Injective function
    ▪ Similar to a hash, or a lossy compression
  11. Embeddings

    ▪ Embedding model (specialized ML model) converting text into a numeric representation of its meaning
    ▪ Representation is a vector in an n-dimensional space
      ▪ n floating point values
    ▪ OpenAI
      ▪ “text-embedding-ada-002” uses 1536 dimensions
      ▪ “text-embedding-3-small”: 512 and 1536
      ▪ “text-embedding-3-large”: 256, 1024 and 3072
    ▪ Huggingface models have a very wide range of dimensions
    https://huggingface.co/spaces/mteb/leaderboard & https://openai.com/blog/new-embedding-models-and-api-updates
  12. Embeddings

    ▪ Embedding models are unique
    ▪ Each dimension has a different meaning, individual to the model
    ▪ Vectors from different models are incompatible with each other
      ▪ They live in different vector spaces
    ▪ Some embedding models are multi-language, but not all
    ▪ In an LLM, the first step is also to embed the input into a lower-dimensional space
  13. What is a vector?

    ▪ Mathematical quantity with a direction and length
    $\vec{a} = \begin{pmatrix} a_x \\ a_y \end{pmatrix}$
    https://mathinsight.org/vector_introduction
  14. Vectors in 2D

    $\vec{a} = \begin{pmatrix} a_x \\ a_y \end{pmatrix}$
  15. Vectors in 3D

    $\vec{a} = \begin{pmatrix} a_x \\ a_y \\ a_z \end{pmatrix}$
  16. Vectors in multidimensional space

    $\vec{a} = \begin{pmatrix} a_u \\ a_v \\ a_w \\ a_x \\ a_y \\ a_z \end{pmatrix}$
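Since a vector is described here by its direction and length, a short sketch may help show how both are computed for any number of dimensions. Cosine similarity, which compares only the direction of two vectors, is a common way to compare embeddings; the helper names below are assumptions for illustration.

```python
import math


def dot(a, b):
    """Dot product of two vectors of equal dimension."""
    return sum(x * y for x, y in zip(a, b))


def length(v):
    """Euclidean length (magnitude) of a vector."""
    return math.sqrt(dot(v, v))


def cosine_similarity(a, b):
    """Cosine of the angle between a and b: 1.0 means same direction, 0.0 means orthogonal."""
    return dot(a, b) / (length(a) * length(b))


print(length([3.0, 4.0]))                                        # 5.0
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))                 # 0.0
print(cosine_similarity([1.0, 2.0, 3.0, 4.0], [2.0, 4.0, 6.0, 8.0]))
```

The last call compares two four-dimensional vectors that point in the same direction but differ in length, so the similarity is (up to rounding) 1.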
  17. Word2Vec (Mikolov et al., Google, 2013)

    $\text{Brother} - \text{Man} + \text{Woman} \approx \text{Sister}$
    [Diagram] Vectors for Man, Woman, Brother, Sister
    https://arxiv.org/abs/1301.3781
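The vector arithmetic on this slide can be tried out with hand-made toy vectors. The two-dimensional values below are invented for illustration (first dimension roughly "gender", second roughly "sibling vs. parent") and are not taken from a trained Word2Vec model; `analogy` is a hypothetical helper.

```python
import math

# Made-up 2D vectors, NOT real Word2Vec output.
vectors = {
    "man": (1.0, 0.0),
    "woman": (-1.0, 0.0),
    "brother": (1.0, 1.0),
    "sister": (-1.0, 1.0),
    "father": (1.0, -1.0),
    "mother": (-1.0, -1.0),
}


def analogy(a, b, c):
    """Return the word whose vector is nearest to a - b + c, ignoring the three input words."""
    target = tuple(x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c]))
    candidates = [word for word in vectors if word not in (a, b, c)]
    return min(candidates, key=lambda word: math.dist(vectors[word], target))


result = analogy("brother", "man", "woman")
print(result)  # sister
```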
  18. Embedding-Model

    ▪ Task: Create a vector from an input
      ▪ Extract meaning / semantics
    ▪ Embedding models usually are very shallow & fast
      ▪ Word2Vec is only two layers
    ▪ Similar to the first step of an LLM
      ▪ Convert text to values for the input layer
    ▪ This comparison is very simplified, but one could say:
      ▪ The embedding model ‘maps’ the meaning into the model’s ‘brain’
  19. Embedding-Model

    [  0.50451,   0.68607,  -0.59517,  -0.022801,  0.60046,
      -0.13498,  -0.08813,   0.47377,  -0.61798,  -0.31012,
      -0.076666,  1.493,    -0.034189, -0.98173,   0.68229,
       0.81722,  -0.51874,  -0.31503,  -0.55809,   0.66421,
       0.1961,   -0.13495,  -0.11476,  -0.30344,   0.41177,
      -2.223,    -1.0756,   -1.0783,   -0.34354,   0.33505,
       1.9927,   -0.04234,  -0.64319,   0.71125,   0.49159,
       0.16754,   0.34344,  -0.25663,  -0.8523,    0.1661,
       0.40102,   1.1685,   -1.0137,   -0.21585,  -0.15155,
       0.78321,  -0.91241,  -1.6106,   -0.64426,  -0.51042 ]
    http://jalammar.github.io/illustrated-word2vec/
  20. Important

    ▪ Select your embedding model carefully for your use case
    ▪ e.g.
      ▪ intfloat/multilingual-e5-large-instruct: ~ 50 %
      ▪ T-Systems-onsite/german-roberta-sentence-transformer-v2: < 70 %
      ▪ danielheinz/e5-base-sts-en-de: > 80 %
      ▪ BAAI/bge-m3: > 95 %
    ▪ Fine-tuning the embedding model might be an option
    ▪ As of now: Treat embedding models as exchangeable commodities!
  21. Recap Embeddings

    ▪ Embedding model: “Analog to digital converter for text”
    ▪ Embeds the high-dimensional natural language meaning into a lower-dimensional space (the model’s ‘brain’)
    ▪ No magic, just applied mathematics
    ▪ Math. representation: Vector of n dimensions
    ▪ Technical representation: array of floating point numbers
  22. Other use-cases:

    ▪ Similarity determination
    ▪ Semantic search
    ▪ Semantic routing
      ▪ Semantically determine the knowledge base / source for a query
    ▪ Semantic caching
      ▪ Can be used to cache answers for similar search queries
    ▪ Categorization
    ▪ Keyword determination by contextual similarity
    ▪ etc.
  23. Improvement Process

    ▪ We have a LOT of variables
      ▪ Chunk size
      ▪ Chunking strategy
      ▪ Embedding model
      ▪ Retrieval methods
        ▪ How many documents are retrieved (n)
        ▪ Embedding-only
        ▪ Hybrid search
        ▪ Reranking (yes / no, model, amount…)
      ▪ Plain RAG vs. Agentic RAG
      ▪ Potential transformations
      ▪ Potential knowledge graphs
  24. Improvement Process

    ▪ This is (computer) science and software engineering
    ▪ We need to know exactly what works and what does not work
    ▪ We need reproducible experiments
    ▪ We need to measure our stuff
  25. Improvement Process

    ▪ Create, maintain and compare metrics
    ▪ One possibility for Python: https://www.ragas.io/
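As a hedged illustration of what "measuring" a retrieval setup can look like, the sketch below computes a simple hit rate at k over a small labelled query set. It does not use the Ragas API; the function name, the query IDs and the document IDs are all invented for this example.

```python
def hit_rate_at_k(ranked_results, expected, k):
    """Fraction of queries whose expected document appears in the top k retrieved documents."""
    hits = sum(
        1
        for query, ranked in ranked_results.items()
        if expected[query] in ranked[:k]
    )
    return hits / len(ranked_results)


# Hypothetical retrieval output: query -> document IDs, best match first.
ranked_results = {
    "q1": ["d1", "d2", "d3"],
    "q2": ["d3", "d1", "d2"],
    "q3": ["d2", "d3", "d1"],
}
# Hypothetical ground truth: the document that actually answers each query.
expected = {"q1": "d1", "q2": "d1", "q3": "d1"}

for k in (1, 2, 3):
    print(k, hit_rate_at_k(ranked_results, expected, k))
```

Re-running the same labelled queries after changing one variable (chunk size, embedding model, n) gives a reproducible number to compare.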
  26. Indexing

    Consists of
    ▪ Loading
    ▪ Clean-up
    ▪ Splitting
    ▪ Embedding
    ▪ Storing
  27. Loading

    ▪ Import documents from different sources, in different formats
    ▪ LangChain has very strong support for loading data
      ▪ Support for cleanup
      ▪ Support for splitting
    https://python.langchain.com/docs/integrations/document_loaders
  28. Clean-up

    ▪ HTML Tags
    ▪ Formatting information
    ▪ Normalization
      ▪ lowercasing
      ▪ stemming, lemmatization
      ▪ remove punctuation & stop words
    ▪ Enrichment
      ▪ tagging
      ▪ keywords, categories
      ▪ metadata
  29. Splitting (Text Segmentation)

    ▪ Document is too large / too much content / not concise enough
    ▪ by size (text length)
    ▪ by character (\n\n)
    ▪ by paragraph, sentence, words (until small enough)
    ▪ by size (tokens)
    ▪ overlapping chunks (token-wise)
  30. Splitting / Chunking (Text Segmentation)

    ▪ Document is too large / too much content / not concise enough
    ▪ by size (text length)
    ▪ by character (\n\n)
    ▪ by paragraph, sentence, words (until small enough)
    ▪ by size (tokens)
    ▪ overlapping chunks (token-wise)
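Splitting by size with overlapping chunks can be sketched in a few lines. This example treats whitespace-separated words as stand-ins for tokens, which is a simplification; `split_with_overlap` and its parameters are assumptions for illustration.

```python
def split_with_overlap(tokens, chunk_size, overlap):
    """Split a token list into chunks of at most `chunk_size` tokens.

    Consecutive chunks share `overlap` tokens so that content near a
    chunk border appears in both neighbouring chunks.
    """
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the last chunk already reaches the end of the text
    return chunks


words = "a b c d e f g h i j".split()
chunks = split_with_overlap(words, chunk_size=4, overlap=1)
print(chunks)  # [['a', 'b', 'c', 'd'], ['d', 'e', 'f', 'g'], ['g', 'h', 'i', 'j']]
```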
  31. Semantic Chunking

    ▪ Every sentence gets an embedding
    ▪ Embeddings for each sentence are compared with each other
    ▪ When deviation is too large, we assume a meaning (topic) change
    ▪ At this border chunks are separated
    ▪ Needs a lot of vectors and comparisons
    ▪ Indexing gets slow & expensive
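A minimal sketch of this idea, under simplifying assumptions: the sentence embeddings are passed in precomputed (here as made-up 2D vectors rather than real model output), only neighbouring sentences are compared, and the similarity threshold of 0.8 is an arbitrary illustrative value.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equally sized vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def semantic_chunks(sentences, embeddings, threshold):
    """Group consecutive sentences; start a new chunk when similarity to the previous sentence drops below threshold."""
    if not sentences:
        return []
    chunks = [[sentences[0]]]
    for i in range(1, len(sentences)):
        if cosine_similarity(embeddings[i - 1], embeddings[i]) < threshold:
            chunks.append([])  # assumed topic change: open a new chunk
        chunks[-1].append(sentences[i])
    return chunks


sentences = ["Dogs play on the meadow.", "They chase a ball.",
             "Invoices are due in 30 days.", "Late payments incur a fee."]
# Made-up embeddings: the first two point one way, the last two another.
embeddings = [(1.0, 0.1), (0.9, 0.2), (0.1, 1.0), (0.2, 0.9)]

chunks = semantic_chunks(sentences, embeddings, threshold=0.8)
print(chunks)
```

With these toy vectors the split falls between the second and third sentence, where the similarity drops sharply.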
  32. Retrieval

    [Diagram] Query, Embedding-Model, Embedding (a, b, c, …), Vector-Database, Weighted result, (Answer generation)
    ▪ Query: “What is the name of the teacher?”
    ▪ Weighted result: Doc. 1: 0.86, Doc. 2: 0.84, Doc. 3: 0.79, …
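The retrieval step shown in the diagram can be approximated with a tiny in-memory stand-in for a vector database: score every stored vector against the query vector and return the best n. The vectors and document IDs below are invented, and cosine similarity is assumed as the scoring function; a real vector database would use an index instead of a full scan.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equally sized vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def retrieve(query_vector, index, n):
    """Return the n (doc_id, score) pairs with the highest similarity to the query vector."""
    scored = [(cosine_similarity(query_vector, vector), doc_id)
              for doc_id, vector in index.items()]
    scored.sort(reverse=True)
    return [(doc_id, score) for score, doc_id in scored[:n]]


# Hypothetical index: document ID -> embedding (made-up 2D values).
index = {"doc1": [1.0, 0.0], "doc2": [0.8, 0.6], "doc3": [0.0, 1.0]}
query_vector = [1.0, 0.2]  # would come from the same embedding model as the index

results = retrieve(query_vector, index, n=2)
for doc_id, score in results:
    print(doc_id, round(score, 2))
```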
  33. Not good enough?

    ▪ Semantic search still only uses your data
    ▪ It’s just as good as your embeddings
    ▪ All chunks need to be sized correctly and distinguishable enough
    ▪ Garbage in, garbage out
  34. HyDE (Hypothetical Document Embeddings)

    ▪ Search for a hypothetical document
    [Diagram] Query, LLM (e.g. GPT-3.5-turbo), Hypothetical Document, Embedding-Model, Embedding (a, b, c, …), Vector-Database, Weighted result
    ▪ Query: “What should I do, if I missed the last train?”
    ▪ Prompt: Write a company policy that contains all information which will answer the given question: {QUERY}
    ▪ Weighted result: Doc. 3: 0.86, Doc. 2: 0.81, Doc. 1: 0.81
    https://arxiv.org/abs/2212.10496
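The HyDE flow from the diagram can be sketched as a small function that wires three steps together. The LLM call, the embedding model and the vector search are passed in as plain callables; `generate`, `embed` and `search` are hypothetical placeholders, and the stubs at the bottom only exist to make the sketch runnable without any external service.

```python
HYDE_PROMPT = (
    "Write a company policy that contains all information "
    "which will answer the given question: {QUERY}"
)


def hyde_search(query, generate, embed, search, n=3):
    """Embed an LLM-written hypothetical document instead of the query itself."""
    hypothetical_document = generate(HYDE_PROMPT.format(QUERY=query))
    return search(embed(hypothetical_document), n)


# Stubs standing in for a real LLM, embedding model and vector database.
def stub_generate(prompt):
    return "Employees who miss the last train may book a taxi."


def stub_embed(text):
    return [float(len(text))]


def stub_search(vector, n):
    return [("Doc. 3", vector)][:n]


results = hyde_search(
    "What should I do, if I missed the last train?",
    stub_generate, stub_embed, stub_search,
)
print(results)
```

The point of the indirection is that the hypothetical document looks more like the stored chunks than the short question does, so its embedding tends to land closer to them.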
  35. What else?

    ▪ Downside of HyDE:
      ▪ Each request needs to be transformed through an LLM (slow & expensive)
      ▪ A lot of requests will probably be very similar to each other
      ▪ Each time a different hypothetical document is generated, even for an extremely similar request
      ▪ Leads to very different results each time
    ▪ Idea: Alternative indexing
      ▪ Transform the document, not the query
  36. Alternative Indexing: HyQE (Hypothetical Question Embedding)

    [Diagram] Chunk of Document, LLM (e.g. GPT-3.5-turbo), Transformed document, Embedding-Model, Embedding (a, b, c, …), Vector-Database
    ▪ Prompt: Write 3 questions, which are answered by the following document.
    ▪ Metadata: content of original chunk
  37. Alternative Indexing: Retrieval

    [Diagram] Query, Embedding-Model, Embedding (a, b, c, …), Vector-Database, Weighted result, Original document from metadata
    ▪ Query: “What should I do, if I missed the last train?”
    ▪ Weighted result: Doc. 3: 0.89, Doc. 1: 0.86, Doc. 2: 0.76
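The two HyQE slides (indexing generated questions, then returning the original chunk from metadata) can be combined into one small sketch. Everything here is a stand-in: `embed` is a toy bag-of-words vector instead of a real embedding model, `generate_questions` is a hypothetical callable that would wrap the LLM prompt, and the canned questions replace actual LLM output.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy stand-in for an embedding model: bag-of-words counts."""
    return Counter(re.findall(r"\w+", text.lower()))


def cosine_similarity(a, b):
    dot = sum(a[term] * b[term] for term in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def build_index(chunks, generate_questions):
    """Index the generated questions; keep the original chunk as metadata."""
    return [
        {"vector": embed(question), "metadata": {"chunk": chunk}}
        for chunk in chunks
        for question in generate_questions(chunk)
    ]


def retrieve(query, index):
    """Find the best-matching question and return the original chunk from its metadata."""
    query_vector = embed(query)
    best = max(index, key=lambda entry: cosine_similarity(query_vector, entry["vector"]))
    return best["metadata"]["chunk"]


chunks = [
    "Employees who miss the last train may book a taxi and claim the fare.",
    "Vacation requests must be filed two weeks ahead.",
]
# Canned questions standing in for LLM output.
canned = {
    chunks[0]: ["What should I do if I missed the last train?", "Who pays for a taxi at night?"],
    chunks[1]: ["How do I request vacation?", "When must I file vacation requests?"],
}

index = build_index(chunks, canned.get)
answer = retrieve("What should I do, if I missed the last train?", index)
print(answer)
```

Because the LLM work happens once at indexing time, the query path needs only one embedding call, which addresses the cost and variability concerns raised for HyDE.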
  38. Additional strategies

    ▪ Reranking (after retrieval): Retrieve with much higher n, rerank, then pick new top n
    ▪ Agentic RAG: Provide search as tool to LLM and let LLM determine what to search for, potentially refining search terms
    ▪ Add additional data sources, e.g. knowledge graphs
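The reranking strategy (retrieve more candidates than needed, rescore them, keep the best few) can be sketched independently of any particular reranking model. Both scoring functions below are hypothetical lookups with invented numbers; in practice the first would be the vector similarity and the second a slower, more precise reranker.

```python
def retrieve_and_rerank(candidates, first_stage_score, rerank_score, retrieve_n, final_n):
    """Shortlist the top `retrieve_n` by the fast score, then return the top `final_n` by the rerank score."""
    shortlist = sorted(candidates, key=first_stage_score, reverse=True)[:retrieve_n]
    return sorted(shortlist, key=rerank_score, reverse=True)[:final_n]


# Invented scores for four documents.
fast_scores = {"a": 0.9, "b": 0.8, "c": 0.7, "d": 0.1}
precise_scores = {"a": 0.2, "b": 0.9, "c": 0.95, "d": 1.0}

top = retrieve_and_rerank(
    candidates=list(fast_scores),
    first_stage_score=fast_scores.get,
    rerank_score=precise_scores.get,
    retrieve_n=3,
    final_n=2,
)
print(top)  # ['c', 'b']
```

Document "d" would have ranked first under the precise score but never reaches the reranker, which is why the first stage is run with a much higher n than the final result size.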
  39. Additional strategies

    ▪ Index different document depths (via LLM calls) in addition to detailed chunks
      ▪ Create a summary of the complete document
      ▪ Create a summary of each chapter
      ▪ Create a summary of each paragraph
    ▪ Allows for more general queries instead of nitty gritty detail questions only
  40. Retrieval-augmented generation (RAG): Indexing & (Semantic) search

    [Diagram] Indexing / Embedding: Text, Cleanup & Split, Embedding model, Embedding, Save, Vector DB
    [Diagram] QA: Question, Embedding model, Embedding, Query, Vector DB, Relevant Text, Question, LLM
  41. Recap: Not good enough?

    ▪ Tune text cleanup, segmentation, splitting
    ▪ HyDE or HyQE or alternative indexing
      ▪ How many questions?
      ▪ With or without summary
    ▪ Other approaches
      ▪ Only generate summary
      ▪ Extract “Intent” from user input and search by that
      ▪ Transform document and query to a common search embedding
      ▪ HyKSS: Hybrid Keyword and Semantic Search https://www.deg.byu.edu/papers/HyKSS.pdf
    ▪ Always evaluate approaches with your own data & queries
    ▪ The actual / final approach is more involved than it seems at first glance
  42. Conclusion

    ▪ Semantic search is a first and fast Generative AI business use-case
    ▪ Quality of results depends heavily on data quality and the preparation pipeline
    ▪ The RAG pattern can produce breathtakingly good results without the need for user training
  43. “Talk to your Data”: Signifikant bessere LLM-RAG-Lösungen durch Real-World Tipps

    Sebastian Gingter, Developer Consultant
    [email protected]
    Slides & Code: https://www.thinktecture.com/de/sebastian-gingter