

Mario Fusco
April 13, 2025

A look inside the LLM closed box: test, observe and evaluate your RAG assisted chatbot

Checking the correctness of an application with an exhaustive suite of unit and integration tests is a natural task for any respectable software developer. Such a test suite also comes with other advantages like documenting the expected behavior of the application and enabling a fast feedback loop. This is all relatively straightforward when the components of your software are entirely deterministic, but how can you achieve something similar when a key part of it has a probabilistic nature?

This probabilistic nature makes it even more important to observe and collect real user inputs from production to better understand user needs and automate the evaluation of your LLM-infused application.

This talk will show in practice how to test an LLM-infused application with a mix of deterministic assertions and an LLM-as-a-judge approach. It will also demonstrate how LangChain4j 1.0 allows us to extensively observe the behavior of this application and create a dataset out of the collected traces. Finally, this dataset will be used in an evaluation framework through which we will assess the performance of our RAG-assisted LLM chatbot on both its retrieval and generation stages.


Transcript

  1. A look inside the LLM closed box: test, observe and evaluate your RAG assisted chatbot, by Mario Fusco & Dimitris Kafetzis
  2. … but … there's an elephant in the room: how can you write reliable and deterministic tests for something that works on a statistical basis???
  3. What hallucinations are
     A hallucination is an inconsistency, and it can happen at different levels:
     - Inconsistency within output sentences: "Mario is tall thus he is the shortest person"
     - Inconsistency between input and output: "Generate formal text to announce to colleagues …" -> "Yo boyz!"
     - Factually wrong: "First man on the Moon in 2024"
  4. Why hallucinations happen
     An LLM is a black box able to hallucinate, and this happens for multiple reasons:
     - Partial/inconsistent training data: LLMs learn how to generalize from training data, assuming it is comprehensive (but we don't train them!)
     - The generation configuration can be "hallucination prone": sampling settings like temperature, top_k and top_p guide creativity (but we often want the LLM to be creative)
     - Context/input quality: the more specific it is, the better (and this we can control!)
  5. An unprecedented definition of malfunctioning
     Given their statistical nature, GenAI algorithms can hallucinate or make other spectacular mistakes even when they work very well. These mistakes cannot be fixed in the same way as a software bug.
  6. Why not fix an LLM's hallucinations by using another LLM to validate the outcome of the first? … this reminds me of a 20-year-old joke …
  7. LLM as a judge
     Using an LLM to judge another (or even the same) LLM sounds wrong, but it is a useful tool together with humans. It is critical to ask the right questions; it can detect many issues/hallucinations (but usually not do factual checking):
     - Style/Tone: "is the response using formal language?"
     - Bias: "does the response show prejudice against a group?"
     - Sentiment: "does the text have a positive sentiment?"
     - Consistency: "does the response stick to the provided context?"
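     For example, the first judge question above ("is the response using formal language?") could be phrased as an AI service in the same @RegisterAiService style the deck shows later (slide 13). This is only a sketch, not part of the deck: the interface name and prompt are made up, and it assumes the AI service can map a 'true'/'false' reply to a boolean return type.

         // Hypothetical LLM-as-a-judge check for style/tone (illustrative only).
         @RegisterAiService
         public interface StyleJudge {

             @UserMessage("""
                 Is the following response written in formal language?
                 Reply with only 'true' or 'false'.
                 Response: '{response}'
                 """)
             boolean isFormal(String response);
         }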
  8. Signal from tests: ❌ stuff needs fixing, ✅ confident to release. Purpose of tests: ❌ prevent breaking prod, ✅ continuously improve your app.
  9. Testing Applications
     Whenever we build a software system, we need to ensure that it behaves in a way that matches our intentions. With traditional systems, we do this primarily through testing: we provide a thoughtfully selected sample of inputs and verify that the system responds in the way we expect.
  10. Testing AI-Infused Applications
     With LLM-based systems, we encounter a system that no longer behaves deterministically. Such a system will provide different outputs to the same inputs on repeated requests. This doesn't mean we cannot examine its behavior to ensure it matches our intentions, but it does mean we have to think about it differently.
  11. Come on … how hard can it be? :)

     @ParameterizedTest
     @MethodSource("modelsSupportingImageInputs")
     @EnabledIf("supportsSingleImageInputAsPublicURL")
     protected void should_accept_single_image_as_public_URL(M model) {
         // given
         UserMessage userMessage = UserMessage.from(
                 TextContent.from("What do you see?"),
                 ImageContent.from(CAT_IMAGE_URL)
         );
         ChatRequest chatRequest = ChatRequest.builder()
                 .messages(userMessage)
                 .build();

         // when
         ChatResponse chatResponse = chat(model, chatRequest).chatResponse();

         // then
         AiMessage aiMessage = chatResponse.aiMessage();
         assertThat(aiMessage.text()).containsIgnoringCase("cat");
         assertThat(aiMessage.toolExecutionRequests()).isNull();
     }

     https://upload.wikimedia.org/wikipedia/commons/e/e9/Felis_silvestris_silvestris_small_gradual_decrease_of_quality.png
  12. Come on … how hard can it be? :) (same test as the previous slide, now failing)

     [ERROR] Failures:
     [ERROR]   OpenAiChatModelIT>AbstractBaseChatModelIT.should_accept_single_image_as_public_URL:1131
     Expecting actual:
       "I see an animal with a feline appearance. It has a distinctive striped coat and green eyes. The background appears blurred or out of focus."
     to contain:
       "cat"
     (ignoring case)
  13. A simple AI-infused app

     @RegisterAiService
     public interface CategoryRouter {

         @UserMessage("""
             Analyze the user request and categorize it as 'legal', 'medical' or 'technical'.
             Reply with only one of those words.
             The user request is: '{request}'.
             """)
         RequestCategory classify(String request);
     }

     public enum RequestCategory {
         LEGAL, MEDICAL, TECHNICAL, UNKNOWN
     }

     @GET
     @Path("request/{request}")
     @Produces(MediaType.TEXT_PLAIN)
     public String assist(String request) {
         return routerService.findExpert(request).apply(request);
     }

     public UnaryOperator<String> findExpert(String request) {
         return switch (categoryRouter.classify(request)) {
             case LEGAL -> legalExpert::chat;
             case MEDICAL -> medicalExpert::chat;
             case TECHNICAL -> technicalExpert::chat;
             default -> ignore -> "I cannot find an appropriate category.";
         };
     }
  14. … and how to test it

     @QuarkusTest
     @AiScorer
     public class CategoryRouterTest {

         @Inject
         CategoryRouter categoryRouter;

         @Test
         void testCategoryRouter(Scorer scorer) {
             Samples<RequestCategory> samples = new Samples<>(
                 EvaluationSample.<RequestCategory>builder().withName("Medical")
                     .withParameter("I broke my leg what should I do")
                     .withExpectedOutput(RequestCategory.MEDICAL).build(),
                 EvaluationSample.<RequestCategory>builder().withName("Legal")
                     .withParameter("I have been sued for tax evasion")
                     .withExpectedOutput(RequestCategory.LEGAL).build()
             );

             EvaluationReport report = scorer.evaluate(samples,
                 i -> categoryRouter.classify(i.get(0)),
                 (sample, output) -> sample.expectedOutput() == output);

             assertThat(report.score()).isEqualTo(100); // Expect full success
         }
     }

     <dependency>
         <groupId>io.quarkiverse.langchain4j</groupId>
         <artifactId>quarkus-langchain4j-testing-scorer-junit5</artifactId>
         <version>${quarkus.langchain4j.version}</version>
         <scope>test</scope>
     </dependency>
  15.-18. … and how to test it (same test code and dependency as the previous slide, built up incrementally)

     The Scorer is a utility that evaluates a set of samples against a function and a set of evaluation strategies:

     public interface EvaluationStrategy<T> {
         boolean evaluate(EvaluationSample<T> sample, T output);
     }
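     The EvaluationStrategy interface above can also be implemented by hand with a purely deterministic check. A minimal sketch, not part of the deck: the class name is made up, and it relies only on the interface shown above and the expectedOutput() accessor used in the test on slide 14.

         // Hypothetical strategy: pass when the actual output contains the expected
         // text, ignoring case.
         public class ContainsIgnoringCaseStrategy implements EvaluationStrategy<String> {

             @Override
             public boolean evaluate(EvaluationSample<String> sample, String output) {
                 return output != null
                     && output.toLowerCase().contains(sample.expectedOutput().toLowerCase());
             }
         }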
  19. Built-in Evaluation Strategies

     new SemanticSimilarityStrategy(embeddingModel, 0.85);

     <dependency>
         <groupId>io.quarkiverse.langchain4j</groupId>
         <artifactId>quarkus-langchain4j-testing-scorer-semantic-similarity</artifactId>
         <version>${quarkus.langchain4j.version}</version>
         <scope>test</scope>
     </dependency>

     The SemanticSimilarityStrategy evaluates the similarity between the actual output and the expected output using cosine similarity. It requires an embedding model and a minimum similarity threshold.
  20. Built-in Evaluation Strategies (continued)

     EvaluationStrategy<String> strategy = new AiJudgeStrategy(myModel, """
         You are an AI evaluating a response and the expected output.
         You need to evaluate if the model response is correct or not.
         Return true if the response is correct, false otherwise.
         Response to evaluate: {response}
         Expected output: {expected_output}
         """);

     <dependency>
         <groupId>io.quarkiverse.langchain4j</groupId>
         <artifactId>quarkus-langchain4j-testing-scorer-ai-judge</artifactId>
         <version>${quarkus.langchain4j.version}</version>
         <scope>test</scope>
     </dependency>

     The AiJudgeStrategy (LLM-as-a-judge) leverages an AI model to determine if the actual output matches the expected output.
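     A built-in strategy like the ones above would then take the place of the inline lambda passed to scorer.evaluate(...) on slide 14. This is only a sketch under the assumption that evaluate(...) accepts any EvaluationStrategy in that position; 'assistant' stands in for whatever AI service is under test, and 'scorer', 'samples' and 'embeddingModel' are assumed to exist as in the earlier examples.

         // Hypothetical wiring of a built-in strategy into the Scorer.
         EvaluationReport report = scorer.evaluate(
                 samples,
                 params -> assistant.chat(params.get(0)),
                 new SemanticSimilarityStrategy(embeddingModel, 0.85));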
  21. What is RAG?
     ❖ Retrieve information from a database
     ❖ Augment user query with retrieved data
     ❖ Generate content from the augmented context
  22. What is RAG and what does it mean to evaluate it?
     ❖ Retrieve information from a database
     ❖ Augment user query with retrieved data
     ❖ Generate content from the augmented context
     ❖ Retrieval evaluation
       ➢ Are the retrieved documents relevant? (Precision)
       ➢ How many of the relevant documents are retrieved? (Recall)
     ❖ Generation evaluation
       ➢ What is the quality of the final generated output?
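     To make the Precision and Recall questions above concrete, here is a minimal, library-free sketch (not from the deck) that compares retrieved chunks against ground-truth chunks by exact match; the scorers shown later on slide 32 use fuzzier comparisons.

         import java.util.Set;

         // Chunk-level precision and recall with exact-match comparison.
         public class RetrievalMetrics {

             // Fraction of retrieved chunks that are relevant.
             public static double precision(Set<String> retrieved, Set<String> relevant) {
                 long hits = retrieved.stream().filter(relevant::contains).count();
                 return retrieved.isEmpty() ? 0.0 : (double) hits / retrieved.size();
             }

             // Fraction of relevant chunks that were actually retrieved.
             public static double recall(Set<String> retrieved, Set<String> relevant) {
                 long hits = relevant.stream().filter(retrieved::contains).count();
                 return relevant.isEmpty() ? 0.0 : (double) hits / relevant.size();
             }
         }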
  23. What is a dataset?
     ❖ A curated set of data entries used to evaluate aspects of the application
     ❖ Usually the more we have the better, but quality is important

     Question: What is the capital of Germany?
     Retrieved Contexts: "Berlin is the capital and largest city of Germany."
     Ground Truth Answer: Berlin
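     In Java, an entry like the one above could be modelled as a simple record. This is only an assumption about the shape of the data: the accessor names mirror those used later in the demo code (query(), expectedContextResults()), but the real project may structure its dataset differently.

         import java.util.List;

         // Hypothetical shape of one dataset entry.
         public record DatasetEntry(
                 String query,                         // "What is the capital of Germany?"
                 List<String> expectedContextResults,  // ground-truth chunks
                 String groundTruthAnswer) {           // "Berlin"
         }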
  24. How to make a dataset
     ❖ Define
       ➢ aspects of the application you want to evaluate
       ➢ possible scenarios you want to test
       ➢ what data is needed to test them
     ❖ Collect samples of the data that best represent the scenarios from
       ➢ the real world
       ➢ through synthetic generation
     ❖ Aim for a dataset that is
       ➢ large enough to provide meaningful insights
       ➢ not so large that it becomes unwieldy
     ❖ Ensure that the data
       ➢ is of high quality
       ➢ accurately reflects the scenarios under evaluation
  25. How to make a dataset (initial log entry)

     {
       "insertId": "rroswqf9rk00w",
       "jsonPayload": {
         "message": "1a3efd1b-6aa2-41e2-92c2-56f58471acf3;;;a9a840e2-521c-4b3b-82a0-80bee1ca8122;;;question;;;If I get Input is too long for requested model when sending data to sonnet 3.5, how can I manage this? split the text and do more than call or how ?"
       },
       …
       "timestamp": "2024-10-14T07:40:44.646Z",
       "severity": "INFO",
       "labels": {
         "loggerName": "com.vaadin.demo.views.ChatView",
         "levelValue": "20000",
         "levelName": "INFO"
       },
       "logName": "projects/langchain4j/logs/application-log",
       "receiveTimestamp": "2024-10-14T07:40:44.833940511Z"
     }
     …
     {
       "insertId": "8d5lksf9xmvie",
       "jsonPayload": {
         "message": "1a3efd1b-6aa2-41e2-92c2-56f58471acf3;;;retrieved;;;tutorials\\1-chat-and-language-models.md;7;;integrations\\language-models\\jlama.md;7;;tutorials\\6-tools.md;6"
       },
       …
     },
     {
       "insertId": "3bya7df19n47j",
       "jsonPayload": {
         "message": "1a3efd1b-6aa2-41e2-92c2-56f58471acf3;;;a9a840e2-521c-4b3b-82a0-80bee1ca8122;;;answer;;;If you encounter the error \"Input is too long for requested model\" when sending data to the Sonnet 3.5 model, you can manage this by splitting the text into smaller chunks and making multiple calls to the model. Since LLMs do not maintain the state of the conversation, you'll need to keep track of the context yourself.\n\nHere’s how you can handle it:\n\n1. **Split the Text:** Break your input into smaller, manageable pieces that fit within the model's input limits.\n\n2. **Maintain Context:** When making multiple calls, ensure you keep the relevant context from previous interactions. This may involve including previous user and AI messages in each new request.\n\n3. **Use `UserMessage`:** Create a `UserMessage` for each chunk of text. You can use `new UserMessage(\"Your text here\")` or `UserMessage.from(\"Your text here\")` to do this.\n\n4. **Call the Model:** For each chunk, generate a response using the model, like so:\n ```java\n UserMessage userMessage = UserMessage.from(\"Your chunk of text here\");\n Response<AiMessage> response = model.generate(singletonList(userMessage), toolSpecifications);\n AiMessage aiMessage = response.content();\n ```\n\n5. **Combine Responses:** After processing all chunks, combine the responses to form a complete output.\n\nBy following these steps, you can effectively manage long inputs and maintain a coherent conversation with the model."
       },
       …
     }
  26. How to make a dataset (cleaned up DataSet entry)

     {
       "query": "What is langchain4j",
       "ground truth chunks": [
         {
           "chunk": "## LangChain4j Library Structure LangChain4j features a modular design, comprising: - The `langchain4j-core` module, which defines core abstractions (such as `ChatLanguageModel` and `EmbeddingStore`) and their APIs … You can use the `langchain4j-{integration}` modules independently. For additional features, simply import the main `langchain4j` dependency."
         },
         {
           "chunk": "LangChain4j began development in early 2023 amid the ChatGPT hype. We noticed a lack of Java counterparts to the numerous Python and JavaScript LLM libraries and frameworks, and we had to fix that! Although \"LangChain\" is in our name … For easier integration, LangChain4j also includes integration with [Quarkus](/tutorials/quarkus-integration) and [Spring Boot](/tutorials/spring-boot-integration)."
         }
       ],
       "answer": "LangChain4j is a Java library that aims to simplify integrating Language Model Models (LLMs) into Java applications. It features a modular design with different modules such as `langchain4j-core` for core abstractions, the main `langchain4j` module for useful tools, and various `langchain4j-{integration}` … The library is actively developed, with both low-level and high-level abstractions for interacting with LLMs. Additionally, LangChain4j includes integration with Quarkus and Spring Boot for easier integration."
     }
  27. Hand Picked vs Synthetic?
     Hand Picked
       Pros:
       + Better reflection of real world situations that the application might face, if based on real world data
       + Allows mapping edge cases
       + Ensures that every possible subject is covered
       Cons:
       - Time consuming
       - Expensive
     Synthetic
       Pros:
       + Low cost
       + Quick to generate
       + Better than starting from nothing
       Cons:
       - Might not cover everything
       - Possible hallucinations depending on the model used
       - Mostly low complexity entries
     It is possible to mix both strategies.
  28. Picking the scorers for your application

     Ground Truth text
     ---
     Here's how: 1. **Unified APIs:** LLM providers (like OpenAI or Google Vertex AI) and embedding (vector) stores (such as Pinecone or Milvus) use proprietary APIs. LangChain4j offers a unified API to avoid the need for learning and implementing specific APIs for each of them. To experiment with different LLMs or embedding stores, you can easily switch between them without the need to rewrite your code. LangChain4j currently supports [15+ popular LLM providers](/integrations/language-models/) and [20+ embedding stores](/integrations/embedding-stores/). 2. **Comprehensive Toolbox:** Since early 2023, the community has been building numerous LLM-powered applications, identifying common abstractions, patterns, and techniques. LangChain4j has refined these into a ready to use package. Our toolbox includes tools ranging from low-level prompt templating, chat memory management, and function calling to high-level patterns like AI Services and RAG.
     ---

     Retrieved texts
     ---
     1. **Unified APIs:** LLM providers (like OpenAI or Google Vertex AI) and embedding (vector) stores (such as Pinecone or Milvus) use proprietary APIs. LangChain4j offers a unified API to avoid the need for learning and implementing specific APIs for each of them. To experiment with different LLMs or embedding stores, you can easily switch between them without the need to rewrite your code. LangChain4j currently supports [15+ popular LLM providers](/integrations/language-models/) and [20+ embedding stores](/integrations/embedding-stores/).
     ---
     ---
     2. **Comprehensive Toolbox:** Since early 2023, the community has been building numerous LLM-powered applications, identifying common abstractions, patterns, and techniques. LangChain4j has refined these into a ready to use package. Our toolbox includes tools ranging from low-level prompt templating, chat memory management, and function calling to high-level patterns like AI Services and RAG.
     ---

     Similar content, but split in paragraphs
  29. Picking the scorers for your application

     Ground Truth text: (same as the previous slide)

     Retrieved texts
     ---
     1. **Unified APIs:** LLM providers (like OpenAI or Google Vertex AI) and embedding (vector) stores (such as Pinecone or Milvus) use proprietary APIs. LangChain4j offers a unified API to avoid the need for learning and implementing specific APIs for each of them. To experiment with different LLMs or embedding stores, you can easily switch between them without the need to rewrite your code. LangChain4j currently supports [15+ popular LLM providers](/integrations/language-models/) and [20+ embedding stores](/integrations/embedding-stores/).
     ---

     Missing content
  30. Picking the scorers for your application

     Ground Truth text: (same as slide 28)

     Retrieved texts
     ---
     LLM providers (like OpenAI or Google Vertex AI) and embedding (vector) stores (such as Pinecone or Milvus) use proprietary APIs.
     ---
     ---
     LangChain4j offers a unified API to avoid the need for learning and implementing specific APIs for each of them.
     ---
     ---
     To experiment with different LLMs or embedding stores, you can easily switch between them without the need to rewrite your code.
     ---
     ---
     LangChain4j currently supports [15+ popular LLM providers](/integrations/language-models/) and [20+ embedding stores](/integrations/embedding-stores/).
     ---

     Split even further, to sentences
  31. Picking the scorers for your application

     Expected Answer
     "LangChain4j is a Java library that aims to simplify integrating Language Model Models (LLMs) into Java applications. It features a modular design with different modules such as `langchain4j-core` for core abstractions, the main `langchain4j` module for useful tools, and various `langchain4j-{integration}` modules for integration with different LLM providers and embedding stores. The project started in early 2023 and is a fusion of ideas from various sources like LangChain, Haystack, LlamaIndex, and the broader community. The library is actively developed, with both low-level and high-level abstractions for interacting with LLMs. Additionally, LangChain4j includes integration with Quarkus and Spring Boot for easier integration."

     Returned Answer
     "LangChain4j is a Java library designed to streamline the integration of Language Model Models (LLMs) into Java applications. It boasts a modular architecture, featuring components like langchain4j-core for core abstractions, the primary langchain4j module with useful tools, and various langchain4j-{integration} modules for seamless integration with different LLM providers and embedding stores. Launched in early 2023, the project combines ideas from sources such as LangChain, Haystack, LlamaIndex, and the broader community. The library is under active development, offering both low-level and high-level abstractions for interacting with LLMs. Moreover, LangChain4j provides integration with frameworks like Quarkus and Spring Boot, making the setup process even easier."

     Semantically similar, but not word for word the same
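     A sketch of why a semantic-similarity scorer would accept the reworded answer above: embed both answers and compare the vectors with cosine similarity. The embedding model name is taken from the demo code later in the deck; the two string variables (shown here truncated with "...") and the 0.85 threshold are only placeholders.

         EmbeddingModel embeddingModel = new BgeSmallEnV15QuantizedEmbeddingModel();

         String expectedAnswer = "LangChain4j is a Java library that aims to simplify ...";  // expected answer above
         String returnedAnswer = "LangChain4j is a Java library designed to streamline ..."; // returned answer above

         float[] a = embeddingModel.embed(expectedAnswer).content().vector();
         float[] b = embeddingModel.embed(returnedAnswer).content().vector();

         // Cosine similarity computed directly from the two embedding vectors.
         double dot = 0, normA = 0, normB = 0;
         for (int i = 0; i < a.length; i++) {
             dot   += a[i] * b[i];
             normA += a[i] * a[i];
             normB += b[i] * b[i];
         }
         double cosine = dot / (Math.sqrt(normA) * Math.sqrt(normB));
         // Compared against a threshold such as the 0.85 used by
         // SemanticSimilarityStrategy on slide 19, paraphrases like these
         // would be expected to pass.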
  32. Picking the scorers for your application

     Document Retrieval
     ❖ Sentence Level Precision Recall
       ➢ Extract all sentences from all text chunks to calculate Precision and Recall
     ❖ RougeL based Recall
       ➢ Use RougeL between text chunks to calculate Recall
     ❖ FuzzyText Precision Recall
       ➢ Use weighted fuzzy text search between text chunks to calculate Precision and Recall
     ❖ Token Level Precision Recall
       ➢ Extract tokens from all text chunks to calculate Precision and Recall

     Generation 🔨
     ❖ RougeL Answer 🔨
       ➢ Calculate RougeL Precision, Recall and F1 between answers
     ❖ Bert Answer 🔨
       ➢ Use a Bert based model to calculate semantic similarity between answers
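     As an example of one of the retrieval scorers listed above, a RougeL-style recall between a ground-truth chunk and a retrieved chunk can be sketched as the longest common subsequence of tokens divided by the number of ground-truth tokens. Whitespace tokenization is a simplifying assumption here; the real evaluator may tokenize differently.

         // RougeL-style recall: LCS length over ground-truth token count.
         public static double rougeLRecall(String groundTruth, String retrieved) {
             String[] ref = groundTruth.toLowerCase().split("\\s+");
             String[] cand = retrieved.toLowerCase().split("\\s+");
             // Dynamic-programming table for the longest common subsequence.
             int[][] lcs = new int[ref.length + 1][cand.length + 1];
             for (int i = 1; i <= ref.length; i++) {
                 for (int j = 1; j <= cand.length; j++) {
                     lcs[i][j] = ref[i - 1].equals(cand[j - 1])
                             ? lcs[i - 1][j - 1] + 1
                             : Math.max(lcs[i - 1][j], lcs[i][j - 1]);
                 }
             }
             return ref.length == 0 ? 0.0 : (double) lcs[ref.length][cand.length] / ref.length;
         }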
  33. Evaluation Procedure
     1. Set up the application
     2. Determine the parameters we want to track, e.g.
        a. Size of chunks
        b. Chunk overlap
        c. Number of returned entries
     3. Run through each entry in the dataset and score the application
     4. Average the results
     5. Change the parameters
     6. Repeat from 3
     7. …
     8. Profit?
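     The procedure above is essentially a parameter sweep. A hedged sketch of the outer loop might look like the following, where evaluateDataset(...) is a hypothetical helper standing in for steps 3-4 and the parameter values are arbitrary examples.

         // Sweep a few retrieval parameters and report the averaged score for each combination.
         for (int chunkSize : new int[] {500, 1000, 2000}) {
             for (int maxResults : new int[] {3, 5, 10}) {
                 double averageRecall = evaluateDataset(chunkSize, /* overlap */ 0, maxResults);
                 System.out.printf("chunkSize=%d maxResults=%d avgRecall=%.3f%n",
                         chunkSize, maxResults, averageRecall);
             }
         }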
  34. Example evaluation
     Setup the retrieval part of our application

     public class PrecisionRecallEvaluationDemo {

         public static void main(String[] args) {
             // Load all documents from the lc4j documentation.
             List<Document> l4jDocuments = FileSystemDocumentLoader
                     .loadDocumentsRecursively("../langchain4j/docs/docs");

             // Split the documents.
             DocumentSplitter splitter = DocumentSplitters.recursive(1000, 0);
             List<TextSegment> segments = splitter.splitAll(l4jDocuments);

             EmbeddingModel embeddingModel = new BgeSmallEnV15QuantizedEmbeddingModel();
             List<Embedding> embeddings = embeddingModel.embedAll(segments).content();

             // Add them to the embedding store.
             EmbeddingStore<TextSegment> embeddingStore = new InMemoryEmbeddingStore<>();
             embeddingStore.addAll(embeddings, segments);
  35. Example evaluation
     Setup the retrieval part of our application (same code as the previous slide):
     • Load the documents
     • Split them into chunks
     • Calculate embeddings
     • Store them
  36. Example Evaluation
     Add the evaluator and somewhere to store the scores

     var fuzzyMatchingEvaluator = new FuzzyMatchingChunkEvaluator();
     HashMap<String, Double> averageFuzzyResults = new HashMap<>();
  37. Example Evaluation
     Use the embedding model and pass the results to the scorer to evaluate:
     • Get the embedding for the query
     • Get text chunks based on it
     • Score against the ground truth chunks of text
     • Add the results to calculate averages later

     for (DatasetEntry entry : Dataset.get()) {
         var queryEmbedding = embeddingModel.embed(entry.query()).content();
         var searchRequest = EmbeddingSearchRequest.builder()
                 .queryEmbedding(queryEmbedding).maxResults(5).build();
         var searchResult = embeddingStore.search(searchRequest);
         var fuzzyResults = fuzzyMatchingEvaluator.evaluate(
                 entry.expectedContextResults(),
                 searchResult.matches().stream()
                         .map(EmbeddingMatch::embedded).toList());
         for (String key : fuzzyResults.keySet()) {
             averageFuzzyResults.put(key,
                     averageFuzzyResults.getOrDefault(key, 0.0) + fuzzyResults.get(key));
         }
     }
  38. Example Evaluation
     Calculate the average score from the results collected before

     System.out.println("Average fuzzy results:");
     for (String key : averageFuzzyResults.keySet()) {
         System.out.println(key + ": " + averageFuzzyResults.get(key) / Dataset.get().size());
     }
  39. Example Evaluation
     Running the application:
     Precision = fraction of retrieved items that are relevant
     Recall = fraction of relevant items that are actually retrieved
     Higher Recall -> more of the context we want is actually retrieved from the store (so it should be prioritized at first)
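     As a worked example of the two definitions above: if the store returns 5 chunks for a query and 3 of them match ground-truth chunks, precision is 3/5 = 0.6; if the dataset entry lists 4 ground-truth chunks in total, recall is 3/4 = 0.75.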
  40. Key takeaways
     ❖ Evaluations are only as good as the data they use (garbage in / garbage out)
     ❖ 100% on every metric might not be possible
     ❖ Pick the aspects you want to improve and focus on them
     ❖ For some pairs of metrics, maximising one metric can negatively influence the other
     ❖ The dataset will change as the application evolves (e.g. changes to the text corpus, new scenarios to add)
  41. References
     ❖ Evals demo project - https://github.com/Dkafetzis/evals-demo
     ❖ Quarkus: Testing AI-Infused Applications - https://docs.quarkiverse.io/quarkus-langchain4j/dev/testing.html#
     ❖ Creating a LLM-as-a-Judge That Drives Business Results - https://hamel.dev/blog/posts/llm-judge/
     ❖ Your AI Product Needs Evals - https://hamel.dev/blog/posts/evals/
     ❖ A Field Guide to Rapidly Improving AI Products - https://hamel.dev/blog/posts/field-guide/
     ❖ Ragas Evaluation Dataset - https://docs.ragas.io/en/stable/concepts/components/eval_dataset/
     ❖ Emerging Patterns in Building GenAI Products - https://martinfowler.com/articles/gen-ai-patterns/