

Mario Fusco
April 13, 2025

A look inside the LLM closed box: test, observe and evaluate your RAG assisted chatbot

Checking the correctness of an application with an exhaustive suite of unit and integration tests is a natural task for any respectable software developer. Such a test suite also comes with other advantages like documenting the expected behavior of the application and enabling a fast feedback loop. This is all relatively straightforward when the components of your software are entirely deterministic, but how can you achieve something similar when a key part of it has a probabilistic nature?

This probabilistic nature makes it even more important to observe and collect real user inputs from production to better understand user needs and automate the evaluation of your LLM-infused application.

This talk will show in practice how to test an LLM-infused application with a mix of deterministic assertions and an LLM-as-a-judge approach. It will also demonstrate how LangChain4j 1.0 allows us to extensively observe the behavior of this application and create a dataset out of the collected traces. Finally, this dataset will be used in an evaluation framework through which we will assess the performance of our RAG-assisted LLM chatbot on both its retrieval and generation stages.


Transcript

  1. A look inside the LLM closed box: test, observe and evaluate your RAG assisted chatbot, by Mario Fusco & Dimitris Kafetzis
  2. … but … there's an elephant in the room: how can you write reliable and deterministic tests for something that works on a statistical basis???
  3. What hallucinations are
     A hallucination is an inconsistency, and it can happen at different levels:
     - Inconsistency within output sentences: "Mario is tall thus he is the shortest person"
     - Inconsistency between input and output: "Generate formal text to announce to colleagues …" -> "Yo boyz!"
     - Factually wrong: "First man on the Moon in 2024"
  4. Why hallucinations happen
     An LLM is a black box able to hallucinate, and this happens for multiple reasons:
     - Partial/inconsistent training data: LLMs learn how to generalize from training data, assuming it is comprehensive (but we don't train them!)
     - The generation configuration can be "hallucination prone": sampling settings like temperature, top_k and top_p guide creativity (but we often want the LLM to be creative)
     - Context/input quality: the more specific it is, the better (and this we can control!)
  5. An unprecedented definition of malfunctioning
     Given their statistical nature, GenAI algorithms can hallucinate or make other spectacular mistakes even when they work very well. These mistakes cannot be fixed in the same way as a software bug.
  6. Why not fix an LLM's hallucinations by using another LLM to validate the outcome of the first? … this reminds me of a 20-year-old joke …
  7. LLM as a judge
     Using an LLM to judge another (or even the same) LLM sounds wrong, but it is a useful tool together with humans. It is critical to ask the right questions; it can detect many issues/hallucinations (but usually not do factual checking):
     - Style/Tone: "is the response using formal language?"
     - Bias: "does the response show prejudice against a group?"
     - Sentiment: "does the text have a positive sentiment?"
     - Consistency: "does the response stick to the provided context?"
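     For example, the first judge question above ("is the response using formal language?") could be phrased as an AI service in the same @RegisterAiService style the deck shows later (slide 13). This is only a sketch, not part of the deck: the interface name and prompt are made up, and it assumes the AI service can map a 'true'/'false' reply to a boolean return type.

         // Hypothetical LLM-as-a-judge check for style/tone (illustrative only).
         @RegisterAiService
         public interface StyleJudge {

             @UserMessage("""
                 Is the following response written in formal language?
                 Reply with only 'true' or 'false'.
                 Response: '{response}'
                 """)
             boolean isFormal(String response);
         }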
  8. Signal from tests: ❌ stuff needs fixing, ✅ confident to release. Purpose of tests: ❌ prevent breaking prod, ✅ continuously improve your app.
  9. Testing Applications
     Whenever we build a software system, we need to ensure that it behaves in a way that matches our intentions. With traditional systems, we do this primarily through testing: we provide a thoughtfully selected sample of inputs and verify that the system responds in the way we expect.
  10. Testing AI-Infused Applications
     With LLM-based systems, we encounter a system that no longer behaves deterministically. Such a system will provide different outputs to the same inputs on repeated requests. This doesn't mean we cannot examine its behavior to ensure it matches our intentions, but it does mean we have to think about it differently.
  11. Come on … how hard can it be? :)

     @ParameterizedTest
     @MethodSource("modelsSupportingImageInputs")
     @EnabledIf("supportsSingleImageInputAsPublicURL")
     protected void should_accept_single_image_as_public_URL(M model) {
         // given
         UserMessage userMessage = UserMessage.from(
                 TextContent.from("What do you see?"),
                 ImageContent.from(CAT_IMAGE_URL)
         );
         ChatRequest chatRequest = ChatRequest.builder()
                 .messages(userMessage)
                 .build();

         // when
         ChatResponse chatResponse = chat(model, chatRequest).chatResponse();

         // then
         AiMessage aiMessage = chatResponse.aiMessage();
         assertThat(aiMessage.text()).containsIgnoringCase("cat");
         assertThat(aiMessage.toolExecutionRequests()).isNull();
     }

     https://upload.wikimedia.org/wikipedia/commons/e/e9/Felis_silvestris_silvestris_small_gradual_decrease_of_quality.png
  12. Come on … how hard can it be? :) (same test as the previous slide, now failing)

     [ERROR] Failures:
     [ERROR]   OpenAiChatModelIT>AbstractBaseChatModelIT.should_accept_single_image_as_public_URL:1131
     Expecting actual:
       "I see an animal with a feline appearance. It has a distinctive striped coat and green eyes. The background appears blurred or out of focus."
     to contain:
       "cat"
     (ignoring case)
  13. A simple AI-infused app

     @RegisterAiService
     public interface CategoryRouter {

         @UserMessage("""
             Analyze the user request and categorize it as 'legal', 'medical' or 'technical'.
             Reply with only one of those words.
             The user request is: '{request}'.
             """)
         RequestCategory classify(String request);
     }

     public enum RequestCategory {
         LEGAL, MEDICAL, TECHNICAL, UNKNOWN
     }

     @GET
     @Path("request/{request}")
     @Produces(MediaType.TEXT_PLAIN)
     public String assist(String request) {
         return routerService.findExpert(request).apply(request);
     }

     public UnaryOperator<String> findExpert(String request) {
         return switch (categoryRouter.classify(request)) {
             case LEGAL -> legalExpert::chat;
             case MEDICAL -> medicalExpert::chat;
             case TECHNICAL -> technicalExpert::chat;
             default -> ignore -> "I cannot find an appropriate category.";
         };
     }
  14. … and how to test it

     @QuarkusTest
     @AiScorer
     public class CategoryRouterTest {

         @Inject
         CategoryRouter categoryRouter;

         @Test
         void testCategoryRouter(Scorer scorer) {
             Samples<RequestCategory> samples = new Samples<>(
                 EvaluationSample.<RequestCategory>builder().withName("Medical")
                     .withParameter("I broke my leg what should I do")
                     .withExpectedOutput(RequestCategory.MEDICAL).build(),
                 EvaluationSample.<RequestCategory>builder().withName("Legal")
                     .withParameter("I have been sued for tax evasion")
                     .withExpectedOutput(RequestCategory.LEGAL).build()
             );

             EvaluationReport report = scorer.evaluate(samples,
                 i -> categoryRouter.classify(i.get(0)),
                 (sample, output) -> sample.expectedOutput() == output);

             assertThat(report.score()).isEqualTo(100); // Expect full success
         }
     }

     <dependency>
         <groupId>io.quarkiverse.langchain4j</groupId>
         <artifactId>quarkus-langchain4j-testing-scorer-junit5</artifactId>
         <version>${quarkus.langchain4j.version}</version>
         <scope>test</scope>
     </dependency>
  15.-18. … and how to test it (same test code and dependency as the previous slide, built up incrementally)

     The Scorer is a utility that evaluates a set of samples against a function and a set of evaluation strategies:

     public interface EvaluationStrategy<T> {
         boolean evaluate(EvaluationSample<T> sample, T output);
     }
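     The EvaluationStrategy interface above can also be implemented by hand with a purely deterministic check. A minimal sketch, not part of the deck: the class name is made up, and it relies only on the interface shown above and the expectedOutput() accessor used in the test on slide 14.

         // Hypothetical strategy: pass when the actual output contains the expected
         // text, ignoring case.
         public class ContainsIgnoringCaseStrategy implements EvaluationStrategy<String> {

             @Override
             public boolean evaluate(EvaluationSample<String> sample, String output) {
                 return output != null
                     && output.toLowerCase().contains(sample.expectedOutput().toLowerCase());
             }
         }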
  19. Built-in Evaluation Strategies

     new SemanticSimilarityStrategy(embeddingModel, 0.85);

     <dependency>
         <groupId>io.quarkiverse.langchain4j</groupId>
         <artifactId>quarkus-langchain4j-testing-scorer-semantic-similarity</artifactId>
         <version>${quarkus.langchain4j.version}</version>
         <scope>test</scope>
     </dependency>

     The SemanticSimilarityStrategy evaluates the similarity between the actual output and the expected output using cosine similarity. It requires an embedding model and a minimum similarity threshold.
  20. Built-in Evaluation Strategies (continued)

     EvaluationStrategy<String> strategy = new AiJudgeStrategy(myModel, """
         You are an AI evaluating a response and the expected output.
         You need to evaluate if the model response is correct or not.
         Return true if the response is correct, false otherwise.
         Response to evaluate: {response}
         Expected output: {expected_output}
         """);

     <dependency>
         <groupId>io.quarkiverse.langchain4j</groupId>
         <artifactId>quarkus-langchain4j-testing-scorer-ai-judge</artifactId>
         <version>${quarkus.langchain4j.version}</version>
         <scope>test</scope>
     </dependency>

     The AiJudgeStrategy (LLM-as-a-judge) leverages an AI model to determine if the actual output matches the expected output.
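     A built-in strategy like the ones above would then take the place of the inline lambda passed to scorer.evaluate(...) on slide 14. This is only a sketch under the assumption that evaluate(...) accepts any EvaluationStrategy in that position; 'assistant' stands in for whatever AI service is under test, and 'scorer', 'samples' and 'embeddingModel' are assumed to exist as in the earlier examples.

         // Hypothetical wiring of a built-in strategy into the Scorer.
         EvaluationReport report = scorer.evaluate(
                 samples,
                 params -> assistant.chat(params.get(0)),
                 new SemanticSimilarityStrategy(embeddingModel, 0.85));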
  21. What is RAG?
     ❖ Retrieve information from a database
     ❖ Augment user query with retrieved data
     ❖ Generate content from the augmented context
  22. What is RAG and what does it mean to evaluate it?
     ❖ Retrieve information from a database
     ❖ Augment user query with retrieved data
     ❖ Generate content from the augmented context
     ❖ Retrieval evaluation
       ➢ Are the retrieved documents relevant? (Precision)
       ➢ How many of the relevant documents are retrieved? (Recall)
     ❖ Generation evaluation
       ➢ What is the quality of the final generated output?
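     To make the Precision and Recall questions above concrete, here is a minimal, library-free sketch (not from the deck) that compares retrieved chunks against ground-truth chunks by exact match; the scorers shown later on slide 32 use fuzzier comparisons.

         import java.util.Set;

         // Chunk-level precision and recall with exact-match comparison.
         public class RetrievalMetrics {

             // Fraction of retrieved chunks that are relevant.
             public static double precision(Set<String> retrieved, Set<String> relevant) {
                 long hits = retrieved.stream().filter(relevant::contains).count();
                 return retrieved.isEmpty() ? 0.0 : (double) hits / retrieved.size();
             }

             // Fraction of relevant chunks that were actually retrieved.
             public static double recall(Set<String> retrieved, Set<String> relevant) {
                 long hits = relevant.stream().filter(retrieved::contains).count();
                 return relevant.isEmpty() ? 0.0 : (double) hits / relevant.size();
             }
         }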
  23. What is a dataset?
     ❖ A curated set of data entries used to evaluate aspects of the application
     ❖ Usually the more we have the better, but quality is important

     Question: What is the capital of Germany?
     Retrieved Contexts: "Berlin is the capital and largest city of Germany."
     Ground Truth Answer: Berlin
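     In Java, an entry like the one above could be modelled as a simple record. This is only an assumption about the shape of the data: the accessor names mirror those used later in the demo code (query(), expectedContextResults()), but the real project may structure its dataset differently.

         import java.util.List;

         // Hypothetical shape of one dataset entry.
         public record DatasetEntry(
                 String query,                         // "What is the capital of Germany?"
                 List<String> expectedContextResults,  // ground-truth chunks
                 String groundTruthAnswer) {           // "Berlin"
         }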
  24. How to make a dataset
     ❖ Define
       ➢ aspects of the application you want to evaluate
       ➢ possible scenarios you want to test
       ➢ what data is needed to test them
     ❖ Collect samples of the data that best represent the scenarios from
       ➢ the real world
       ➢ through synthetic generation
     ❖ Aim for a dataset that is
       ➢ large enough to provide meaningful insights
       ➢ not so large that it becomes unwieldy
     ❖ Ensure that the data
       ➢ is of high quality
       ➢ accurately reflects the scenarios under evaluation
  25. How to make a dataset (initial log entry)

     {
       "insertId": "rroswqf9rk00w",
       "jsonPayload": {
         "message": "1a3efd1b-6aa2-41e2-92c2-56f58471acf3;;;a9a840e2-521c-4b3b-82a0-80bee1ca8122;;;question;;;If I get Input is too long for requested model when sending data to sonnet 3.5, how can I manage this? split the text and do more than call or how ?"
       },
       …
       "timestamp": "2024-10-14T07:40:44.646Z",
       "severity": "INFO",
       "labels": {
         "loggerName": "com.vaadin.demo.views.ChatView",
         "levelValue": "20000",
         "levelName": "INFO"
       },
       "logName": "projects/langchain4j/logs/application-log",
       "receiveTimestamp": "2024-10-14T07:40:44.833940511Z"
     }
     …
     {
       "insertId": "8d5lksf9xmvie",
       "jsonPayload": {
         "message": "1a3efd1b-6aa2-41e2-92c2-56f58471acf3;;;retrieved;;;tutorials\\1-chat-and-language-models.md;7;;integrations\\language-models\\jlama.md;7;;tutorials\\6-tools.md;6"
       },
       …
     },
     {
       "insertId": "3bya7df19n47j",
       "jsonPayload": {
         "message": "1a3efd1b-6aa2-41e2-92c2-56f58471acf3;;;a9a840e2-521c-4b3b-82a0-80bee1ca8122;;;answer;;;If you encounter the error \"Input is too long for requested model\" when sending data to the Sonnet 3.5 model, you can manage this by splitting the text into smaller chunks and making multiple calls to the model. Since LLMs do not maintain the state of the conversation, you'll need to keep track of the context yourself.\n\nHere’s how you can handle it:\n\n1. **Split the Text:** Break your input into smaller, manageable pieces that fit within the model's input limits.\n\n2. **Maintain Context:** When making multiple calls, ensure you keep the relevant context from previous interactions. This may involve including previous user and AI messages in each new request.\n\n3. **Use `UserMessage`:** Create a `UserMessage` for each chunk of text. You can use `new UserMessage(\"Your text here\")` or `UserMessage.from(\"Your text here\")` to do this.\n\n4. **Call the Model:** For each chunk, generate a response using the model, like so:\n ```java\n UserMessage userMessage = UserMessage.from(\"Your chunk of text here\");\n Response<AiMessage> response = model.generate(singletonList(userMessage), toolSpecifications);\n AiMessage aiMessage = response.content();\n ```\n\n5. **Combine Responses:** After processing all chunks, combine the responses to form a complete output.\n\nBy following these steps, you can effectively manage long inputs and maintain a coherent conversation with the model."
       },
       …
     }
  26. How to make a dataset (cleaned up DataSet entry)

     {
       "query": "What is langchain4j",
       "ground truth chunks": [
         {
           "chunk": "## LangChain4j Library Structure LangChain4j features a modular design, comprising: - The `langchain4j-core` module, which defines core abstractions (such as `ChatLanguageModel` and `EmbeddingStore`) and their APIs … You can use the `langchain4j-{integration}` modules independently. For additional features, simply import the main `langchain4j` dependency."
         },
         {
           "chunk": "LangChain4j began development in early 2023 amid the ChatGPT hype. We noticed a lack of Java counterparts to the numerous Python and JavaScript LLM libraries and frameworks, and we had to fix that! Although \"LangChain\" is in our name … For easier integration, LangChain4j also includes integration with [Quarkus](/tutorials/quarkus-integration) and [Spring Boot](/tutorials/spring-boot-integration)."
         }
       ],
       "answer": "LangChain4j is a Java library that aims to simplify integrating Language Model Models (LLMs) into Java applications. It features a modular design with different modules such as `langchain4j-core` for core abstractions, the main `langchain4j` module for useful tools, and various `langchain4j-{integration}` … The library is actively developed, with both low-level and high-level abstractions for interacting with LLMs. Additionally, LangChain4j includes integration with Quarkus and Spring Boot for easier integration."
     }
  27. Hand Picked vs Synthetic?
     Hand Picked
       Pros:
       + Better reflection of real world situations that the application might face, if based on real world data
       + Allows mapping edge cases
       + Ensures that every possible subject is covered
       Cons:
       - Time consuming
       - Expensive
     Synthetic
       Pros:
       + Low cost
       + Quick to generate
       + Better than starting from nothing
       Cons:
       - Might not cover everything
       - Possible hallucinations depending on the model used
       - Mostly low complexity entries
     It is possible to mix both strategies.
  28. Picking the scorers for your application

     Ground Truth text
     ---
     Here's how: 1. **Unified APIs:** LLM providers (like OpenAI or Google Vertex AI) and embedding (vector) stores (such as Pinecone or Milvus) use proprietary APIs. LangChain4j offers a unified API to avoid the need for learning and implementing specific APIs for each of them. To experiment with different LLMs or embedding stores, you can easily switch between them without the need to rewrite your code. LangChain4j currently supports [15+ popular LLM providers](/integrations/language-models/) and [20+ embedding stores](/integrations/embedding-stores/). 2. **Comprehensive Toolbox:** Since early 2023, the community has been building numerous LLM-powered applications, identifying common abstractions, patterns, and techniques. LangChain4j has refined these into a ready to use package. Our toolbox includes tools ranging from low-level prompt templating, chat memory management, and function calling to high-level patterns like AI Services and RAG.
     ---

     Retrieved texts
     ---
     1. **Unified APIs:** LLM providers (like OpenAI or Google Vertex AI) and embedding (vector) stores (such as Pinecone or Milvus) use proprietary APIs. LangChain4j offers a unified API to avoid the need for learning and implementing specific APIs for each of them. To experiment with different LLMs or embedding stores, you can easily switch between them without the need to rewrite your code. LangChain4j currently supports [15+ popular LLM providers](/integrations/language-models/) and [20+ embedding stores](/integrations/embedding-stores/).
     ---
     ---
     2. **Comprehensive Toolbox:** Since early 2023, the community has been building numerous LLM-powered applications, identifying common abstractions, patterns, and techniques. LangChain4j has refined these into a ready to use package. Our toolbox includes tools ranging from low-level prompt templating, chat memory management, and function calling to high-level patterns like AI Services and RAG.
     ---

     Similar content, but split in paragraphs
  29. Picking the scorers for your application

     Ground Truth text: (same as the previous slide)

     Retrieved texts
     ---
     1. **Unified APIs:** LLM providers (like OpenAI or Google Vertex AI) and embedding (vector) stores (such as Pinecone or Milvus) use proprietary APIs. LangChain4j offers a unified API to avoid the need for learning and implementing specific APIs for each of them. To experiment with different LLMs or embedding stores, you can easily switch between them without the need to rewrite your code. LangChain4j currently supports [15+ popular LLM providers](/integrations/language-models/) and [20+ embedding stores](/integrations/embedding-stores/).
     ---

     Missing content
  30. Picking the scorers for your application

     Ground Truth text: (same as slide 28)

     Retrieved texts
     ---
     LLM providers (like OpenAI or Google Vertex AI) and embedding (vector) stores (such as Pinecone or Milvus) use proprietary APIs.
     ---
     ---
     LangChain4j offers a unified API to avoid the need for learning and implementing specific APIs for each of them.
     ---
     ---
     To experiment with different LLMs or embedding stores, you can easily switch between them without the need to rewrite your code.
     ---
     ---
     LangChain4j currently supports [15+ popular LLM providers](/integrations/language-models/) and [20+ embedding stores](/integrations/embedding-stores/).
     ---

     Split even further, to sentences
  31. Picking the scorers for your application

     Expected Answer
     "LangChain4j is a Java library that aims to simplify integrating Language Model Models (LLMs) into Java applications. It features a modular design with different modules such as `langchain4j-core` for core abstractions, the main `langchain4j` module for useful tools, and various `langchain4j-{integration}` modules for integration with different LLM providers and embedding stores. The project started in early 2023 and is a fusion of ideas from various sources like LangChain, Haystack, LlamaIndex, and the broader community. The library is actively developed, with both low-level and high-level abstractions for interacting with LLMs. Additionally, LangChain4j includes integration with Quarkus and Spring Boot for easier integration."

     Returned Answer
     "LangChain4j is a Java library designed to streamline the integration of Language Model Models (LLMs) into Java applications. It boasts a modular architecture, featuring components like langchain4j-core for core abstractions, the primary langchain4j module with useful tools, and various langchain4j-{integration} modules for seamless integration with different LLM providers and embedding stores. Launched in early 2023, the project combines ideas from sources such as LangChain, Haystack, LlamaIndex, and the broader community. The library is under active development, offering both low-level and high-level abstractions for interacting with LLMs. Moreover, LangChain4j provides integration with frameworks like Quarkus and Spring Boot, making the setup process even easier."

     Semantically similar, but not word for word the same
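     A sketch of why a semantic-similarity scorer would accept the reworded answer above: embed both answers and compare the vectors with cosine similarity. The embedding model name is taken from the demo code later in the deck; the two string variables (shown here truncated with "...") and the 0.85 threshold are only placeholders.

         EmbeddingModel embeddingModel = new BgeSmallEnV15QuantizedEmbeddingModel();

         String expectedAnswer = "LangChain4j is a Java library that aims to simplify ...";  // expected answer above
         String returnedAnswer = "LangChain4j is a Java library designed to streamline ..."; // returned answer above

         float[] a = embeddingModel.embed(expectedAnswer).content().vector();
         float[] b = embeddingModel.embed(returnedAnswer).content().vector();

         // Cosine similarity computed directly from the two embedding vectors.
         double dot = 0, normA = 0, normB = 0;
         for (int i = 0; i < a.length; i++) {
             dot   += a[i] * b[i];
             normA += a[i] * a[i];
             normB += b[i] * b[i];
         }
         double cosine = dot / (Math.sqrt(normA) * Math.sqrt(normB));
         // Compared against a threshold such as the 0.85 used by
         // SemanticSimilarityStrategy on slide 19, paraphrases like these
         // would be expected to pass.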
  32. Picking the scorers for your application

     Document Retrieval
     ❖ Sentence Level Precision Recall
       ➢ Extract all sentences from all text chunks to calculate Precision and Recall
     ❖ RougeL based Recall
       ➢ Use RougeL between text chunks to calculate Recall
     ❖ FuzzyText Precision Recall
       ➢ Use weighted fuzzy text search between text chunks to calculate Precision and Recall
     ❖ Token Level Precision Recall
       ➢ Extract tokens from all text chunks to calculate Precision and Recall

     Generation 🔨
     ❖ RougeL Answer 🔨
       ➢ Calculate RougeL Precision, Recall and F1 between answers
     ❖ Bert Answer 🔨
       ➢ Use a Bert based model to calculate semantic similarity between answers
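     As an example of one of the retrieval scorers listed above, a RougeL-style recall between a ground-truth chunk and a retrieved chunk can be sketched as the longest common subsequence of tokens divided by the number of ground-truth tokens. Whitespace tokenization is a simplifying assumption here; the real evaluator may tokenize differently.

         // RougeL-style recall: LCS length over ground-truth token count.
         public static double rougeLRecall(String groundTruth, String retrieved) {
             String[] ref = groundTruth.toLowerCase().split("\\s+");
             String[] cand = retrieved.toLowerCase().split("\\s+");
             // Dynamic-programming table for the longest common subsequence.
             int[][] lcs = new int[ref.length + 1][cand.length + 1];
             for (int i = 1; i <= ref.length; i++) {
                 for (int j = 1; j <= cand.length; j++) {
                     lcs[i][j] = ref[i - 1].equals(cand[j - 1])
                             ? lcs[i - 1][j - 1] + 1
                             : Math.max(lcs[i - 1][j], lcs[i][j - 1]);
                 }
             }
             return ref.length == 0 ? 0.0 : (double) lcs[ref.length][cand.length] / ref.length;
         }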
  33. Evaluation Procedure
     1. Set up the application
     2. Determine the parameters we want to track, e.g.
        a. Size of chunks
        b. Chunk overlap
        c. Number of returned entries
     3. Run through each entry in the dataset and score the application
     4. Average the results
     5. Change the parameters
     6. Repeat from 3
     7. …
     8. Profit?
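     The procedure above is essentially a parameter sweep. A hedged sketch of the outer loop might look like the following, where evaluateDataset(...) is a hypothetical helper standing in for steps 3-4 and the parameter values are arbitrary examples.

         // Sweep a few retrieval parameters and report the averaged score for each combination.
         for (int chunkSize : new int[] {500, 1000, 2000}) {
             for (int maxResults : new int[] {3, 5, 10}) {
                 double averageRecall = evaluateDataset(chunkSize, /* overlap */ 0, maxResults);
                 System.out.printf("chunkSize=%d maxResults=%d avgRecall=%.3f%n",
                         chunkSize, maxResults, averageRecall);
             }
         }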
  34. Example evaluation
     Setup the retrieval part of our application

     public class PrecisionRecallEvaluationDemo {

         public static void main(String[] args) {
             // Load all documents from the lc4j documentation.
             List<Document> l4jDocuments = FileSystemDocumentLoader
                     .loadDocumentsRecursively("../langchain4j/docs/docs");

             // Split the documents.
             DocumentSplitter splitter = DocumentSplitters.recursive(1000, 0);
             List<TextSegment> segments = splitter.splitAll(l4jDocuments);

             EmbeddingModel embeddingModel = new BgeSmallEnV15QuantizedEmbeddingModel();
             List<Embedding> embeddings = embeddingModel.embedAll(segments).content();

             // Add them to the embedding store.
             EmbeddingStore<TextSegment> embeddingStore = new InMemoryEmbeddingStore<>();
             embeddingStore.addAll(embeddings, segments);
  35. Example evaluation
     Setup the retrieval part of our application (same code as the previous slide):
     • Load the documents
     • Split them into chunks
     • Calculate embeddings
     • Store them
  36. Example Evaluation
     Add the evaluator and somewhere to store the scores

     var fuzzyMatchingEvaluator = new FuzzyMatchingChunkEvaluator();
     HashMap<String, Double> averageFuzzyResults = new HashMap<>();
  37. Example Evaluation
     Use the embedding model and pass the results to the scorer to evaluate:
     • Get the embedding for the query
     • Get text chunks based on it
     • Score against the ground truth chunks of text
     • Add the results to calculate averages later

     for (DatasetEntry entry : Dataset.get()) {
         var queryEmbedding = embeddingModel.embed(entry.query()).content();
         var searchRequest = EmbeddingSearchRequest.builder()
                 .queryEmbedding(queryEmbedding).maxResults(5).build();
         var searchResult = embeddingStore.search(searchRequest);
         var fuzzyResults = fuzzyMatchingEvaluator.evaluate(
                 entry.expectedContextResults(),
                 searchResult.matches().stream()
                         .map(EmbeddingMatch::embedded).toList());
         for (String key : fuzzyResults.keySet()) {
             averageFuzzyResults.put(key,
                     averageFuzzyResults.getOrDefault(key, 0.0) + fuzzyResults.get(key));
         }
     }
  38. Example Evaluation
     Calculate the average score from the results collected before

     System.out.println("Average fuzzy results:");
     for (String key : averageFuzzyResults.keySet()) {
         System.out.println(key + ": " + averageFuzzyResults.get(key) / Dataset.get().size());
     }
  39. Example Evaluation
     Running the application:
     Precision = fraction of retrieved items that are relevant
     Recall = fraction of relevant items that are actually retrieved
     Higher Recall -> more of the context we want is actually retrieved from the store (so it should be prioritized at first)
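     As a worked example of the two definitions above: if the store returns 5 chunks for a query and 3 of them match ground-truth chunks, precision is 3/5 = 0.6; if the dataset entry lists 4 ground-truth chunks in total, recall is 3/4 = 0.75.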
  40. Key takeaways
     ❖ Evaluations are only as good as the data they use (garbage in / garbage out)
     ❖ 100% on every metric might not be possible
     ❖ Pick the aspects you want to improve and focus on them
     ❖ For some pairs of metrics, maximising one metric can negatively influence the other
     ❖ The dataset will change as the application evolves (e.g. changes to the text corpus, new scenarios to add)
  41. References
     ❖ Evals demo project - https://github.com/Dkafetzis/evals-demo
     ❖ Quarkus: Testing AI-Infused Applications - https://docs.quarkiverse.io/quarkus-langchain4j/dev/testing.html#
     ❖ Creating a LLM-as-a-Judge That Drives Business Results - https://hamel.dev/blog/posts/llm-judge/
     ❖ Your AI Product Needs Evals - https://hamel.dev/blog/posts/evals/
     ❖ A Field Guide to Rapidly Improving AI Products - https://hamel.dev/blog/posts/field-guide/
     ❖ Ragas Evaluation Dataset - https://docs.ragas.io/en/stable/concepts/components/eval_dataset/
     ❖ Emerging Patterns in Building GenAI Products - https://martinfowler.com/articles/gen-ai-patterns/