Bucharest Tech Week 2026 - Reinventing testing practices in the AI era

@edeandrea

Java??? 😯 … no seriously … why not Python? 🤔
Because we are not data scientists. We integrate existing models into enterprise-grade systems and applications.
Do you really want to do these in Python?
• Transactions
• Security
• Scalability
• Observability

I don’t care if it works on your Jupyter notebook. We are not shipping your Jupyter notebook.

Who am I?
• Java Champion
• 27+ years software development experience
• Works on Open Source projects
  ◦ Quarkus
  ◦ LangChain4j, Quarkus LangChain4j
  ◦ Docling Java
  ◦ Langfuse Java, Quarkus Langfuse
  ◦ Spring Boot, Spring Framework, Spring Security
  ◦ Testcontainers
  ◦ WireMock
  ◦ Microcks
• Boston Java Users ACM Chapter Vice Chair & Board Member
• Published Author
• Cat lover

• Showcase & explain Quarkus, how it enables modern Java development & the Kubernetes-native experience
• Introduce familiar Spring concepts, constructs, & conventions and how they map to Quarkus
• Equivalent code examples between Quarkus and Spring as well as emphasis on testing patterns & practices
https://red.ht/quarkus-spring-devs

What are you hoping to learn here? What are you going to leave with?

What’s happening in industry?
• Standardization
  ◦ Or lack thereof (lots of competing standards)?
• Distributed
• Orchestrated
• Agentic
• Agents
• Agentic Agents
• Autonomous Agents
• Autonomous Agentic Agents
Smells like microservices?

How does your DevOps evolve when you infuse your applications with AI?

DevOps Evolution
[Diagram: loops labeled Dev, Ops, Data, and ML, with stages Plan, Code, Build, Test, Release, Deploy, Operate, Monitor, Train, Evaluate, Deploy, Collect, Evaluate, Curate, Analyze]

What’s the difference between these?
• CRUD application: Application, Database
• Microservice: Application, Service
• AI-Infused application: Application, Model

AI-Infused application Integration Points What’s the difference between these?

AI-Infused application Integration Points What’s the difference between these? What do we do?

[Testing pyramid diagram: unit tests, integration tests, end-to-end tests; annotations: low effort, high realism, high value]

Testing AI Interactions
https://youtu.be/S3XFFeTnILM
https://github.com/edeandrea/non-deterministic-no-problem/tree/langfuse/parasol-app

https://library.wiremock.org/catalog/api/o/openai.com/openai-com
https://mockgpt.wiremock.io
https://docs.quarkiverse.io/quarkus-wiremock/dev
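The links above point at ways to replace the real model endpoint with a stub during tests. As a rough sketch of that idea, here is a version that uses only the JDK's built-in HTTP server and client rather than WireMock's own API. The class name, the trimmed JSON body, and the /v1/chat/completions path are assumptions chosen for illustration, not taken from the talk.

```java
import com.sun.net.httpserver.HttpServer;
import java.net.InetSocketAddress;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;

// Hypothetical sketch: a JDK-only stand-in for a mocked model endpoint.
class StubbedModelEndpoint {

    // Canned reply, loosely shaped like a chat completion (heavily trimmed; an assumption).
    static final String CANNED_REPLY =
        "{\"choices\":[{\"message\":{\"role\":\"assistant\",\"content\":\"Your claim is approved.\"}}]}";

    // Starts a stub on a free local port that answers every request
    // under /v1/chat/completions with the canned reply.
    static HttpServer startStub() throws Exception {
        HttpServer server = HttpServer.create(new InetSocketAddress("localhost", 0), 0);
        server.createContext("/v1/chat/completions", exchange -> {
            exchange.getRequestBody().readAllBytes(); // drain the request before replying
            byte[] body = CANNED_REPLY.getBytes(StandardCharsets.UTF_8);
            exchange.getResponseHeaders().add("Content-Type", "application/json");
            exchange.sendResponseHeaders(200, body.length);
            try (var out = exchange.getResponseBody()) {
                out.write(body);
            }
        });
        server.start();
        return server;
    }

    // Stand-in for the application code under test: posts a prompt to whatever
    // base URL it is given. The prompt is assumed to need no JSON escaping.
    static String askModel(String baseUrl, String prompt) throws Exception {
        String json = "{\"messages\":[{\"role\":\"user\",\"content\":\"" + prompt + "\"}]}";
        HttpRequest request = HttpRequest.newBuilder(URI.create(baseUrl + "/v1/chat/completions"))
            .header("Content-Type", "application/json")
            .POST(HttpRequest.BodyPublishers.ofString(json))
            .build();
        return HttpClient.newHttpClient()
            .send(request, HttpResponse.BodyHandlers.ofString())
            .body();
    }

    public static void main(String[] args) throws Exception {
        HttpServer stub = startStub();
        try {
            String baseUrl = "http://localhost:" + stub.getAddress().getPort();
            System.out.println(askModel(baseUrl, "Is my claim approved?"));
        } finally {
            stub.stop(0);
        }
    }
}
```

Because the stub always returns the same body, assertions on the application's handling of the reply become deterministic. That covers the integration point, not the quality of a real model's answers.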

AI-Infused application: Integration Points
• Observability (metrics, tracing, logs, auditing)
• Fault Tolerance (timeout, bulkhead, circuit breaker, rate limiting, fallbacks, …)
What’s the difference between these?
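Two of the fault-tolerance items named on this slide, timeout and fallback, can be sketched with plain JDK classes. This is only an illustration under stated assumptions: the class and method names are hypothetical, and a real application would more likely lean on a fault-tolerance library than hand-roll this.

```java
import java.time.Duration;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;
import java.util.function.Supplier;

// Hypothetical sketch: guard a model call with a timeout and a fallback answer.
class GuardedModelCall {

    // Runs the call asynchronously. If it fails or does not finish within the
    // timeout, the fallback is returned instead. Note that orTimeout does not
    // cancel the underlying task; it only stops waiting for it.
    static String callWithTimeoutAndFallback(Supplier<String> modelCall,
                                             Duration timeout,
                                             String fallback) {
        return CompletableFuture.supplyAsync(modelCall)
            .orTimeout(timeout.toMillis(), TimeUnit.MILLISECONDS)
            .exceptionally(failure -> fallback)
            .join();
    }

    // Helper used to simulate a slow model.
    static void sleepQuietly(long millis) {
        try {
            Thread.sleep(millis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    public static void main(String[] args) {
        String fast = callWithTimeoutAndFallback(
            () -> "model answer", Duration.ofSeconds(1), "fallback answer");
        String slow = callWithTimeoutAndFallback(
            () -> { sleepQuietly(500); return "too late"; },
            Duration.ofMillis(100), "fallback answer");
        System.out.println(fast);
        System.out.println(slow);
    }
}
```

The same shape works for any remote dependency, which fits the slide's question about how a model integration differs from a database or service integration.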

This isn’t the answer!

Is that enough?

What does failure look like? What do we need to do differently?

https://www.upworthy.com/prankster-tricks-a-gm-dealership-chatbot-to-sell-him-a-76000-chevy-tahoe-for-1-rp3
https://www.cbsnews.com/news/aircanada-chatbot-discount-customer
https://www.bbc.com/news/technology-35902104
https://www.spiceworks.com/tech/artificial-intelligence/news/meta-blender-bot-3-controversy
https://www.linkedin.com/posts/stephanjanssen_princoming-activity-7285987635628507136-9Ubw

What happens when we do this?

What happens when we do this? Is this a failure?

[Testing pyramid diagram: unit tests, integration tests, end-to-end tests; annotations: low effort, high realism, tests with application server, test REST endpoints, tests using AI]

Observability
• Collect metrics
  ◦ Exposed in Prometheus format
  ◦ Track token usage & cost
• OpenTelemetry Tracing
  ◦ Trace interactions with the LLM
• Auditing
  ◦ Keep track of interactions with the LLM
  ◦ Ability to replay & re-score interactions
• Continuous evaluation
  ◦ Evaluate interactions in real time
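To make the "track token usage & cost" and auditing bullets a little more concrete, here is a minimal in-memory sketch. It is not how Quarkus, Prometheus, or OpenTelemetry record this data. The class, the record fields, and the per-1,000-token prices are all hypothetical and exist only to show what would be captured per interaction.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: an in-memory audit log of LLM interactions.
class InteractionAudit {

    // One audited interaction: what was asked, what came back, and token counts.
    record Interaction(String prompt, String response, int inputTokens, int outputTokens) {}

    private final List<Interaction> log = new ArrayList<>();

    // Stores an interaction so it can be inspected, replayed, or re-scored later.
    void capture(String prompt, String response, int inputTokens, int outputTokens) {
        log.add(new Interaction(prompt, response, inputTokens, outputTokens));
    }

    // Total tokens across all captured interactions.
    int totalTokens() {
        return log.stream().mapToInt(i -> i.inputTokens() + i.outputTokens()).sum();
    }

    // Estimated cost given assumed prices per 1,000 input and output tokens.
    double estimatedCost(double inputPricePer1k, double outputPricePer1k) {
        int in = log.stream().mapToInt(Interaction::inputTokens).sum();
        int out = log.stream().mapToInt(Interaction::outputTokens).sum();
        return in / 1000.0 * inputPricePer1k + out / 1000.0 * outputPricePer1k;
    }

    // Immutable snapshot of the log, for replay or re-scoring.
    List<Interaction> interactions() {
        return List.copyOf(log);
    }

    public static void main(String[] args) {
        InteractionAudit audit = new InteractionAudit();
        audit.capture("Summarize claim 1", "Summary A", 400, 200);
        audit.capture("Summarize claim 2", "Summary B", 600, 300);
        System.out.println("tokens: " + audit.totalTokens());
        System.out.println("cost: " + audit.estimatedCost(0.5, 1.5)); // made-up prices
    }
}
```

In a deployed system the same numbers would normally be emitted as metrics and trace attributes rather than held in memory; the sketch only shows the shape of the data.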

Re-Evaluations

Rescoring - Evaluation
https://docs.quarkiverse.io/quarkus-langchain4j/dev/testing.html#_evaluation
1. Sample
  ◦ The test case containing input parameters & expected output.
2. Function under test
  ◦ The function being evaluated. Receives input parameters & produces an actual output.
3. Evaluation Strategy
  ◦ Logic that determines if the actual output is acceptable based on the expected output.
4. Evaluation Result
  ◦ Outcome (pass/fail), score, explanation, and metadata from the evaluation.
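The four concepts above can be sketched in plain Java to show how they fit together. This is not the Quarkus LangChain4j testing API. Every type and method name below is hypothetical, the word-overlap strategy is a deliberately naive stand-in for a real evaluation strategy, and metadata is left out for brevity.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Set;
import java.util.function.Function;
import java.util.stream.Collectors;

// Hypothetical sketch of sample, function under test, strategy, and result.
class EvaluationSketch {

    // 1. Sample: input plus the expected output.
    record Sample(String input, String expectedOutput) {}

    // 4. Evaluation result: outcome, score, and explanation.
    record EvaluationResult(boolean passed, double score, String explanation) {}

    // 3. Evaluation strategy: decides whether an actual output is acceptable.
    interface EvaluationStrategy {
        EvaluationResult evaluate(Sample sample, String actualOutput);
    }

    // Lowercased set of words in a text, ignoring punctuation.
    static Set<String> words(String text) {
        return Arrays.stream(text.toLowerCase().split("\\W+"))
            .filter(w -> !w.isBlank())
            .collect(Collectors.toSet());
    }

    // Naive strategy: score is the fraction of expected words that appear in
    // the actual output; the sample passes when the score reaches the threshold.
    static EvaluationStrategy wordOverlap(double threshold) {
        return (sample, actual) -> {
            Set<String> expected = words(sample.expectedOutput());
            Set<String> produced = words(actual);
            long hits = expected.stream().filter(produced::contains).count();
            double score = expected.isEmpty() ? 0.0 : (double) hits / expected.size();
            return new EvaluationResult(score >= threshold, score,
                hits + " of " + expected.size() + " expected words found");
        };
    }

    // 2. Function under test: here any String -> String function, such as a
    // call to an AI service. Each sample is run through it and then scored.
    static List<EvaluationResult> evaluate(List<Sample> samples,
                                           Function<String, String> functionUnderTest,
                                           EvaluationStrategy strategy) {
        return samples.stream()
            .map(s -> strategy.evaluate(s, functionUnderTest.apply(s.input())))
            .toList();
    }

    public static void main(String[] args) {
        List<Sample> samples = List.of(
            new Sample("What is the refund policy?", "Refunds are issued within 30 days"));
        Function<String, String> fakeModel = q -> "Refunds are issued within 30 days of purchase.";
        evaluate(samples, fakeModel, wordOverlap(0.8)).forEach(System.out::println);
    }
}
```

Because the strategy is separate from the function under test, stored interactions can be re-scored later with a different strategy without calling the model again.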

Takeaways

Actual takeaways
• Naming things is still the hardest thing in computer science
• Java is still relevant
• Remember the testing pyramid! Use appropriate tools at each level!
• LangChain4j & Quarkus are awesome! They provide foundational building blocks!
• Don’t build observability into your apps - build it around your apps
• Test in production!
• Write tests, expect change and failure, deploy often
• AI is just an API call

Thank You!
Slides: https://bit.ly/ro-ai-testing
