A Harness for Behaviour: how to get AI to generate code that does what we intend, or "TDD in the age of AI"

This was my presentation at Platmosphere

Matteo Vaccari

May 27, 2026

Transcript

  1. But quality is declining

    © 2026 Thoughtworks | Confidential

    Nearly 30% of all merged code is AI-generated. While throughput is up, some teams face a 50% increase in defects.

    AI-generated code breaks more often and takes longer to fix: branch success rates dropped to 70.8%, the lowest in over five years.

    • 51% PR Size
    • 28% Bugs per PR
    • 5X Median Review Time
    • 3X Incidents per PR
    • 10X Code Churn

    Increased production incidents reported by 81% of respondents.

    Sources: CircleCI 2026 State of Software Delivery; DX AI-assisted engineering: Q1 impact report; Faros AI Engineering Report 2026; Cloudbees' State of Code Abundance 2026.
  3. Problem statement

    At the moment, most people who give high autonomy to their coding agents do this:

    • A functional specification
    • Check if:
      ◦ The AI-generated test suite is green,
      ◦ Has reasonably high coverage,
      ◦ Maybe monitor test quality with mutation testing
      ◦ Then do manual testing

    Birgitta Böckeler, Distinguished Engineer, Thoughtworks

    … Is this enough? 😱
  4. About me

    • 1998: PhD in Formal Methods
    • 2002: “Discovered” Extreme Programming
    • 2007-2014: XP coach
    • 2015-present: Technical Principal @ Thoughtworks
    • 2025-present: AI-assisted developer @ Thoughtworks
  5. The old TDD playbook

    Kent Beck invented TDD in 1999. Very effective for manual development.

    https://martinfowler.com/bliki/TestDrivenDevelopment.html
  6. AI does not like the TDD rules

    • AI wants to write tests after the implementation
    • AI wants to write all the tests at once

    You can force the AI to follow the strict TDD process, but it takes time and tokens. Is it worth it?
  7. AI tests cannot be trusted

    • AI likes to write tests against implementations
    • Then AI changes both tests and implementation, destroying our confidence that tests are protecting us

    “Do you trust our tests?” “Well, at times the AI changes both tests and code, and then…”
  8. Tests coupled to the implementation

    “Do you trust our tests?” “Well, at times the AI changes both tests and code, and then…”

    What I care about (ecommerce app):
    • at checkout,
    • generate backorders
    • for products that are low on inventory

    What the AI tests:
    • verify order creation with correct parameters
    • verify checkInventory called for each cart item (mock)
    • verify returns “orderdone”

    The AI tests that:
    • method A calls method B
    • method B calls method C
    • etc.
  9. The volume of generated tests is a problem

    Reading generated tests is even harder than reading generated code.
  10. We are all still figuring this out

    This presentation will be different next month.
  11. Automated tests are even more necessary

    Nearly 30% of all merged code is AI-generated. While throughput is up, some teams face a 50% increase in defects.

    AI-generated code breaks more often and takes longer to fix: branch success rates dropped to 70.8%, the lowest in over five years.

    • 51% PR Size
    • 28% Bugs per PR
    • 5X Median Review Time
    • 3X Incidents per PR
    • 10X Code Churn

    Increased production incidents reported by 81% of respondents.

    Sources: CircleCI 2026 State of Software Delivery; DX AI-assisted engineering: Q1 impact report; Faros AI Engineering Report 2026; Cloudbees' State of Code Abundance 2026.
  12. A spec without tests is not a spec

    The old Extreme Programming playbook recommends writing examples of the expected behaviour. We call them acceptance tests.
  13. Customer tests to the rescue

    Customer tests, aka acceptance tests, are a practice of Extreme Programming.
  14. Test desiderata

    |                               | AI-generated unit tests | Customer tests |
    | ----------------------------- | ----------------------- | -------------- |
    | Predict success in production | No                      | Yes            |
    | Fast                          | Yes                     | It depends     |
    | Support refactoring           | No                      | Yes            |
    | Low total cost of ownership   | No                      | Yes            |

    See Emily Bache, Test Desiderata 2.0
  15. Problem: tests are hard to read

        @Test
        void defaultGreeting() throws IOException, InterruptedException {
            HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create(baseUrl + "/hello"))
                .GET()
                .build();
            HttpResponse<String> response =
                client.send(request, HttpResponse.BodyHandlers.ofString());
            assertEquals(200, response.statusCode());
            assertEquals("Hello, world!", response.body());
        }

        @Test
        void personalisedGreeting() throws IOException, InterruptedException {
            HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create(baseUrl + "/hello?name=Joe"))
                .GET()
                .build();
            HttpResponse<String> response =
                client.send(request, HttpResponse.BodyHandlers.ofString());
            assertEquals(200, response.statusCode());
            assertEquals("Hello, Joe!", response.body());
        }
  17. Solution: focus on what matters

        @Test
        void defaultGreeting() throws IOException, InterruptedException {
            HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create(baseUrl + "/hello"))
                .GET()
                .build();
            HttpResponse<String> response =
                client.send(request, HttpResponse.BodyHandlers.ofString());
            assertEquals(200, response.statusCode());
            assertEquals("Hello, world!", response.body());
        }

        @Test
        void personalisedGreeting() throws IOException, InterruptedException {
            HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create(baseUrl + "/hello?name=Joe"))
                .GET()
                .build();
            HttpResponse<String> response =
                client.send(request, HttpResponse.BodyHandlers.ofString());
            assertEquals(200, response.statusCode());
            assertEquals("Hello, Joe!", response.body());
        }

    ☝ What matters in all of the above:

        /hello → "Hello, world!"
        /hello?name=Joe → "Hello, Joe!"
  19. The new playbook

        - description: default greeting
          request:
            path: /hello
          expected_response:
            body: "Hello, world!"
        - description: personalised greeting
          request:
            path: /hello
            query:
              name: Joe
          expected_response:
            body: "Hello, Joe!"
        - description: name parameter empty
          request:
            path: /hello
            query:
              name: ""
          expected_response:
            body: "Hello, stranger!"

    • Most tests should be tabular
    • Most test tables should be text files

    Process:
    1. The AI builds the test runner
    2. The AI writes the first draft of the test table
    3. You check and integrate the table
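A minimal sketch of what the test runner from step 1 could look like. It assumes the YAML table has already been parsed into a list of dicts (a YAML library would do that step), and it uses a hypothetical in-process `hello_app` function as a stand-in for the real service. The names `TEST_TABLE`, `hello_app` and `run_table` are illustrative and do not come from the talk.

```python
from urllib.parse import parse_qs, urlencode, urlparse

# The test table from the slide, as it might look after parsing the YAML file.
TEST_TABLE = [
    {"description": "default greeting",
     "request": {"path": "/hello"},
     "expected_response": {"body": "Hello, world!"}},
    {"description": "personalised greeting",
     "request": {"path": "/hello", "query": {"name": "Joe"}},
     "expected_response": {"body": "Hello, Joe!"}},
    {"description": "name parameter empty",
     "request": {"path": "/hello", "query": {"name": ""}},
     "expected_response": {"body": "Hello, stranger!"}},
]


def hello_app(url: str) -> str:
    """Hypothetical stand-in for the system under test."""
    parsed = urlparse(url)
    params = parse_qs(parsed.query, keep_blank_values=True)
    if parsed.path != "/hello":
        return "Not found"
    if "name" not in params:
        return "Hello, world!"
    name = params["name"][0]
    return f"Hello, {name}!" if name else "Hello, stranger!"


def run_table(table, app):
    """Run every row against `app` and return a list of failure messages."""
    failures = []
    for row in table:
        request = row["request"]
        url = request["path"]
        if "query" in request:
            url += "?" + urlencode(request["query"])
        actual = app(url)
        expected = row["expected_response"]["body"]
        if actual != expected:
            failures.append(
                f"{row['description']}: expected {expected!r}, got {actual!r}"
            )
    return failures
```

Adding a test case means adding a row to the table; the runner itself does not change, which keeps AI edits to the expectations easy to spot in a diff.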
  20. The idea is not new: Go

    In Go, most tests are tabular. These are tests for the “fmt” package:

        var fmtTests = []struct {
            fmt string
            val any
            out string
        }{
            {"%d", 12345, "12345"},
            {"%v", 12345, "12345"},
            {"%t", true, "true"},
            // basic string
            {"%s", "abc", "abc"},
            {"%q", "abc", `"abc"`},
            {"%x", "abc", "616263"},
            {"%x", "\xff\xf0\x0f\xff", "fff00fff"},
            {"%X", "\xff\xf0\x0f\xff", "FFF00FFF"},
            {"%x", "", ""},
            {"% x", "", ""},
            {"%#x", "", ""},
            {"%# x", "", ""},
            {"%x", "xyz", "78797a"},
            {"%X", "xyz", "78797A"},
            {"% x", "xyz", "78 79 7a"},
            {"% X", "xyz", "78 79 7A"},
            {"%#x", "xyz", "0x78797a"},
            {"%#X", "xyz", "0X78797A"},
            {"%# x", "xyz", "0x78 0x79 0x7a"},
            {"%# X", "xyz", "0X78 0X79 0X7A"},
            // … (the excerpt on the slide stops here)
        }

    Each row is {format, input, expected}, for example {"%q", "abc", `"abc"`}.
  21. The idea is not new: Fit

    The approach is similar to Ward Cunningham's Fit, which interprets document tables as tests.

    http://fit.c2.com/
  22. The idea is not new: Specification by example

    > Premium customers get free shipping for large orders except for oversized items.

    | Customer | Order Total | Contains Oversized Item | Expected Shipping |
    | -------- | ----------- | ----------------------- | ----------------- |
    | Standard | €40         | No                      | €6.99             |
    | Standard | €120        | No                      | €0.00             |
    | Premium  | €40         | No                      | €0.00             |
    | Premium  | €40         | Yes                     | €14.99            |
    | Premium  | €120        | Yes                     | €14.99            |

    BDD and Specification by example are about tests as communication.
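As an illustration of treating such a table as an executable test, here is a small sketch that parses the pipe table and checks each row against a shipping function. The function `shipping_cost` and the constant `FREE_SHIPPING_THRESHOLD` are hypothetical: the table only shows that €40 is below the free-shipping threshold for standard customers and €120 is above it, so the value 100 is an assumption.

```python
from decimal import Decimal

SHIPPING_TABLE = """
| Customer | Order Total | Contains Oversized Item | Expected Shipping |
| -------- | ----------- | ----------------------- | ----------------- |
| Standard | €40         | No                      | €6.99             |
| Standard | €120        | No                      | €0.00             |
| Premium  | €40         | No                      | €0.00             |
| Premium  | €40         | Yes                     | €14.99            |
| Premium  | €120        | Yes                     | €14.99            |
"""

# Assumed value: the table does not state the real threshold.
FREE_SHIPPING_THRESHOLD = Decimal("100")


def parse_table(text):
    """Turn a pipe table into a list of dicts keyed by the header cells."""
    lines = [line.strip() for line in text.strip().splitlines()]
    rows = [[cell.strip() for cell in line.strip("|").split("|")] for line in lines]
    header, body = rows[0], rows[2:]  # rows[1] is the |---| separator line
    return [dict(zip(header, row)) for row in body]


def euros(text):
    """Convert a cell such as '€6.99' into a Decimal amount."""
    return Decimal(text.lstrip("€"))


def shipping_cost(customer, order_total, oversized):
    """Hypothetical implementation under test, written to satisfy the table."""
    if oversized:
        return Decimal("14.99")
    if customer == "Premium" or order_total >= FREE_SHIPPING_THRESHOLD:
        return Decimal("0.00")
    return Decimal("6.99")


def check_table(text):
    """Return the rows whose expected shipping differs from the computed one."""
    failed = []
    for row in parse_table(text):
        actual = shipping_cost(
            row["Customer"],
            euros(row["Order Total"]),
            row["Contains Oversized Item"] == "Yes",
        )
        if actual != euros(row["Expected Shipping"]):
            failed.append(row)
    return failed
```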
  23. A real-life backend test…

        - id: "custom reports available to buyer"
          request:
            method_and_path: "GET /api/report/custom-reports/"
            auth_email: "[email protected]"
            headers: { "x-Role": "buyer" }
          response:
            status: 200
            json:
              count: 1
              data:
                - ID: "custom-{{RANDOM_ID}}"
                  NAME: "Custom"
                  <IGNORE_EXTRA_KEYS>: true
        - id: "custom reports hidden to e2e"
          request:
            method_and_path: "GET /api/report/custom-reports/"
            auth_email: "[email protected]"
            headers: { "x-Role": "e2e" }
          response:
            status: 403
            json:
              error: "Forbidden"
              message: "Report non disponibili"

    Diagram labels: Test, DB, HTTP API, Backend
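The `<IGNORE_EXTRA_KEYS>` marker suggests that the runner compares the expected JSON to the actual response loosely. The talk does not show how this works, so the following is only a guess at one possible matcher: it assumes the marker sits inside the object whose extra keys should be ignored, and it does not attempt to handle the `{{RANDOM_ID}}` placeholder. The function name `matches` is hypothetical.

```python
IGNORE_EXTRA_KEYS = "<IGNORE_EXTRA_KEYS>"


def matches(expected, actual):
    """Compare an expected JSON value to an actual one.

    Dicts must have the same keys unless the expected dict carries the
    IGNORE_EXTRA_KEYS marker, in which case extra actual keys are allowed.
    Lists must have the same length and match element by element.
    """
    if isinstance(expected, dict):
        if not isinstance(actual, dict):
            return False
        ignore_extra = expected.get(IGNORE_EXTRA_KEYS, False)
        wanted = {k: v for k, v in expected.items() if k != IGNORE_EXTRA_KEYS}
        if not ignore_extra and set(actual) != set(wanted):
            return False
        return all(k in actual and matches(v, actual[k]) for k, v in wanted.items())
    if isinstance(expected, list):
        return (
            isinstance(actual, list)
            and len(actual) == len(expected)
            and all(matches(e, a) for e, a in zip(expected, actual))
        )
    return expected == actual
```

With a matcher like this, a test row only lists the fields the author cares about, and the rest of the response can change without breaking the test.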
  24. …and a frontend test

        name: "Audit log page"
        start:
          url: "/admin/audit-log"
        stubs:
          - method: "GET"
            path: "/api/admin/audit-log"
            fixture: "audit-log.json"
        steps:
          # Wait for table to load
          - action: "waitFor"
            target: { text: "Creato nuovo mondo" }
          # Table is sorted by date
          - action: "assert"
            assertions:
              - type: "contents"
                target: { role: "table" }
                expected_contents:
                  - ["When", "Action", "Detail", "Author"]
                  - ["12/05/2026 08:51:04", "CREATE", "Created 'new page'", "[email protected]"]
                  - ["13/05/2026 09:51:04", "UPDATE", "Updated 'new page'", "[email protected]"]
                  - ["14/05/2026 10:51:04", "DELETE", "Deleted 'new page'", "[email protected]"]
          - action: "click"
            target: { role: "button", name: "Sort by Action" }
          # Now table is sorted by action
          - action: "assert"
            assertions:
              - type: "contents"
                target: { role: "table" }
                expected_contents:
                  - ["When", "Action", "Detail", "Author"]
                  - ["12/05/2026 08:51:04", "CREATE", "Created 'new page'", "[email protected]"]
                  - ["14/05/2026 10:51:04", "DELETE", "Deleted 'new page'", "[email protected]"]
                  - ["13/05/2026 09:51:04", "UPDATE", "Updated 'new page'", "[email protected]"]

    Diagram labels: Test, Stubbed API, jsdom, Frontend
  25. Some things are hard to explain

    (Image: excerpt from the Memoir '44 rules)
  26. 💡 Invent a testing mini-language

        - name: line of sight blocked by woods
          board:
            - ". . . . . "
            - " . B . . ."
            - "A . W C . "
            - " . . . . ."
          units:
            - id: A
              side: Allies
              type: Infantry
            - id: B
              side: Axis
              type: Infantry
              terrain: Woods
            - id: C
              side: Axis
              type: Infantry
          assert_available_actions:
            - Battle A to B
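To give an idea of how little code a mini-language like this needs on the runner side, here is a sketch of parsing the `board` rows into positions. It assumes that `.` is an empty hex and that every other token is a marker referenced elsewhere in the test; the function name `parse_board` is hypothetical, and the actual line-of-sight rules of the game are not modelled.

```python
BOARD = [
    ". . . . . ",
    " . B . . .",
    "A . W C . ",
    " . . . . .",
]


def parse_board(rows):
    """Map each marker on the board to its (row, column) position.

    Rows that start with a space are shifted by half a hex on the drawing;
    the column index here simply counts tokens within the row.
    """
    positions = {}
    for row_index, row in enumerate(rows):
        for col_index, cell in enumerate(row.split()):
            if cell != ".":
                positions[cell] = (row_index, col_index)
    return positions
```

A runner would pass these positions, together with the `units` list, to the game engine and compare the engine's available actions with `assert_available_actions`.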
  27. Actually I copied this from Ivett

    See Ivett Ördög's Approved Scenarios pattern and video.
  31. The core idea

    1. Tests should express what you care about, so that you can understand them
    2. Tests should be tabular, so that it's easy to add tests
    3. Tests should be data, not code, so that I can easily spot when AI is changing them
    4. Black-box testing, so that the tests are not coupled to the implementation
  32. What about the test pyramid?

    Pyramid levels, from base to top: Unit tests, Service tests, API tests, GUI tests. The base is fast and cheap; the top is slow and expensive.
  35. Unit tests still have a place

    Diagram labels: Test, Component A, Component B, Component C

    We want to test interesting behaviour of C, but we must “persuade” A and B to set up C the way we want: the “build a ship in a bottle” effect.
  36. Customer tests implemented as unit tests

        tests:
          - name: "is_ip_allowed respects enabled flag"
            allowlist:
              - { id: 1, ip_or_cidr: "10.0.0.1/32", is_enabled: true }
              - { id: 2, ip_or_cidr: "10.0.0.2/32", is_enabled: false }
            is_ip_allowed:
              - { ip: "10.0.0.1", expected: true }
              - { ip: "10.0.0.2", expected: false }

          - name: "is_ip_allowed fails closed on empty allowlist"
            allowlist: []
            is_ip_allowed:
              - { ip: "10.0.0.1", expected: false }

          # --- IPv4 MATCHING ---
          - name: "is_ip_allowed matches /32 range"
            allowlist:
              - { id: 1, ip_or_cidr: "192.168.1.100/32", is_enabled: true }
            is_ip_allowed:
              - { ip: "192.168.1.100", expected: true }
              - { ip: "192.168.1.101", expected: false }
              - { ip: "192.168.1.99", expected: false }

    Test pyramid levels shown: Unit tests, Service tests, API tests, GUI tests
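A sketch of how these rows could drive a plain unit-level function. The implementation of `is_ip_allowed` below is an assumption built on Python's standard `ipaddress` module, and `CASES` is the slide's table written out as Python data, as it would look after parsing the YAML.

```python
import ipaddress

CASES = [
    {"name": "is_ip_allowed respects enabled flag",
     "allowlist": [
         {"id": 1, "ip_or_cidr": "10.0.0.1/32", "is_enabled": True},
         {"id": 2, "ip_or_cidr": "10.0.0.2/32", "is_enabled": False},
     ],
     "is_ip_allowed": [
         {"ip": "10.0.0.1", "expected": True},
         {"ip": "10.0.0.2", "expected": False},
     ]},
    {"name": "is_ip_allowed fails closed on empty allowlist",
     "allowlist": [],
     "is_ip_allowed": [{"ip": "10.0.0.1", "expected": False}]},
    {"name": "is_ip_allowed matches /32 range",
     "allowlist": [
         {"id": 1, "ip_or_cidr": "192.168.1.100/32", "is_enabled": True},
     ],
     "is_ip_allowed": [
         {"ip": "192.168.1.100", "expected": True},
         {"ip": "192.168.1.101", "expected": False},
         {"ip": "192.168.1.99", "expected": False},
     ]},
]


def is_ip_allowed(allowlist, ip):
    """Hypothetical implementation: allowed if any enabled entry contains the IP."""
    address = ipaddress.ip_address(ip)
    return any(
        entry["is_enabled"] and address in ipaddress.ip_network(entry["ip_or_cidr"])
        for entry in allowlist
    )


def run_cases(cases):
    """Return a message for every check whose result differs from `expected`."""
    failures = []
    for case in cases:
        for check in case["is_ip_allowed"]:
            actual = is_ip_allowed(case["allowlist"], check["ip"])
            if actual != check["expected"]:
                failures.append(f"{case['name']}: {check['ip']} gave {actual}")
    return failures
```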
  37. Test the things that matter!

    “Do you trust our tests?” “Well, at times the AI changes both tests and code, and then…”

    What I care about:
    • at checkout,
    • generate backorders
    • for products that are low on inventory

    ⛔ verify checkInventory called for each cart item
    ⛔ mock
  38. > If an order brings inventory below safe level, back-order to bring it to twice the safe level

    | Case                                      | Starting inventory | Ordered qty | Safe level | Ending inventory | Back-ordered qty |
    | ----------------------------------------- | ------------------ | ----------- | ---------- | ---------------- | ---------------- |
    | Above safe after order                    | 20                 | 5           | 10         | 15               | 0                |
    | Exactly at safe after order               | 15                 | 5           | 10         | 10               | 0                |
    | One below safe after order                | 14                 | 5           | 10         | 9                | 20               |
    | Already below safe before order           | 8                  | 1           | 10         | 7                | 20               |
    | Starts at safe, order reduces it          | 10                 | 1           | 10         | 9                | 20               |
    | Order consumes all stock                  | 5                  | 5           | 10         | 0                | 20               |
    | Order exceeds stock                       | 5                  | 8           | 10         | -3               | 20               |
    | Safe level is zero                        | 5                  | 5           | 0          | 0                | 0                |
    | Safe level is zero and negative inventory | 2                  | 3           | 0          | -1               | 0                |
    | Ordered quantity zero                     | 10                 | 0           | 5          | 10               | 0                |

    Test pyramid levels shown: Unit tests, Service tests, API tests, GUI tests
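A sketch of a function that satisfies this table, with the table itself as the test data. One thing the rows make visible: whenever a back-order is triggered, the back-ordered quantity is exactly twice the safe level (20 for a safe level of 10), not the difference needed to reach twice the safe level. The sketch follows the table; `process_order` and `failing_cases` are hypothetical names.

```python
# Columns: case, starting inventory, ordered qty, safe level,
# expected ending inventory, expected back-ordered qty.
BACKORDER_TABLE = [
    ("Above safe after order", 20, 5, 10, 15, 0),
    ("Exactly at safe after order", 15, 5, 10, 10, 0),
    ("One below safe after order", 14, 5, 10, 9, 20),
    ("Already below safe before order", 8, 1, 10, 7, 20),
    ("Starts at safe, order reduces it", 10, 1, 10, 9, 20),
    ("Order consumes all stock", 5, 5, 10, 0, 20),
    ("Order exceeds stock", 5, 8, 10, -3, 20),
    ("Safe level is zero", 5, 5, 0, 0, 0),
    ("Safe level is zero and negative inventory", 2, 3, 0, -1, 0),
    ("Ordered quantity zero", 10, 0, 5, 10, 0),
]


def process_order(starting, ordered, safe_level):
    """Return (ending inventory, back-ordered qty) as the table expects.

    Assumption taken from the rows: a back-order is placed only when the
    ending inventory is strictly below the safe level, and its quantity is
    twice the safe level.
    """
    ending = starting - ordered
    back_ordered = 2 * safe_level if ending < safe_level else 0
    return ending, back_ordered


def failing_cases(table):
    """Return the names of the rows that the implementation gets wrong."""
    return [
        case
        for case, starting, ordered, safe, ending, back_ordered in table
        if process_order(starting, ordered, safe) != (ending, back_ordered)
    ]
```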
  39. Testing against remote dependencies

    Diagram: “My app, Payment processor” becomes “My app, Wiremock”.

    Wiremock: config for each test 🙁 Hard for stateful cases, e.g. create and retrieve.

        {
          "request": {
            "method": "POST",
            "url": "/payments",
            "bodyPatterns": [
              { "matchesJsonPath": "$[?(@.cardNumber == '4111111111111111')]" }
            ]
          },
          "response": {
            "status": 200,
            "jsonBody": {
              "transactionId": "tx-10001",
              "status": "APPROVED",
              "authorizationCode": "AUTH-777888",
              "amount": 149.99,
              "currency": "EUR"
            },
            "headers": { "Content-Type": "application/json" }
          }
        }
  40. Testing against remote dependencies

    Diagram: “My app, Payment processor” becomes “My app, Custom simulator”.

    AI can build a stateful simulator, based on the service's published API.

    Based on client name:
    • Jane OK: always accepted
    • John KO: always refused
    • Jack ERR: network error
    • Joe HANG: hangs indefinitely

    The simulator can:
    • Generate a progressive ID
    • Remember payment requests
    • Return payment status by ID
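A minimal in-process sketch of such a stateful simulator. The class name, method names, the `tx-N` ID format and the `REFUSED` status string are assumptions for illustration; a real simulator would expose the payment processor's published HTTP API. Instead of really hanging, the `HANG` case raises `TimeoutError` so that the example stays runnable.

```python
import itertools


class PaymentSimulator:
    """Hypothetical stateful stand-in for a remote payment processor."""

    def __init__(self):
        self._ids = itertools.count(1)  # progressive payment IDs
        self._payments = {}             # remembered payment requests

    def create_payment(self, client_name, amount):
        """Create a payment; behaviour depends on the client name suffix."""
        if client_name.endswith("ERR"):
            raise ConnectionError("simulated network error")
        if client_name.endswith("HANG"):
            # A real simulator would block; a timeout keeps the sketch runnable.
            raise TimeoutError("simulated hang")
        payment_id = f"tx-{next(self._ids)}"
        status = "REFUSED" if client_name.endswith("KO") else "APPROVED"
        self._payments[payment_id] = {
            "client": client_name,
            "amount": amount,
            "status": status,
        }
        return payment_id

    def get_status(self, payment_id):
        """Return the status of a previously created payment."""
        return self._payments[payment_id]["status"]
```

Because the simulator remembers requests, a "create and retrieve" scenario needs no per-test stub configuration: the test creates a payment and then asks for its status by ID.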
  41. Testing against GMail or Slack?

    StrongDM implemented simulators of services such as GMail and Jira, based on their public APIs.
  42. What to do now

    • No tests?
      ◦ Pair with the AI, write tabular customer tests
    • Tests unreadable?
      ◦ Throw them away
      ◦ Pair with the AI, write tabular customer tests
    • Tests not trusted?
      ◦ Same as above
    • Tests slow and unreliable?
      ◦ Test under the skin of the UI
      ◦ Replace remote dependencies with local versions
      ◦ Implement a simulator of the remote dependency
      ◦ Or implement a simulator of the adapter to the remote dependency

    Make AI coding safe and fun!
  43. Train with me! AI and legacy modernization

    • June 23, in person, Software Quality Forum
    • July 7-9, online, Avanscoperta

    (Both trainings are in Italian)
  44. Let’s make the world a better place

    Matteo Vaccari
    Technical Principal
    [email protected]