A Harness for Behaviour: how to get AI to generate code that does what we intend, or "TDD in the age of AI"

© 2026 Thoughtworks | Confidential

A harness for behaviour: TDD in the age of AI

We are promised a new golden age

But quality is declining

Nearly 30% of all merged code is AI-generated. While throughput is up, some teams face a 50% increase in defects.

AI-generated code breaks more often and takes longer to fix: branch success rates dropped to 70.8%, the lowest in over five years.

Sources: CircleCI 2026 State of Software Delivery; DX AI-assisted engineering: Q1 impact report

Faros AI Engineering Report 2026:
• 51% PR Size
• 28% Bugs per PR
• 5X Median Review Time
• 3X Incidents per PR
• 10X Code Churn

Cloudbees' State of Code Abundance 2026: increased production incidents reported by 81% of respondents

Problem statement

At the moment, most people who give high autonomy to their coding agents do this:
• A functional specification
• Check if:
  ◦ The AI-generated test suite is green,
  ◦ Has reasonably high coverage,
  ◦ Maybe monitor test quality with mutation testing
  ◦ Then do manual testing

Birgitta Böckeler, Distinguished Engineer, Thoughtworks

… Is this enough?

About me

• 1998: PhD in Formal Methods
• 2002: “Discovered” Extreme Programming
• 2007-2014: XP coach
• 2015-present: Technical Principal @ Thoughtworks
• 2025-present: AI-assisted developer @ Thoughtworks

The old playbook is broken

The old TDD playbook

Kent Beck invented TDD in 1999
Very effective for manual development
https://martinfowler.com/bliki/TestDrivenDevelopment.html

AI does not like the TDD rules

• AI wants to write tests after the implementation
• AI wants to write all the tests at once

You can force the AI to follow the strict TDD process, but it takes time and tokens. Is it worth it?

AI tests cannot be trusted

• AI likes to write tests against implementations
• Then AI changes both tests and implementation, destroying our confidence that tests are protecting us

“Do you trust our tests?”
“Well, at times the AI changes both tests and code, and then…”

Tests coupled to the implementation

Ecommerce app

What I care about:
• at checkout,
• generate backorders
• for products that are low on inventory

What the AI tests:
• verify order creation with correct parameters
• verify checkInventory called for each cart item
• verify returns “orderdone”

The AI tests (with mocks) that
• method A calls method B
• method B calls method C
• etc.

The volume of generated tests is a problem

Reading generated tests is even harder than reading generated code

We are all still figuring this out. This presentation will be different next month

Automated tests are even more necessary

The delivery figures from “But quality is declining” are shown again here: nearly 30% of merged code is AI-generated, and defects, incidents and code churn are up.

A spec without tests is not a spec

The old Extreme Programming playbook recommends writing examples of the expected behaviour. We call them acceptance tests.

Customer tests to the rescue

Customer tests, aka acceptance tests, are a practice of Extreme Programming

Test desiderata

| Property                      | AI-generated unit tests | Customer tests |
| ----------------------------- | ----------------------- | -------------- |
| Predict success in production | No                      | Yes            |
| Fast                          | Yes                     | It depends     |
| Support refactoring           | No                      | Yes            |
| Low total cost of ownership   | No                      | Yes            |

See Emily Bache, Test Desiderata 2.0

Problem: tests are hard to read

    @Test
    void defaultGreeting() throws IOException, InterruptedException {
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create(baseUrl + "/hello"))
                .GET()
                .build();

        HttpResponse<String> response = client.send(request, HttpResponse.BodyHandlers.ofString());

        assertEquals(200, response.statusCode());
        assertEquals("Hello, world!", response.body());
    }

    @Test
    void personalisedGreeting() throws IOException, InterruptedException {
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create(baseUrl + "/hello?name=Joe"))
                .GET()
                .build();

        HttpResponse<String> response = client.send(request, HttpResponse.BodyHandlers.ofString());

        assertEquals(200, response.statusCode());
        assertEquals("Hello, Joe!", response.body());
    }

Solution: focus on what matters

Out of everything in the two tests above, only this matters:

    /hello          → "Hello, world!"
    /hello?name=Joe → "Hello, Joe!"

The new playbook

    - description: default greeting
      request:
        path: /hello
      expected_response:
        body: "Hello, world!"

    - description: personalised greeting
      request:
        path: /hello
        query:
          name: Joe
      expected_response:
        body: "Hello, Joe!"

    - description: name parameter empty
      request:
        path: /hello
        query:
          name: ""
      expected_response:
        body: "Hello, stranger!"

• Most tests should be tabular
• Most test tables should be text files

Process:
1. The AI builds the test runner
2. The AI writes the first draft of the test table
3. You check and integrate the table
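The runner from step 1 can stay small. What follows is a minimal sketch in Go, not the runner used in the talk. It assumes the table is stored as JSON (Go's standard library has no YAML parser, so a YAML table like the one above would need a third-party one), and it uses a hypothetical `app` handler as a stand-in for the system under test. The names `testCase`, `testTable`, `app` and `runCases` are illustrative.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"net/http/httptest"
	"net/url"
)

// testCase mirrors one row of the test table.
type testCase struct {
	Description string `json:"description"`
	Request     struct {
		Path  string            `json:"path"`
		Query map[string]string `json:"query"`
	} `json:"request"`
	ExpectedResponse struct {
		Body string `json:"body"`
	} `json:"expected_response"`
}

// testTable holds the same three rows as the table above, written as JSON.
const testTable = `[
  {"description": "default greeting",
   "request": {"path": "/hello"},
   "expected_response": {"body": "Hello, world!"}},
  {"description": "personalised greeting",
   "request": {"path": "/hello", "query": {"name": "Joe"}},
   "expected_response": {"body": "Hello, Joe!"}},
  {"description": "name parameter empty",
   "request": {"path": "/hello", "query": {"name": ""}},
   "expected_response": {"body": "Hello, stranger!"}}
]`

// app is a hypothetical stand-in for the application under test.
var app http.Handler = http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
	q := r.URL.Query()
	name := "world"
	if q.Has("name") {
		name = q.Get("name")
		if name == "" {
			name = "stranger"
		}
	}
	fmt.Fprintf(w, "Hello, %s!", name)
})

// runCases sends every request in the table to the handler and
// returns one message per row whose response body does not match.
func runCases(h http.Handler, table string) ([]string, error) {
	var cases []testCase
	if err := json.Unmarshal([]byte(table), &cases); err != nil {
		return nil, err
	}
	var failures []string
	for _, tc := range cases {
		target := tc.Request.Path
		if len(tc.Request.Query) > 0 {
			values := url.Values{}
			for k, v := range tc.Request.Query {
				values.Set(k, v)
			}
			target += "?" + values.Encode()
		}
		rec := httptest.NewRecorder()
		h.ServeHTTP(rec, httptest.NewRequest(http.MethodGet, target, nil))
		if got := rec.Body.String(); got != tc.ExpectedResponse.Body {
			failures = append(failures,
				fmt.Sprintf("%s: got %q, want %q", tc.Description, got, tc.ExpectedResponse.Body))
		}
	}
	return failures, nil
}

func main() {
	failures, err := runCases(app, testTable)
	if err != nil {
		panic(err)
	}
	for _, f := range failures {
		fmt.Println("FAIL", f)
	}
	fmt.Printf("%d failing rows\n", len(failures))
}
```

The runner knows nothing about greetings: adding a test means adding a row to the table, and a change to the expected behaviour shows up as a small diff in a data file.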

The idea is not new: Go

In Go, most tests are tabular. These are tests for the “fmt” package:

    var fmtTests = []struct {
        fmt string
        val any
        out string
    }{
        {"%d", 12345, "12345"},
        {"%v", 12345, "12345"},
        {"%t", true, "true"},

        // basic string
        {"%s", "abc", "abc"},
        {"%q", "abc", `"abc"`},
        {"%x", "abc", "616263"},
        {"%x", "\xff\xf0\x0f\xff", "fff00fff"},
        {"%X", "\xff\xf0\x0f\xff", "FFF00FFF"},
        {"%x", "", ""},
        {"% x", "", ""},
        {"%#x", "", ""},
        {"%# x", "", ""},
        {"%x", "xyz", "78797a"},
        {"%X", "xyz", "78797A"},
        {"% x", "xyz", "78 79 7a"},
        {"% X", "xyz", "78 79 7A"},
        {"%#x", "xyz", "0x78797a"},
        {"%#X", "xyz", "0X78797A"},
        {"%# x", "xyz", "0x78 0x79 0x7a"},
        {"%# X", "xyz", "0X78 0X79 0X7A"},
        // (excerpt)
    }

Each row reads as format, input, expected:

    {"%q",    "abc",   `"abc"`},
     ^ format  ^ input  ^ expected
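The loop that turns such a table into tests is only a few lines. The sketch below runs a handful of the rows above through `fmt.Sprintf`. It is an illustration, not the Go project's own test code: the field names `format` and `want` and the function `checkFmtTable` are made up here, and a real Go test would report failures through the `testing` package rather than return strings.

```go
package main

import "fmt"

// fmtRows is a small subset of the rows shown above.
var fmtRows = []struct {
	format string
	val    any
	want   string
}{
	{"%d", 12345, "12345"},
	{"%t", true, "true"},
	{"%q", "abc", `"abc"`},
	{"%x", "abc", "616263"},
	{"% x", "xyz", "78 79 7a"},
	{"%#X", "xyz", "0X78797A"},
}

// checkFmtTable formats every row and returns a message for each mismatch.
func checkFmtTable() []string {
	var failures []string
	for _, row := range fmtRows {
		if got := fmt.Sprintf(row.format, row.val); got != row.want {
			failures = append(failures,
				fmt.Sprintf("Sprintf(%q, %v) = %q, want %q", row.format, row.val, got, row.want))
		}
	}
	return failures
}

func main() {
	for _, f := range checkFmtTable() {
		fmt.Println("FAIL", f)
	}
	fmt.Println("checked", len(fmtRows), "rows")
}
```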

The idea is not new: Fit

The approach is similar to Ward Cunningham's Fit, which interprets document tables as tests
http://fit.c2.com/

The idea is not new: Specification by example

> Premium customers get free shipping for large orders except for oversized items.

| Customer | Order Total | Contains Oversized Item | Expected Shipping |
| -------- | ----------- | ----------------------- | ----------------- |
| Standard | €40         | No                      | €6.99             |
| Standard | €120        | No                      | €0.00             |
| Premium  | €40         | No                      | €0.00             |
| Premium  | €40         | Yes                     | €14.99            |
| Premium  | €120        | Yes                     | €14.99            |

BDD and Specification by example are about tests as communication

A real-life backend test…

Components in the diagram: Test, DB, HTTP API, Backend

    - id: "custom reports available to buyer"
      request:
        method_and_path: "GET /api/report/custom-reports/"
        auth_email: "[email protected]"
        headers: { "x-Role": "buyer" }
      response:
        status: 200
        json:
          count: 1
          data:
            - ID: "custom-{{RANDOM_ID}}"
              NAME: "Custom"
              <IGNORE_EXTRA_KEYS>: true

    - id: "custom reports hidden to e2e"
      request:
        method_and_path: "GET /api/report/custom-reports/"
        auth_email: "[email protected]"
        headers: { "x-Role": "e2e" }
      response:
        status: 403
        json:
          error: "Forbidden"
          message: "Report non disponibili"

…and a frontend test

Components in the diagram: Test, Stubbed API, jsdom, Frontend

    name: "Audit log page"
    start:
      url: "/admin/audit-log"
    stubs:
      - method: "GET"
        path: "/api/admin/audit-log"
        fixture: "audit-log.json"
    steps:
      # Wait for table to load
      - action: "waitFor"
        target: { text: "Creato nuovo mondo" }

      # Table is sorted by date
      - action: "assert"
        assertions:
          - type: "contents"
            target: { role: "table" }
            expected_contents:
              - ["When", "Action", "Detail", "Author"]
              - ["12/05/2026 08:51:04", "CREATE", "Created 'new page'", "[email protected]"]
              - ["13/05/2026 09:51:04", "UPDATE", "Updated 'new page'", "[email protected]"]
              - ["14/05/2026 10:51:04", "DELETE", "Deleted 'new page'", "[email protected]"]

      - action: "click"
        target: { role: "button", name: "Sort by Action" }

      # Now table is sorted by action
      - action: "assert"
        assertions:
          - type: "contents"
            target: { role: "table" }
            expected_contents:
              - ["When", "Action", "Detail", "Author"]
              - ["12/05/2026 08:51:04", "CREATE", "Created 'new page'", "[email protected]"]
              - ["14/05/2026 10:51:04", "DELETE", "Deleted 'new page'", "[email protected]"]
              - ["13/05/2026 09:51:04", "UPDATE", "Updated 'new page'", "[email protected]"]

Some things are hard to explain

Excerpt from the Memoir '44 rules

💡 Invent a testing mini-language

    - name: line of sight blocked by woods
      board:
        - ". . . . . "
        - " . B . . ."
        - "A . W C . "
        - " . . . . ."
      units:
        - id: A
          side: Allies
          type: Infantry
        - id: B
          side: Axis
          type: Infantry
          terrain: Woods
        - id: C
          side: Axis
          type: Infantry
      assert_available_actions:
        - Battle A to B

Actually I copied this from Ivett

See Ivett Ördög's Approved Scenarios pattern and video

The core idea

1. Tests should express what you care about
   …so that you can understand them
2. Tests should be tabular
   …so that it's easy to add tests
3. Tests should be data, not code
   …so that I can easily spot when AI is changing them
4. Black box testing
   …so that the tests are not coupled to the implementation

What about the test pyramid?

From top to bottom:
• GUI tests (slow and expensive)
• API tests
• Service tests
• Unit tests (fast and cheap)

Unit tests still have a place

Components in the diagram: Test, Component A, Component B, Component C

• We want to test interesting behaviour of C
• But we must “persuade” A and B to set up C the way we want
• The “build a ship in a bottle” effect

Customer tests implemented as unit tests

    tests:
      - name: "is_ip_allowed respects enabled flag"
        allowlist:
          - { id: 1, ip_or_cidr: "10.0.0.1/32", is_enabled: true }
          - { id: 2, ip_or_cidr: "10.0.0.2/32", is_enabled: false }
        is_ip_allowed:
          - { ip: "10.0.0.1", expected: true }
          - { ip: "10.0.0.2", expected: false }

      - name: "is_ip_allowed fails closed on empty allowlist"
        allowlist: []
        is_ip_allowed:
          - { ip: "10.0.0.1", expected: false }

      # --- IPv4 MATCHING ---
      - name: "is_ip_allowed matches /32 range"
        allowlist:
          - { id: 1, ip_or_cidr: "192.168.1.100/32", is_enabled: true }
        is_ip_allowed:
          - { ip: "192.168.1.100", expected: true }
          - { ip: "192.168.1.101", expected: false }
          - { ip: "192.168.1.99", expected: false }
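To make the unit under test concrete, here is one way a function satisfying those rows could look. This is a sketch under assumptions, not the code behind the slide: the Go names `allowlistEntry` and `isIPAllowed` are invented to mirror the YAML keys, and treating unparseable input as "not allowed" is a choice made here, in keeping with the fail-closed row.

```go
package main

import (
	"fmt"
	"net/netip"
)

// allowlistEntry mirrors one allowlist row of the test table.
type allowlistEntry struct {
	ID        int
	IPOrCIDR  string
	IsEnabled bool
}

// isIPAllowed reports whether ip falls inside at least one enabled entry.
// An empty allowlist, an unparseable address and an unparseable entry
// all lead to false (fail closed).
func isIPAllowed(allowlist []allowlistEntry, ip string) bool {
	addr, err := netip.ParseAddr(ip)
	if err != nil {
		return false
	}
	for _, entry := range allowlist {
		if !entry.IsEnabled {
			continue
		}
		prefix, err := netip.ParsePrefix(entry.IPOrCIDR)
		if err != nil {
			continue
		}
		if prefix.Contains(addr) {
			return true
		}
	}
	return false
}

func main() {
	allowlist := []allowlistEntry{
		{ID: 1, IPOrCIDR: "10.0.0.1/32", IsEnabled: true},
		{ID: 2, IPOrCIDR: "10.0.0.2/32", IsEnabled: false},
	}
	checks := []struct {
		ip       string
		expected bool
	}{
		{"10.0.0.1", true},
		{"10.0.0.2", false},
	}
	for _, c := range checks {
		fmt.Printf("%s: got %v, want %v\n", c.ip, isIPAllowed(allowlist, c.ip), c.expected)
	}
}
```

The test rows never mention how the allowlist is stored or which helper parses a CIDR, so the implementation can change freely while the table stays the same.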

Test the things that matter!

What I care about:
• at checkout,
• generate backorders
• for products that are low on inventory

⛔ verify checkInventory called for each cart item
⛔ mock

> If an order brings inventory below safe level, back-order to bring it to twice the safe level

| Case                                      | Starting inventory | Ordered qty | Safe level | Ending inventory | Back-ordered qty |
| ----------------------------------------- | ------------------ | ----------- | ---------- | ---------------- | ---------------- |
| Above safe after order                    | 20                 | 5           | 10         | 15               | 0                |
| Exactly at safe after order               | 15                 | 5           | 10         | 10               | 0                |
| One below safe after order                | 14                 | 5           | 10         | 9                | 20               |
| Already below safe before order           | 8                  | 1           | 10         | 7                | 20               |
| Starts at safe, order reduces it          | 10                 | 1           | 10         | 9                | 20               |
| Order consumes all stock                  | 5                  | 5           | 10         | 0                | 20               |
| Order exceeds stock                       | 5                  | 8           | 10         | -3               | 20               |
| Safe level is zero                        | 5                  | 5           | 0          | 0                | 0                |
| Safe level is zero and negative inventory | 2                  | 3           | 0          | -1               | 0                |
| Ordered quantity zero                     | 10                 | 0           | 5          | 10               | 0                |
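A table like this can drive a unit-level test with no mocks at all. The sketch below is illustrative only. It assumes, reading the rows rather than any code from the talk, that the back-ordered quantity is twice the safe level whenever the ending inventory drops below the safe level, and zero otherwise. The names `placeOrder`, `backOrderCases` and `checkBackOrderTable` are hypothetical.

```go
package main

import "fmt"

// placeOrder returns the ending inventory and the back-ordered quantity.
// Assumption taken from the table rows: when the ending inventory is below
// the safe level, the back-ordered quantity is twice the safe level.
func placeOrder(starting, ordered, safe int) (ending, backOrdered int) {
	ending = starting - ordered
	if ending < safe {
		backOrdered = 2 * safe
	}
	return ending, backOrdered
}

// backOrderCases repeats the rows of the table above.
var backOrderCases = []struct {
	name                    string
	starting, ordered, safe int
	ending, backOrdered     int
}{
	{"Above safe after order", 20, 5, 10, 15, 0},
	{"Exactly at safe after order", 15, 5, 10, 10, 0},
	{"One below safe after order", 14, 5, 10, 9, 20},
	{"Already below safe before order", 8, 1, 10, 7, 20},
	{"Starts at safe, order reduces it", 10, 1, 10, 9, 20},
	{"Order consumes all stock", 5, 5, 10, 0, 20},
	{"Order exceeds stock", 5, 8, 10, -3, 20},
	{"Safe level is zero", 5, 5, 0, 0, 0},
	{"Safe level is zero and negative inventory", 2, 3, 0, -1, 0},
	{"Ordered quantity zero", 10, 0, 5, 10, 0},
}

// checkBackOrderTable runs every row and returns a message per mismatch.
func checkBackOrderTable() []string {
	var failures []string
	for _, c := range backOrderCases {
		ending, backOrdered := placeOrder(c.starting, c.ordered, c.safe)
		if ending != c.ending || backOrdered != c.backOrdered {
			failures = append(failures,
				fmt.Sprintf("%s: got (%d, %d), want (%d, %d)",
					c.name, ending, backOrdered, c.ending, c.backOrdered))
		}
	}
	return failures
}

func main() {
	for _, f := range checkBackOrderTable() {
		fmt.Println("FAIL", f)
	}
	fmt.Println("checked", len(backOrderCases), "rows")
}
```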

Testing against remote dependencies

In production: My app, Payment processor
In tests: My app, Wiremock

Wiremock: config for each test
🙁 Hard for stateful cases, e.g. create and retrieve

    {
      "request": {
        "method": "POST",
        "url": "/payments",
        "bodyPatterns": [
          { "matchesJsonPath": "$[?(@.cardNumber == '4111111111111111')]" }
        ]
      },
      "response": {
        "status": 200,
        "jsonBody": {
          "transactionId": "tx-10001",
          "status": "APPROVED",
          "authorizationCode": "AUTH-777888",
          "amount": 149.99,
          "currency": "EUR"
        },
        "headers": {
          "Content-Type": "application/json"
        }
      }
    }

Testing against remote dependencies

In production: My app, Payment processor
In tests: My app, Custom simulator

AI can build a stateful simulator, based on the service's published API

Based on client name:
• Jane OK: always accepted
• John KO: always refused
• Jack ERR: network error
• Joe HANG: hangs indefinitely

The simulator should:
• Generate a progressive ID
• Remember payment requests
• Return payment status by ID
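The core of such a simulator is a small amount of in-memory state. The following sketch shows the idea without the HTTP layer. Everything in it is an assumption made for illustration: the type `simulator`, the `tx-N` ID format, the status strings, matching on the client name suffix, and accepting any name that is not listed. The hanging case (Joe HANG) is left out to keep the example short.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
	"sync"
)

// errNetwork stands in for the network error of the "ERR" client.
var errNetwork = errors.New("simulated network error")

// simulator is an in-memory, stateful fake of a payment processor.
type simulator struct {
	mu       sync.Mutex
	nextID   int
	statuses map[string]string // payment ID -> status
}

func newSimulator() *simulator {
	return &simulator{statuses: make(map[string]string)}
}

// pay decides the outcome from the client name, assigns a progressive ID
// and remembers the payment so that it can be retrieved later.
func (s *simulator) pay(client string) (id, status string, err error) {
	switch {
	case strings.HasSuffix(client, " ERR"):
		return "", "", errNetwork
	case strings.HasSuffix(client, " KO"):
		status = "REFUSED"
	default:
		status = "APPROVED"
	}
	s.mu.Lock()
	defer s.mu.Unlock()
	s.nextID++
	id = fmt.Sprintf("tx-%d", s.nextID)
	s.statuses[id] = status
	return id, status, nil
}

// status returns the remembered status of a payment, if the ID is known.
func (s *simulator) status(id string) (string, bool) {
	s.mu.Lock()
	defer s.mu.Unlock()
	st, ok := s.statuses[id]
	return st, ok
}

func main() {
	sim := newSimulator()
	for _, client := range []string{"Jane OK", "John KO", "Jack ERR"} {
		id, status, err := sim.pay(client)
		fmt.Println(client, id, status, err)
	}
}
```

Because the simulator remembers what it was told, a create-then-retrieve scenario needs no per-test configuration: the test picks a client name and the simulator behaves accordingly.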

Testing against GMail or Slack?

StrongDM implemented simulators of GMail, Jira and other services, based on their public APIs

What to do now

• No tests?
  ◦ Pair with the AI, write tabular customer tests
• Tests unreadable?
  ◦ Throw them away
  ◦ Pair with the AI, write tabular customer tests
• Tests not trusted?
  ◦ Same as above
• Tests slow and unreliable?
  ◦ Test under the skin of the UI
  ◦ Replace remote dependencies with local versions
  ◦ Implement a simulator of the remote dependency
  ◦ Or implement a simulator of the adapter to the remote dependency

Make AI coding safe and fun!

Train with me! AI and legacy modernization

• 23 June, in person, Software Quality Forum
• 7-9 July, online, Avanscoperta

(Both trainings are in Italian)

Let's make the world a better place

Matteo Vaccari
Technical Principal
[email protected]
