

Trust No Bot? Forging Confidence in AI for Software Engineering

Keynote at FORGE 2025, Ottawa, Canada.
https://conf.researchr.org/program/forge-2025

The truth is out there… and so is the AI revolution. Foundation models and AI-driven tools are transforming software engineering, offering unprecedented efficiencies while introducing new uncertainties. As developers, we find ourselves in uncharted territory: these tools promise to accelerate productivity and reshape our workflows, but can we really trust them? Like any good investigator, we must question the systems we rely on. Are AI-based tools reliable, transparent, and aligned with developer needs? Or are they inscrutable black boxes with hidden risks? Trust isn’t just a nice-to-have—it’s the key factor determining whether AI integration succeeds or spirals into skepticism. In this keynote, I will uncover the evolving role of AI in software engineering and explore how we can build, measure, and foster trust in these tools. I will also reveal why the FORGE community is uniquely positioned to lead this charge, ensuring that AI becomes a trusted partner—not an unsolved mystery. After all, when it comes to AI in software development… should we trust no bot? (This abstract came to life with a little help from ChatGPT and a lot of love for THE X-FILES.)

Thomas Zimmermann

April 27, 2025


Transcript

  1. The X-Files (1993–2002): A sci-fi TV series about FBI agents

    investigating unexplained phenomena, conspiracies, and hidden truths. Mulder and Scully: Partners with opposing worldviews—Mulder believes, Scully demands evidence. Core Themes: Trust vs. skepticism; Hidden systems shaping reality; The search for truth under uncertainty Like Mulder and Scully, today’s developers must question, investigate, and demand transparency from AI systems. "The truth is out there…" — but in AI, we have to work harder to find it.
  2. Christian Bird, Denae Ford, Thomas Zimmermann, Nicole Forsgren, Eirini Kalliamvakou,

    Travis Lowdermilk, and Idan Gazit. ACM Queue, Volume 20, Issue 6, November/December 2022, pp 35–57 Developers reported spending less time on Stack Overflow due to Copilot’s code suggestions. Developers' roles shifted from primarily writing code to reviewing and understanding code suggested by AI. Copilot opened new learning opportunities like mastering new programming languages. Developers' trust plays a crucial role in adoption, as any unexpected behavior can significantly impact its usage.
  3. Trust matters for tool adoption and tool usage So many

    tools... yet so few in use! Meanwhile, tools continue to emerge and evolve from traditional to AI-assisted tools Lack of trust in a tool can lead to suboptimal use and poor outcomes
  4. Designing AI systems for responsible trust is important “Overreliance on

    AI occurs when users start accepting incorrect AI outputs. This can lead to issues and errors… An important goal of AI system design is to empower users to develop appropriate reliance on AI. ” (Passi and Vorvoreanu, 2022)
  5. Developers’ calibrated trust in AI is a prerequisite for their safe and effective use of AI tools. Lack of trust hinders adoption. Blind trust leads to overlooking mistakes.
  6. “The code is watching you” Ghost in the Machine (S1E7)

    – On Halloween, Mulder and Scully investigate the death of a corporate executive who may have been murdered by a thinking computer. AI is built for efficiency—but who ensures its safety and ethics? We must embed responsibility into AI systems from design to deployment. What happens when software outgrows its creators?
  7. “Code with a mind of its own“ Kill Switch (S5E11)

    – The brutal murder of a renowned computer programmer leads Mulder and Scully to investigate an artificial intelligence program loose on the Internet that has begun evolving on its own. How do we align powerful models with human intent? Alignment is trust in action. How do you reason with a system that no longer needs you?
  8. “Trust is the first casualty of secrecy“ E.B.E. (S1E17) –

    Mulder and Scully become the focus of a disinformation campaign when they attempt to trace the government's secret transport of an alien life form. Black-box AI models obscure decision-making processes. Developers need mechanisms for AI transparency and explainability to build trust. If you can't see how it works, can you trust it?
  9. “Programming the human mind“ Wetwired (S3E23) – Mulder and Scully

    investigate a series of murders linked to a mysterious device that alters television signals causing paranoid hallucinations. One of them falls victim to it. AI outputs aren’t just answers—they can manipulate and shape decisions. Trust demands awareness. When AI subtly shifts your thinking, is it still your choice?
  10. “Evidence is buried deeper than data“ The Erlenmeyer Flask (S1E24)

    – While on the trail of a killer with superhuman powers, Mulder and Scully discover a government conspiracy: Secret experiments splice alien and human DNA, hidden from public scrutiny and ethical oversight. What assumptions were coded into your AI? Training data isn’t neutral. Transparency in origin matters. What’s your model hiding in its “DNA”?
  11. What factors influence developers' trust in software tools? Brittany Johnson,

    Christian Bird, Denae Ford, Nicole Forsgren, Thomas Zimmermann: Make Your Tools Sparkle with Trust: The PICSE Framework for Trust in Software Tools. ICSE-SEIP 2023: 409-419
  12. What factors influence developers’ trust in software tools? Interviews with 18 practitioners (define and discuss trust in tools and collaborators) → Transcribe → Code (codebook v1) → Thematic Analysis → PICSE → Validate: survey with 300+ practitioners.
  13. What factors influence developers’ trust in software tools? The PICSE framework:
    Personal: intrinsic, extrinsic, and social factors (Community; Source reputation; Clear advantages)
    Interaction: factors related to engagement with the tool (Validation support; Feedback loops; Educational value)
    Control: factors related to control over usage (Ownership; Autonomy; Workflow integration)
    System: properties of the tool before and during use (Ease of installation; Polished presentation; Safe and secure; Correctness; Consistency; Performance; Transparent data practices)
    Expectations: meeting the expectations developers built (Style matching; Goal matching; Meeting expectations)
  15. Personal (intrinsic, extrinsic, and social factors). Community: there's an accessible community of developers that use the tool. "That's probably recommended because over the community that's how it's preferable. Then you're leaning towards more into the more community-wide practices." - Software Dev Engineer Lead. "Even if I trust the brand, nobody else is on there... I wouldn't download the app, the social media. If there is no network, why would I use it?" - Software Engineer
  16. Personal (intrinsic, extrinsic, and social factors). Source reputation: the reputation of or familiarity with the individual, organization, or platform associated with introduction to the tool. "If a person that I personally trust a lot, for example, a coworker that I work closely with and that I have a lot of respect to, then, of course, that also carries weight." - Software Engineer. "I definitely get more excited about a Microsoft tool or product as opposed to a Google product or an Amazon product." - Senior Software Engineer
  17. Personal (intrinsic, extrinsic, and social factors). Clear advantages: the ability to see the benefits of using the tool, typically from use and validation by others. "When I tuned into that, it was a combination of seeing that and seeing how powerful it was and how easy it was... What am I doing? This looks great." - Systems Engineer. "...while I'm in that car and AI is doing the right thing, I'll see, it actually stopped the right car. It actually identified that someone crossing the road and all those small nitpick details. Then that trust will build up and I can rely on AI okay." - Software Dev Engineer Lead
  18. Source reputation (especially relevant for adoption) Introduce tools via trusted

    sources Requires knowledge of network (perhaps best for internal tools) Clear advantages (especially relevant for adoption) Provide tool demos and comparisons Create forums for showcasing new tools (particularly internally) Community (before and during use) Build and foster community around tool Make visible and accessible (common on GitHub) Using PICSE for building trust
  19. Are there differences between trust in traditional tools and AI-assisted tools? Generally, similar priorities for both: consistency and meeting expectations are important, while interaction factors are generally less important. However, for AI-assisted tools, developers prioritize validation support, autonomy, and source reputation (who built it), and deprioritize factors like goal matching and style matching. Our study found several similarities between developer trust in AI-assisted tools and trust in their collaborators!
  20. ICSE 2025 Research Track Thu 1 May 2025 15:15 -

    15:30 at 207 - Human and Social using AI 1
  21. How can we design for trust in AI-powered code generation

    tools? Ruotong Wang, Ruijia Cheng, Denae Ford, Thomas Zimmermann: Investigating and Designing for Trust in AI-powered Code Generation Tools. FAccT 2024
  22. The MATCH model for responsible trust Designing for Responsible Trust

    in AI Systems: A Communication Perspective, Q. Vera Liao & S. Shyam Sundar, FAccT 2022 Trustworthiness of AI systems can be communicated via system design. (Liao and Sundar, FAccT 2022)
  23. The MATCH model for responsible trust Designing for Responsible Trust

    in AI Systems: A Communication Perspective, Q. Vera Liao & S. Shyam Sundar, FAccT 2022 A trustworthiness cue is any information within a system that can cue, or contribute to, users’ trust judgements. Trust affordances are displayed properties of a system that engender trustworthiness cues. Trust heuristics are any rules of thumb applied by a user to associate a given cue with a judgment of trustworthiness. (Liao and Sundar, 2022)
  24. Research Questions What do developers need to build appropriate trust

    with AI-code generation tools? What challenges do developers face in the current trust-building process? How can we design UX enhancements to support users building appropriate trust? Study 1: Experience sampling + Debrief interview Study 2: Design probe + Interviews Understand notions of trust Explore potential design solutions
  25. Study 1: Experience sampling + Debrief interview. Procedures: a week of collecting significant moments of using Copilot via screenshots and short descriptions, prompted through Microsoft Teams when you are appreciative of, frustrated by, or hesitant/uncertain to use the code generation tool. Participants: randomly sampled 1,500 internal developers + interns Teams channel; 17 participants with various levels of programming experience and experience with AI-powered code generation tools. Example of an experience entry.
  26. Finding 1: Developers’ information needs in building appropriate trust. Developers need to build reasonable expectations of the AI tool’s ability and risks to build appropriate trust • What benefits to expect when collaborating with AI • What use cases to use AI for • What the security and privacy implications of using AI are. “It comes back to learning what Copilot is suited for versus not suited for, just building the intuition. Once you have that intuition, you don’t put Copilot into positions where you know it will fail...” (P13)
  27. Developers want information about to what extent and in what

    way they can control and set preference for... • What the AI produces • When and how AI steps in • What code context AI uses “I don’t want Copilot to give me anything unless I type trigger.... It’s too much. It started as a co-pilot, but now it’s the pilot and I’m becoming the co-pilot.” (P8) Finding 1: Developers’ information needs in building appropriate trust
  28. The evaluation of AI suggestions in each specific instance forms the basis of developers’ trust perception of AI code generation tools. • How good the suggestion is • Why the suggestions are made. Strategies to make sure that “the code is actually correct”: • Logically go through the problem • Validate by running the code • Write formal tests. Finding 1: Developers’ information needs in building appropriate trust
  29. Expectation of AI’s ability and risks • What benefits to expect when collaborating with AI • What use cases to use AI for • What security and privacy implications the AI brings. Ways to control AI • What the AI produces • When and how AI steps in • What code context AI uses. Quality and reasons of AI suggestions • How good the suggestion is • Why the suggestions are made. Finding 1: Developers’ information needs in building appropriate trust
  30. Finding 2: Challenges developers face in building appropriate trust. Setting proper expectations • Bias from initial experience and experience with similar tools • “It takes three good recommendations to build trust versus one bad recommendation to lose trust.” (P5) Controlling AI tools • Lack of guidance to harness AI • “I felt like a lot of the time I ended up just fighting it.” (P7) Inadequate support for evaluating individual AI suggestions • Lack of debugging support and cognitive load of reviewing • “The code reviews cost you more than actually writing the code.” (P8)
  31. Study 2: Design probe + Interviews. Procedures: using three design probes, interview developers about affordances and trustworthiness cues that support building appropriate trust. 1. Control mechanisms to set preferences 2. Explanation of suggestions 3. Feedback analytics. Participants: 12 internal and external developers with varied experience in code generation tools, work experience, and roles in teams.
  32. Design recommendations for tool builders. Empower users to build appropriate expectations by • Communicating the use cases and the potential risks and benefits of the system • Designing for evolving trust. Offer affordances and guidance in customizing the system. Provide signals for assessing the quality of code suggestions.
  33. How do online communities affect developers’ trust in AI-powered tools?

    Ruijia Cheng, Ruotong Wang, Thomas Zimmermann, Denae Ford: “It would work for me too”: How Online Communities Shape Software Developers' Trust in AI-Powered Code Generation Tools. ACM Transactions on Interactive Intelligent Systems
  34. Why online communities? Yixuan Zhang, Nurul Suhaimi, Nutchanon Yongsatianchot, Joseph D Gaggiano, Miso Kim, Shivani A Patel, Yifan Sun, Stacy Marsella, Jacqueline Griffin, and Andrea G Parker. 2022. Shifting Trust: Examining How Trust and Distrust Emerge, Transform, and Collapse in COVID-19 Information Seeking. In Proceedings of the 2022 CHI Conference on Human Factors in Computing Systems (CHI '22). Association for Computing Machinery, New York, NY, USA, Article 78, 1–21. https://doi.org/10.1145/3491102.3501889 “Trust is shaped by people’s information-seeking and assessment practices through emerging information platforms.” (Zhang et al., 2022)
  35. Research Questions: How do online communities shape developers' trust in AI code generation tools? How can we design to facilitate trust building in AI using affordances of online communities? Semi-structured interview: 17 developer community participants, recruited for a mix of tools and platforms; the role of online community in: expectations on AI, use cases of AI, vulnerable situations w/ AI, ... Design probe: develop mockup prototypes; 11 developers think out loud about prototypes and brainstorm new features.
  36. Pathway #1: Community offers evaluation on AI suggestions. “The code has been posted by other programmers, people voted on it... If others have used the solution and it worked, it gives you a little more faith.” When unsure about AI suggestions, users go to online communities for evaluation. Code solutions in online communities are deemed trustworthy because of: • Transparent source • Explicit evaluation & triangulation • Credibility from identity signals
  37. Pathway #2: Users learn from others’ experience with AI. “I read a bunch of what people think of the outcome... [It] helps me make my own perception of whether it is something that is useful for me or not. If everyone has a bad experience in the use cases that I care about, I won't trust it at all. Otherwise, I can know where to be careful and what to avoid in the future.” Engagement with specific experience shared by others helps users develop: • Reasonable expectations of AI capability • Strategies for when to trust AI • Empirical understanding of suggestions • Awareness of broader implications of AI-generated code
  38. Challenges in effectively using online communities. “I once saw an interesting Copilot suggestion and want to try it myself. But I couldn’t get it even with the same prompt. I don’t know what their setup is.” Despite the benefits of sharing specific experience, user sharing lacks: • Project context & replicability • Effective description of interaction with AI • Diversity and relevance
  39. The extended MATCH model with communities: online communities support community sensemaking and collective heuristics. Design #1: Community evaluation signals. Design #2: Community curated experience.
  40. Copilot Community Analytics mockup: “578 code snippets similar to this have been suggested to users in your organization.” 52% accepted w/o editing, 12% rejected directly, 36% made edits. Actions: See similar suggestions in Copilot Community; Search code snippet in:
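As an aside, the usage breakdown shown in this mockup could be derived from simple suggestion telemetry. A minimal sketch (the event format, outcome labels, and counts are hypothetical illustrations, not the actual Copilot pipeline):

```python
from collections import Counter

def community_stats(events):
    """Aggregate per-suggestion outcomes into the percentage
    breakdown a community analytics panel might display."""
    counts = Counter(e["outcome"] for e in events)
    total = sum(counts.values())
    return {k: round(100 * v / total) for k, v in counts.items()}

# Hypothetical log of 578 suggestion events for similar snippets
events = (
    [{"outcome": "accepted"}] * 301   # accepted w/o editing
    + [{"outcome": "rejected"}] * 69  # rejected directly
    + [{"outcome": "edited"}] * 208   # accepted after edits
)
print(community_stats(events))  # -> {'accepted': 52, 'rejected': 12, 'edited': 36}
```

With these illustrative counts the sketch reproduces the 52/12/36 split from the slide; a real system would also need the "similar snippet" matching that groups events together.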
  41. Copilot Community Analytics mockup (community usage statistics): “578 code snippets similar to this have been suggested to users in your organization.” 52% accepted w/o editing, 12% rejected directly, 36% made edits. Actions: See similar suggestions in Copilot Community; Search code snippet in:
  42. Copilot Community Analytics mockup (community voting): “578 code snippets similar to this have been suggested to users in your organization.” 52% accepted w/o editing, 12% rejected directly, 36% made edits. Actions: See similar suggestions in Copilot Community; Search code snippet in:
  43. Copilot Community Analytics mockup (identity/reputation signals): “578 code snippets similar to this have been suggested to users in your organization.” 52% accepted w/o editing, 12% rejected directly, 36% made edits. Actions: See user sharings in Community; Search code snippet in:
  44. Design Probes: User Feedback. Community statistics: helpful, objective metrics for users to decide how much to trust an AI suggestion. Need more scaffolds for interpreting the numbers, e.g., user intention and rationales. Design 1: Introducing community evaluation signals to the AI code generation experience
  45. User Voting: Proactive way to indicate feedback Want to see

    the outcome of voting (e.g., customization) reflected in future AI suggestions Design Probes: User Feedback Design 1: Introducing community evaluation signals to the AI code generation experience
  46. Identity Signals: Helpful to further interpret the statistics Want more

    relevance, e.g., expertise in specific tasks Need transparency on what data is collected and how the data will be used Design Probes: User Feedback Design 1: Introducing community evaluation signals to the AI code generation experience
  47. Overall: Popup window can be distracting Need more seamless integration

    into programming workflow, e.g., preview and summary Design Probes: User Feedback Design 1: Introducing community evaluation signals to the AI code generation experience
  48. COPILOT COMMUNITY: Similar Experiences. “See how others in your organization interact with Copilot when getting similar suggestions as the one you got just now.” Copilot Auto Recording. The editor in the mockup shows a gulpfile:

    'use strict';

    // Increase max listeners for event emitters
    require('events').EventEmitter.defaultMaxListeners = 100;

    const gulp = require('gulp');
    const util = require('./build/lib/util');
    const path = require('path');
    const compilation = require('./build/lib/compilation');

    // Fast compile for development time
    gulp.task('clean-client', util.rimraf('out'));
    gulp.task('compile-client', ['clean-client'], compilation.compileTask('out', false));
    gulp.task('watch-client', ['clean-client'], compilation.watchTask('out', false));

    // Full compile, including nls and inline sources in sourcemaps, for build
    gulp.task('clean-client-build', util.rimraf('out-build'));
    gulp.task('compile-client-build', ['clean-client-build'], compilation.compileTask('out-build', true));
    gulp.task('watch-client-build', ['clean-client-build'], compilation.watchTask('out-build', true));
  49. The same gulpfile editor mockup, now with a sharing panel. Post Interaction Snippet to Copilot Community: Add title; Add comments; Tags: JavaScript; Add tags; Edit Snippet; Video (current length: 30s); Allow link to GitHub Project; Save as private; Share outside your organization; Copilot Auto Recording; Post.
  50. Copilot Community Discovery: “See how others in your organization interact with Copilot when getting similar suggestions as the ones you have gotten.” Feed of posts: “Interesting suggestion by Copilot in TS” (156 views, 1 hour ago), “My review of Copilot for Ruby” (97 views, 3 hours ago), “Tricks to prompt Copilot”, “Using Copilot to implement a Web App”, “Tut… Cop…”. Sections: Similar to Your Experiences; Search; New; Top: Forked, Language, Sentiment, Topics; My likes | My GitHub.
  51. IDE side panel: Expansion on the community statistics Time consuming

    to watch the videos within a programming session Need a more efficient way to present AI interaction, e.g., code snippets linked to project Assurance of confidentiality in sharing Design Probes: User Feedback Design 2: Developer community dedicated to specific experience with the AI code generation tool
  52. External community: great for discovery and learning outside the programming workflow. Need more enriched content than a screen recording video, e.g., voice over, text-based tutorial. More lightweight options for replication. Design Probes: User Feedback. Design 2: Developer community dedicated to specific experience with the AI code generation tool
  53. Design Recommendations. Dedicated user communities can help developers understand, adopt, and develop appropriate trust in code generation AI. The user community should offer: • Scaffolds to share specific, authentic experience with AI • Integration into users' workflow • Assistance to effectively utilize community content • Assurance for privacy and confidentiality
  54. #1 AI for the entire software lifecycle GitHub Copilot was

    focused on code editing within the IDE. Software creation is more than writing code. Huge opportunity to apply AI to the entire software lifecycle, including modeling of software. The ultimate “shift left”? (AI for SE)
  55. #2 Help people build AI-powered software. Future software will be AI-powered (“AIware”). How can we model, build, test, and deploy these AIware systems in a scalable and disciplined way? Important to avoid “AI debt”. How can we model the architecture of AIware systems? Explainability, validation, and verification of AIware systems. (SE for AI)
  56. #3 Provide great human-AI interaction. Important to figure out and model how humans will interact with AI systems. Design an experience that makes the interaction seamless. Consider HCI from the beginning. Build systems that adapt and respond dynamically to user preferences.
  57. #4 Leverage AI for software science Huge potential for AI

    to be used in research design, data analysis. Great brainstorming partner. But keep in mind: AI isn't perfect, so people need to vet suggestions. Role of research is changing given the rapid speed of innovation. The output and artifacts of the scientific process are changing.
  58. Can GPT-4 Summarize Papers as Cartoons? Yes! :-) Can GPT-4

    Replicate Empirical Software Engineering Research? Jenny T. Liang, Carmen Badea, Christian Bird, Robert DeLine, Denae Ford, Nicole Forsgren, Thomas Zimmermann PACMSE (FSE) 2024. AI-generated images may be incorrect. None of the authors wore a lab coat during this research. :-)
  59. #5 Apply AI in a responsible way. How do we design and build software systems using AI in a responsible, ethical way that users can trust and that does not negatively affect society? What mechanisms and regulations do we need to oversee AI systems? How can we model and verify AI governance and compliance? What about societal impacts, ethical considerations, and human factors?