

Trust No Bot? Forging Confidence in AI for Software Engineering

Keynote at FORGE 2025, Ottawa, Canada.
https://conf.researchr.org/program/forge-2025

The truth is out there… and so is the AI revolution. Foundation models and AI-driven tools are transforming software engineering, offering unprecedented efficiencies while introducing new uncertainties. As developers, we find ourselves in uncharted territory: these tools promise to accelerate productivity and reshape our workflows, but can we really trust them? Like any good investigator, we must question the systems we rely on. Are AI-based tools reliable, transparent, and aligned with developer needs? Or are they inscrutable black boxes with hidden risks? Trust isn’t just a nice-to-have—it’s the key factor determining whether AI integration succeeds or spirals into skepticism. In this keynote, I will uncover the evolving role of AI in software engineering and explore how we can build, measure, and foster trust in these tools. I will also reveal why the FORGE community is uniquely positioned to lead this charge, ensuring that AI becomes a trusted partner—not an unsolved mystery. After all, when it comes to AI in software development… should we trust no bot? (This abstract came to life with a little help from ChatGPT and a lot of love for THE X-FILES.)

Thomas Zimmermann

April 27, 2025


Transcript

  1. The X-Files (1993–2002): A sci-fi TV series about FBI agents

    investigating unexplained phenomena, conspiracies, and hidden truths. Mulder and Scully: Partners with opposing worldviews—Mulder believes, Scully demands evidence. Core Themes: Trust vs. skepticism; Hidden systems shaping reality; The search for truth under uncertainty Like Mulder and Scully, today’s developers must question, investigate, and demand transparency from AI systems. "The truth is out there…" — but in AI, we have to work harder to find it.
  2. Christian Bird, Denae Ford, Thomas Zimmermann, Nicole Forsgren, Eirini Kalliamvakou,

    Travis Lowdermilk, and Idan Gazit. ACM Queue, Volume 20, Issue 6, November/December 2022, pp 35–57 Developers reported spending less time on Stack Overflow due to Copilot’s code suggestions. Developers' roles shifted from primarily writing code to reviewing and understanding code suggested by AI. Copilot opened new learning opportunities like mastering new programming languages. Developers' trust plays a crucial role in adoption, as any unexpected behavior can significantly impact its usage.
  3. Trust matters for tool adoption and tool usage So many

    tools... yet so few in use! Meanwhile, tools continue to emerge and evolve from traditional to AI-assisted tools Lack of trust in a tool can lead to suboptimal use and poor outcomes
  4. Designing AI systems for responsible trust is important “Overreliance on

    AI occurs when users start accepting incorrect AI outputs. This can lead to issues and errors… An important goal of AI system design is to empower users to develop appropriate reliance on AI. ” (Passi and Vorvoreanu, 2022)
  5. Developers’ calibrated trust in AI is a prerequisite for their safe and effective use of AI tools. Lack of trust hinders adoption. Blind trust leads to overlooking mistakes.
  6. “The code is watching you” Ghost in the Machine (S1E7)

    – On Halloween, Mulder and Scully investigate the death of a corporate executive who may have been murdered by a thinking computer. AI is built for efficiency—but who ensures its safety and ethics? We must embed responsibility into AI systems from design to deployment. What happens when software outgrows its creators?
  7. “Code with a mind of its own“ Kill Switch (S5E11)

    – The brutal murder of a renowned computer programmer leads Mulder and Scully to investigate an artificial intelligence program loose on the Internet that has begun evolving on its own. How do we align powerful models with human intent? Alignment is trust in action. How do you reason with a system that no longer needs you?
  8. “Trust is the first casualty of secrecy“ E.B.E. (S1E17) –

    Mulder and Scully become the focus of a disinformation campaign when they attempt to trace the government's secret transport of an alien life form. Black-box AI models obscure decision-making processes. Developers need mechanisms for AI transparency and explainability to build trust. If you can't see how it works, can you trust it?
  9. “Programming the human mind“ Wetwired (S3E23) – Mulder and Scully

    investigate a series of murders linked to a mysterious device that alters television signals causing paranoid hallucinations. One of them falls victim to it. AI outputs aren’t just answers—they can manipulate and shape decisions. Trust demands awareness. When AI subtly shifts your thinking, is it still your choice?
  10. “Evidence is buried deeper than data“ The Erlenmeyer Flask (S1E24)

    – While on the trail of a killer with superhuman powers, Mulder and Scully discover a government conspiracy: Secret experiments splice alien and human DNA, hidden from public scrutiny and ethical oversight. What assumptions were coded into your AI? Training data isn’t neutral. Transparency in origin matters. What’s your model hiding in its “DNA”?
  11. What factors influence developers' trust in software tools? Brittany Johnson,

    Christian Bird, Denae Ford, Nicole Forsgren, Thomas Zimmermann: Make Your Tools Sparkle with Trust: The PICSE Framework for Trust in Software Tools. ICSE-SEIP 2023: 409-419
  12. What factors influence developers’ trust in software tools? Interviews with 18 practitioners (define and discuss trust in tools and collaborators) → Transcribe → Code (codebook v1) → Thematic Analysis → PICSE → Validate: survey with 300+ practitioners.
  13. What factors influence developers’ trust in software tools? The PICSE framework:
    Personal: intrinsic, extrinsic, and social factors (Community; Source reputation; Clear advantages)
    Interaction: factors related to engagement with the tool (Validation support; Feedback loops; Educational value)
    Control: factors related to control over usage (Ownership; Autonomy; Workflow integration)
    System: properties of the tool before and during use (Ease of installation; Polished presentation; Safe and secure; Correctness; Consistency; Performance; Transparent data practices)
    Expectations: meeting the expectations developers built (Style matching; Goal matching; Meeting expectations)
  15. Personal (intrinsic, extrinsic, and social factors). Community: there's an accessible community of developers that use the tool. "That's probably recommended because over the community that's how it's preferable. Then you're leaning towards more into the more community-wide practices." - Software Dev Engineer Lead. "Even if I trust the brand, nobody else is on there... I wouldn't download the app, the social media. If there is no network, why would I use it?" - Software Engineer
  16. Personal (intrinsic, extrinsic, and social factors). Source reputation: the reputation of or familiarity with the individual, organization, or platform associated with introduction to the tool. "If a person that I personally trust a lot, for example, a coworker that I work closely with and that I have a lot of respect to, then, of course, that also carries weight." - Software Engineer. "I definitely get more excited about a Microsoft tool or product as opposed to a Google product or an Amazon product." - Senior Software Engineer
  17. Personal (intrinsic, extrinsic, and social factors). Clear advantages: the ability to see the benefits of using the tool, typically from use and validation by others. "When I tuned into that, it was a combination of seeing that and seeing how powerful it was and how easy it was... What am I doing? This looks great." - Systems Engineer. "...while I'm in that car and AI is doing the right thing, I'll see, it actually stopped the right car. It actually identified that someone crossing the road and all those small nitpick details. Then that trust will build up and I can rely on AI okay." - Software Dev Engineer Lead
  18. Source reputation (especially relevant for adoption) Introduce tools via trusted

    sources Requires knowledge of network (perhaps best for internal tools) Clear advantages (especially relevant for adoption) Provide tool demos and comparisons Create forums for showcasing new tools (particularly internally) Community (before and during use) Build and foster community around tool Make visible and accessible (common on GitHub) Using PICSE for building trust
  19. Are there differences between trust in traditional tools and AI-assisted tools? Generally, similar priorities for both: consistency and meeting expectations are important, while interaction factors are generally less important. However, for AI-assisted tools, developers prioritize validation support, autonomy, and source reputation (who built it), and deprioritize factors like goal matching and style matching. Our study found several similarities between developer trust in AI-assisted tools and trust in their collaborators!
  20. ICSE 2025 Research Track Thu 1 May 2025 15:15 -

    15:30 at 207 - Human and Social using AI 1
  21. How can we design for trust in AI-powered code generation

    tools? Ruotong Wang, Ruijia Cheng, Denae Ford, Thomas Zimmermann: Investigating and Designing for Trust in AI-powered Code Generation Tools. FAccT 2024
  22. The MATCH model for responsible trust Designing for Responsible Trust

    in AI Systems: A Communication Perspective, Q. Vera Liao & S. Shyam Sundar, FAccT 2022 Trustworthiness of AI systems can be communicated via system design. (Liao and Sundar, FAccT 2022)
  23. The MATCH model for responsible trust Designing for Responsible Trust

    in AI Systems: A Communication Perspective, Q. Vera Liao & S. Shyam Sundar, FAccT 2022 A trustworthiness cue is any information within a system that can cue, or contribute to, users’ trust judgements. Trust affordances are displayed properties of a system that engender trustworthiness cues. Trust heuristics are any rules of thumb applied by a user to associate a given cue with a judgment of trustworthiness. (Liao and Sundar, 2022)
  24. Research Questions What do developers need to build appropriate trust

    with AI-code generation tools? What challenges do developers face in the current trust-building process? How can we design UX enhancements to support users building appropriate trust? Study 1: Experience sampling + Debrief interview Study 2: Design probe + Interviews Understand notions of trust Explore potential design solutions
  25. Study 1: Experience sampling + Debrief interview. Procedures: a week of collecting significant moments of using Copilot via screenshots and short descriptions, prompted through Microsoft Teams when you are appreciative of, frustrated by, or hesitant/uncertain to use the code generation tool. Participants: randomly sampled 1,500 internal developers + interns Teams channel; 17 participants with various levels of programming experience and experience with AI-powered code generation tools. Example of an experience entry.
  26. Finding 1: Developers’ information needs in building appropriate trust. Developers need to build reasonable expectations of the AI tool’s ability and risks to build appropriate trust • What benefits to expect when collaborating with AI • What use cases to use AI for • What the security and privacy implications of using AI are. “It comes back to learning what Copilot is suited for versus not suited for, just building the intuition. Once you have that intuition, you don’t put Copilot into positions where you know it will fail...” (P13)
  27. Developers want information about to what extent and in what

    way they can control and set preference for... • What the AI produces • When and how AI steps in • What code context AI uses “I don’t want Copilot to give me anything unless I type trigger.... It’s too much. It started as a co-pilot, but now it’s the pilot and I’m becoming the co-pilot.” (P8) Finding 1: Developers’ information needs in building appropriate trust
  28. The evaluation of AI suggestions in each specific instance forms the basis of developers’ trust perception of AI code generation tools. • How good the suggestion is • Why the suggestions are made. Strategies to make sure that “the code is actually correct”: • Logically go through the problem • Validate by running the code • Write formal tests. Finding 1: Developers’ information needs in building appropriate trust
  29. Expectation of AI’s ability and risks • What benefits to expect when collaborating with AI • What use cases to use AI for • What security and privacy implications the AI brings. Ways to control AI • What the AI produces • When and how AI steps in • What code context AI uses. Quality and reasons of AI suggestions • How good the suggestion is • Why the suggestions are made. Finding 1: Developers’ information needs in building appropriate trust
  30. Finding 2: Challenges developers face in building appropriate trust. Setting proper expectations • Bias from initial experience and experience with similar tools • “It takes three good recommendations to build trust versus one bad recommendation to lose trust.” (P5) Controlling AI tools • Lack of guidance to harness AI • “I felt like a lot of the time I ended up just fighting it.” (P7) Inadequate support for evaluating individual AI suggestions • Lack of debugging support and cognitive load of reviewing • “The code reviews cost you more than actually writing the code.” (P8)
  31. Study 2: Design probe + Interviews. Procedures: using three design probes, interview developers about affordances and trustworthiness cues that support building appropriate trust. 1. Control mechanisms to set preferences 2. Explanation of suggestions 3. Feedback analytics. Participants: 12 internal and external developers with varied experience in code generation tools, work experience, and roles in teams.
  32. Design recommendations for tool builders. Empower users to build appropriate expectations by • Communicating the use cases and the potential risks and benefits of the system • Designing for evolving trust. Offer affordances and guidance in customizing the system. Provide signals for assessing the quality of code suggestions.
  33. How do online communities affect developers’ trust in AI-powered tools?

    Ruijia Cheng, Ruotong Wang, Thomas Zimmermann, Denae Ford: “It would work for me too”: How Online Communities Shape Software Developers' Trust in AI-Powered Code Generation Tools. ACM Transactions on Interactive Intelligent Systems
  34. Why online communities? Yixuan Zhang, Nurul Suhaimi, Nutchanon Yongsatianchot, Joseph D Gaggiano, Miso Kim, Shivani A Patel, Yifan Sun, Stacy Marsella, Jacqueline Griffin, and Andrea G Parker. 2022. Shifting Trust: Examining How Trust and Distrust Emerge, Transform, and Collapse in COVID-19 Information Seeking. In Proceedings of the 2022 CHI Conference on Human Factors in Computing Systems (CHI '22). Association for Computing Machinery, New York, NY, USA, Article 78, 1–21. https://doi.org/10.1145/3491102.3501889 “Trust is shaped by people’s information-seeking and assessment practices through emerging information platforms.” (Zhang et al., 2022)
  35. Research Questions: How do online communities shape developers' trust in AI code generation tools? How can we design to facilitate trust building in AI using affordances of online communities? Semi-structured interview: 17 developer community participants, recruited for a mix of tools and platforms; the role of online community in: expectations on AI, use cases of AI, vulnerable situations w/ AI, ... Design probe: develop mockup prototypes; 11 developers think out loud about prototypes and brainstorm new features.
  36. Pathway #1: Community offers evaluation on AI suggestions. “The code has been posted by other programmers, people voted on it... If others have used the solution and it worked, it gives you a little more faith.” When unsure about AI suggestions, users go to online communities for evaluation. Code solutions in online communities are deemed trustworthy because of: • Transparent source • Explicit evaluation & triangulation • Credibility from identity signals
  37. Pathway #2: Users learn from others’ experience with AI. “I read a bunch of what people think of the outcome... [It] helps me make my own perception of whether it is something that is useful for me or not. If everyone has a bad experience in the use cases that I care about, I won't trust it at all. Otherwise, I can know where to be careful and what to avoid in the future.” Engagement with specific experience shared by others helps users develop: • Reasonable expectations of AI capability • Strategies for when to trust AI • Empirical understanding of suggestions • Awareness of broader implications of AI-generated code
  38. Challenges in effectively using online communities. “I once saw an interesting Copilot suggestion and want to try it myself. But I couldn’t get it even with the same prompt. I don’t know what their setup is.” Despite the benefits of sharing specific experience, user sharing lacks: • Project context & replicability • Effective description of interaction with AI • Diversity and relevance
  39. The extended MATCH model with communities: online communities support community sensemaking and collective heuristics. Design #1: Community evaluation signals. Design #2: Community curated experience.
  40. Copilot Community Analytics mockup: “578 code snippets similar to this have been suggested to users in your organization.” 52% accepted w/o editing, 12% rejected directly, 36% made edits. Actions: See similar suggestions in Copilot Community; Search code snippet in:
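As an aside, the usage breakdown shown in this mockup could be derived from simple suggestion telemetry. A minimal sketch (the event format, outcome labels, and counts are hypothetical illustrations, not the actual Copilot pipeline):

```python
from collections import Counter

def community_stats(events):
    """Aggregate per-suggestion outcomes into the percentage
    breakdown a community analytics panel might display."""
    counts = Counter(e["outcome"] for e in events)
    total = sum(counts.values())
    return {k: round(100 * v / total) for k, v in counts.items()}

# Hypothetical log of 578 suggestion events for similar snippets
events = (
    [{"outcome": "accepted"}] * 301   # accepted w/o editing
    + [{"outcome": "rejected"}] * 69  # rejected directly
    + [{"outcome": "edited"}] * 208   # accepted after edits
)
print(community_stats(events))  # -> {'accepted': 52, 'rejected': 12, 'edited': 36}
```

With these illustrative counts the sketch reproduces the 52/12/36 split from the slide; a real system would also need the "similar snippet" matching that groups events together.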
  41. Copilot Community Analytics mockup (community usage statistics): “578 code snippets similar to this have been suggested to users in your organization.” 52% accepted w/o editing, 12% rejected directly, 36% made edits. Actions: See similar suggestions in Copilot Community; Search code snippet in:
  42. Copilot Community Analytics mockup (community voting): “578 code snippets similar to this have been suggested to users in your organization.” 52% accepted w/o editing, 12% rejected directly, 36% made edits. Actions: See similar suggestions in Copilot Community; Search code snippet in:
  43. Copilot Community Analytics mockup (identity/reputation signals): “578 code snippets similar to this have been suggested to users in your organization.” 52% accepted w/o editing, 12% rejected directly, 36% made edits. Actions: See user sharings in Community; Search code snippet in:
  44. Design Probes: User Feedback. Community statistics: helpful, objective metrics for users to decide how much to trust an AI suggestion. Need more scaffolds for interpreting the numbers, e.g., user intention and rationales. Design 1: Introducing community evaluation signals to the AI code generation experience
  45. User Voting: Proactive way to indicate feedback Want to see

    the outcome of voting (e.g., customization) reflected in future AI suggestions Design Probes: User Feedback Design 1: Introducing community evaluation signals to the AI code generation experience
  46. Identity Signals: Helpful to further interpret the statistics Want more

    relevance, e.g., expertise in specific tasks Need transparency on what data is collected and how the data will be used Design Probes: User Feedback Design 1: Introducing community evaluation signals to the AI code generation experience
  47. Overall: Popup window can be distracting Need more seamless integration

    into programming workflow, e.g., preview and summary Design Probes: User Feedback Design 1: Introducing community evaluation signals to the AI code generation experience
  48. COPILOT COMMUNITY: Similar Experiences. “See how others in your organization interact with Copilot when getting similar suggestions as the one you got just now.” Copilot Auto Recording. The editor in the mockup shows a gulpfile:

    'use strict';

    // Increase max listeners for event emitters
    require('events').EventEmitter.defaultMaxListeners = 100;

    const gulp = require('gulp');
    const util = require('./build/lib/util');
    const path = require('path');
    const compilation = require('./build/lib/compilation');

    // Fast compile for development time
    gulp.task('clean-client', util.rimraf('out'));
    gulp.task('compile-client', ['clean-client'], compilation.compileTask('out', false));
    gulp.task('watch-client', ['clean-client'], compilation.watchTask('out', false));

    // Full compile, including nls and inline sources in sourcemaps, for build
    gulp.task('clean-client-build', util.rimraf('out-build'));
    gulp.task('compile-client-build', ['clean-client-build'], compilation.compileTask('out-build', true));
    gulp.task('watch-client-build', ['clean-client-build'], compilation.watchTask('out-build', true));
  49. The same gulpfile editor mockup, now with a sharing panel. Post Interaction Snippet to Copilot Community: Add title; Add comments; Tags: JavaScript; Add tags; Edit Snippet; Video (current length: 30s); Allow link to GitHub Project; Save as private; Share outside your organization; Copilot Auto Recording; Post.
  50. Copilot Community Discovery: “See how others in your organization interact with Copilot when getting similar suggestions as the ones you have gotten.” Feed of posts: “Interesting suggestion by Copilot in TS” (156 views, 1 hour ago), “My review of Copilot for Ruby” (97 views, 3 hours ago), “Tricks to prompt Copilot”, “Using Copilot to implement a Web App”, “Tut… Cop…”. Sections: Similar to Your Experiences; Search; New; Top: Forked, Language, Sentiment, Topics; My likes | My GitHub.
  51. IDE side panel: Expansion on the community statistics Time consuming

    to watch the videos within a programming session Need a more efficient way to present AI interaction, e.g., code snippets linked to project Assurance of confidentiality in sharing Design Probes: User Feedback Design 2: Developer community dedicated to specific experience with the AI code generation tool
  52. External community: great for discovery and learning outside the programming workflow. Need more enriched content than a screen recording video, e.g., voice over, text-based tutorial. More lightweight options for replication. Design Probes: User Feedback. Design 2: Developer community dedicated to specific experience with the AI code generation tool
  53. Design Recommendations. Dedicated user communities can help developers understand, adopt, and develop appropriate trust in code generation AI. The user community should offer: • Scaffolds to share specific, authentic experience with AI • Integration into users' workflow • Assistance to effectively utilize community content • Assurance for privacy and confidentiality
  54. #1 AI for the entire software lifecycle GitHub Copilot was

    focused on code editing within the IDE. Software creation is more than writing code. Huge opportunity to apply AI to the entire software lifecycle, including modeling of software. The ultimate “shift left”? (AI for SE)
  55. #2 Help people build AI-powered software. Future software will be AI-powered (“AIware”). How can we model, build, test, and deploy these AIware systems in a scalable and disciplined way? Important to avoid “AI debt”. How can we model the architecture of AIware systems? Explainability, validation, and verification of AIware systems. (SE for AI)
  56. #3 Provide great human-AI interaction. Important to figure out and model how humans will interact with AI systems. Design an experience that makes the interaction seamless. Consider HCI from the beginning. Build systems that adapt and respond dynamically to user preferences.
  57. #4 Leverage AI for software science Huge potential for AI

    to be used in research design, data analysis. Great brainstorming partner. But keep in mind: AI isn't perfect, so people need to vet suggestions. Role of research is changing given the rapid speed of innovation. The output and artifacts of the scientific process are changing.
  58. Can GPT-4 Summarize Papers as Cartoons? Yes! :-) Can GPT-4

    Replicate Empirical Software Engineering Research? Jenny T. Liang, Carmen Badea, Christian Bird, Robert DeLine, Denae Ford, Nicole Forsgren, Thomas Zimmermann PACMSE (FSE) 2024. AI-generated images may be incorrect. None of the authors wore a lab coat during this research. :-)
  59. #5 Apply AI in a responsible way. How do we design and build software systems using AI in a responsible, ethical way that users can trust and that does not negatively affect society? What mechanisms and regulations do we need to oversee AI systems? How can we model and verify AI governance and compliance? What about societal impacts, ethical considerations, and human factors?