Measuring GitHub Copilot's Impact on Productivity

DOI: 10.1145/3633453

Case study asks Copilot users about its impact on their productivity, and seeks to find their perceptions mirrored in user data.

BY ALBERT ZIEGLER, EIRINI KALLIAMVAKOU, X. ALICE LI, ANDREW RICE, DEVON RIFKIN, SHAWN SIMISTER, GANESH SITTAMPALAM, AND EDWARD AFTANDILIAN

Key insights:
- AI pair-programming tools such as GitHub Copilot have a big impact on developer productivity. This holds for developers of all skill levels, with junior developers seeing the largest gains.
- The reported benefits of receiving AI suggestions while coding span the full range of typically investigated aspects of productivity, such as task time, product quality, cognitive load, enjoyment, and learning.
- Perceived productivity gains are reflected in objective measurements of developer activity.
- While suggestion correctness is important, the driving factor for these improvements appears to be not correctness as such, but whether the suggestions are useful as a starting point for further development.

Code-completion systems offering suggestions to a developer in their integrated development environment (IDE) have become the most frequently used kind of programmer assistance [1]. When generating whole snippets of code, they typically use a large language model (LLM) to predict what the user might type next (the completion) from the context of what they are working on at the moment (the prompt) [2]. This allows for completions at any position in the code, often spanning multiple lines at once; a minimal sketch of this prompt-to-completion loop appears at the end of this section.

The potential benefits of generating large sections of code automatically are huge, but evaluating these systems is challenging. Offline evaluation, where the system is shown a partial snippet of code and then asked to complete it, is difficult, not least because for longer completions there are many acceptable alternatives and no straightforward mechanism for labeling them automatically [5]. An additional step taken by some researchers [3,21,29] is to use online evaluation and track the frequency of real users accepting suggestions, assuming that the more contributions a system makes to the developer's code, the higher its benefit. The validity of this assumption is not obvious when considering issues such as whether two short completions are more valuable than one long one, or whether reviewing suggestions can be detrimental to programming flow. Both styles of evaluation are sketched at the end of this section.

Code completion in IDEs using language models was first proposed by Hindle et al. [9], and today neural synthesis tools such as GitHub Copilot, CodeWhisperer, and TabNine suggest code snippets within an IDE with the explicitly stated intention of increasing a user's productivity. Developer productivity has many aspects, and a recent study has shown that tools like these are helpful in ways that are only partially reflected by measures such as completion times for standardized tasks [23].(a) Alternatively, we can leverage the developers themselves as expert assessors of their own productivity. This meshes well with current thinking in software engineering research, which suggests measuring productivity on multiple dimensions and using self-reported data [6]. Thus, we focus on studying perceived productivity.

Here, we investigate whether usage measurements of developer interactions with GitHub Copilot can predict perceived productivity as reported by developers. We analyze 2,631 survey responses from developers using GitHub Copilot and match them to usage measurements collected from the IDE.

(a) Nevertheless, such completion times are greatly reduced in many settings, often by more than half [16].
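To make the prompt-and-completion terminology concrete, the following is a minimal sketch of the loop described above. The function names and the stubbed `query_model` are illustrative assumptions, not GitHub Copilot's actual interface.

```python
# Minimal sketch of an LLM-backed completion loop. `query_model` is a
# hypothetical stand-in for a code-generation model, not Copilot's API.

def build_prompt(file_text: str, cursor: int, window: int = 2048) -> str:
    """Use the code immediately preceding the cursor as the model context."""
    return file_text[max(0, cursor - window):cursor]

def query_model(prompt: str) -> str:
    # Placeholder: a real system would call an LLM service here.
    return "    return a + b\n"

def suggest_completion(file_text: str, cursor: int) -> str:
    prompt = build_prompt(file_text, cursor)
    return query_model(prompt)

source = "def add(a, b):\n"
print(suggest_completion(source, cursor=len(source)), end="")
```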
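The offline-evaluation problem noted above can be seen in miniature: scoring against a single reference completion rejects alternatives a developer would happily accept. The reference and candidates here are invented for illustration.

```python
# Why exact-match offline evaluation undercounts: both candidates compute
# the mean of xs, but only one matches the reference string.

reference = "return sum(xs) / len(xs)"
candidates = [
    "return sum(xs) / len(xs)",                     # literal match
    "total = sum(xs)\nreturn total / len(xs)",      # acceptable, no match
]

for cand in candidates:
    print(cand == reference, repr(cand))
```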
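The online metric, the frequency with which shown suggestions are accepted, reduces to a simple ratio. The event records below are a hypothetical format for illustration, not the study's telemetry.

```python
# Acceptance rate: the share of shown suggestions that the user accepts.
# Event records are a hypothetical illustration.

events = [
    {"suggestion": 1, "accepted": True},
    {"suggestion": 2, "accepted": False},  # shown but rejected
    {"suggestion": 3, "accepted": True},
]

shown = len(events)
accepted = sum(e["accepted"] for e in events)
print(f"acceptance rate: {accepted / shown:.2f}")  # 2 of 3 -> 0.67
```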
[Illustration by Justin Metz. Sidebar links: GitHub Copilot (Comm. ACM 67(3), https://doi.org/10.1145/3633453); GitHub Next / Copilot Labs (https://githubnext.com/)]