Arfon Smith
March 13, 2025

Why Generative AI makes collaborative, versioned science more important than ever

Generative AI is transforming research, and to ensure this transformation leads to the anticipated benefits—more science, faster—the need for Collaborative, Versioned Science (CVS) has never been more urgent. In this presentation, I explore how open, modular, and version-controlled scientific workflows enhance transparency, reproducibility, and inclusivity in research. By leveraging open-source methodologies, automated reproducibility checks, and AI-assisted peer review, CVS provides a framework to uphold the integrity and credibility of scientific outputs in an AI-driven landscape.

I also examine the challenges posed by generative AI—such as hallucinations, biases, and the rapid acceleration of research production—and argue that structured, version-controlled collaboration is essential to maintaining scientific rigor.


Transcript

  1. Arfon Smith / 17 December 2024 / Schmidt Sciences
    Why Generative AI makes Collaborative, Versioned Science more important than ever
  2. Plan for today
    1. Defining Collaborative, Versioned Science: What it is, what its core qualities are, why it matters.
    2. Generative AI is here: Current capabilities, what’s readily possible, where some challenges lie, where we might be heading.
    3. Making predictions, anticipating challenges: Why Collaborative, Versioned Science might be the answer.
  3. Collaborative, Versioned Science
  4. Open Source: Right to modify, not to contribute
    Open source refers to material (often software) released under terms that allow it to be freely shared, used, and modified by anyone. Open source projects often, though not always, also have a highly collaborative development process and are receptive to contributions of code, documentation, discussion, etc. from anyone who shows competent interest.
  5. Collaborative Open Source
    “Open source” way of working
    Modular, composable, reusable: Well-structured open code, with clear rules around reuse.
    Transparency and inclusivity: Clear rules around governance, how decisions are made, roadmaps and project goals.
    Process automation: Automation around testing, communications, and other key activities.
    Documented: How to use, how to contribute, guided tutorials, all in electronic form.
    High-quality review processes: Code review, process updates, testing procedures.
  6. Collaborative, Versioned Science
    Modular, composable, reusable: Well-structured, interoperable research methods and data (protocols, datasets, and pipelines). Open frameworks to accelerate time to scientific value.
    Transparency and inclusivity: Open governance and community-driven science. Transparent decision making, sharing of findings, assigning of credit. Diverse contributors (academic, industry, citizen science).
    Process automation: Automated reproducibility and validation. Automated workflows for data analysis, hypothesis testing, updating research outputs, and ensuring accuracy.
    Documented: Comprehensive and accessible scientific knowledge. Open protocols, how-to guides, and tutorials for reproducing experiments. Step-by-step instructions simplify access.
    High-quality review processes: Rigorous peer review and testing. Community-led, version-controlled review systems ensure that research is thoroughly tested, validated, and reproducible.
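The "automated reproducibility and validation" quality on this slide can be illustrated with the kind of check an automated workflow might run. This is a minimal sketch, not something shown in the talk: `run_analysis`, `fingerprint`, and `is_reproducible` are hypothetical names, and the "analysis" is a toy stand-in for a real pipeline step.

```python
import hashlib
import json
import random


def run_analysis(seed: int) -> dict:
    """Hypothetical stand-in for a research pipeline step (not from the talk)."""
    rng = random.Random(seed)  # a fixed seed makes the run repeatable
    samples = [rng.gauss(0.0, 1.0) for _ in range(1000)]
    mean = sum(samples) / len(samples)
    return {"n": len(samples), "mean": round(mean, 6)}


def fingerprint(result: dict) -> str:
    """Hash a canonical JSON encoding so equal results give equal digests."""
    canonical = json.dumps(result, sort_keys=True).encode("utf-8")
    return hashlib.sha256(canonical).hexdigest()


def is_reproducible(seed: int, recorded_digest: str) -> bool:
    """Re-run the analysis and compare it with the digest stored with the work."""
    return fingerprint(run_analysis(seed)) == recorded_digest


# In practice the digest would be committed alongside the code and data.
recorded = fingerprint(run_analysis(seed=42))
print(is_reproducible(42, recorded))  # same seed, same code
print(is_reproducible(43, recorded))  # a changed input is flagged
```

Storing the digest under version control is what ties the check to a specific revision of the code and data; the hashing scheme itself is an illustrative choice.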
  7. Transparent, documented decisions 1/3
  8. Transparent, documented decisions 2/3
  9. Transparent, documented decisions 3/3
  10. High-quality review processes
  11. Collaborative, reproducible computational workflows
  12. Platform for scalable climate and geoscience analyses
  13. Collaborative catalyst research and evaluation
  14. Transparent, open review process 2/4
  15. Transparent, open review process 2
  16. “reproducible by necessity”
    Fernando Perez (IPython, Jupyter, Berkeley)
    https://web.archive.org/web/20140214000007/http://blog.fperez.org/
  18. Some working assumptions
    Based on the last ~2.5 years actively building with them
    The people around you are likely using it: Whether it’s ChatGPT or GitHub Copilot, these are technologies people are using.
    There is a “there” there: Generative AI can be genuinely useful when applied to the right problems.
    “Moore’s law for LLMs” will continue to hold: Models will likely become more capable, costs will reduce, more will be possible for less.
    People are trying all sorts of crazy things: Just check your favourite tech news site.
    As influencers of the future, we should have opinions: And I’m sharing mine with you today.
  19. Quick survey of the room
    With a show of hands…
    Using generative AI in their daily work? e.g., GitHub Copilot (or other code tool), ChatGPT, Claude, something else.
    Building a system that incorporates LLMs? Building a net-new piece of infrastructure or building something new?
    Exploring capabilities as part of their work? e.g., Evaluating models for existing or future workloads?
    Living with the consequences of LLMs in their work? e.g., A collaborator using generative AI tools.
    Building their own base model? Advanced mode…
  20. What are they capable of?
    Current capabilities being leveraged widely.
    Natural language processing tasks: Text classification, sentiment analysis, named entity recognition, intent detection.
    Synthesizing information: Especially when combined with techniques like Retrieval Augmented Generation (RAG).
    Personalization (ELI5, ELIM, ELIDKCVW*): Customized responses based on individual preferences/background/knowledge.
    Fine tuning for domain-specific tasks or behaviours: Conversational agents, code generation, tool calling.
    Exploring topics, generating ideas: With some caveats, can be excellent tools for brainstorming and learning.
    * Explain Like I Don’t Know C Very Well
  21. What are some challenges?
    Hallucinations: They are always hallucinating; it’s just sometimes they are useful.
    Biases, safety, security: Reflecting, and sometimes amplifying, biases present in training data or fine tuning.
    Inconsistency, non-determinism, and evaluation: Different answers for the same questions due to random sampling and model state.
    Structured outputs: Although many models have now been trained for this specifically.
    Unjustified certainty in outputs: Generate detailed responses without any sense of reality.
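The "different answers for the same questions due to random sampling" point can be made concrete with a toy sampler. This is a minimal sketch under stated assumptions: the token scores are made-up numbers, `sample_token` and `draw` are hypothetical helpers, and real hosted models may still vary for reasons beyond the sampling seed.

```python
import math
import random


def sample_token(logits: dict[str, float], temperature: float, rng: random.Random) -> str:
    """Sample one token from softmax(logits / temperature); temperature 0 means greedy."""
    if temperature == 0:
        return max(logits, key=logits.get)
    scaled = {tok: value / temperature for tok, value in logits.items()}
    peak = max(scaled.values())  # subtract the max for numerical stability
    weights = [math.exp(v - peak) for v in scaled.values()]
    return rng.choices(list(scaled), weights=weights, k=1)[0]


def draw(logits: dict[str, float], seed: int, n: int = 5) -> list[str]:
    """Draw n tokens from one seeded generator."""
    rng = random.Random(seed)
    return [sample_token(logits, 1.0, rng) for _ in range(n)]


# Toy next-token scores for one fixed prompt (made-up numbers).
logits = {"yes": 2.0, "no": 1.5, "maybe": 1.0}

unseeded = random.Random()
print([sample_token(logits, 1.0, unseeded) for _ in range(5)])  # may differ run to run
print([sample_token(logits, 0, unseeded) for _ in range(5)])    # greedy: top-scoring token
print(draw(logits, seed=7) == draw(logits, seed=7))             # fixed seed: repeatable
```

Recording the seed and sampling settings alongside outputs is one way a versioned workflow can make such runs easier to compare.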
  25. Generative AI is here
    Meta-level capabilities
    Amplifying human cognition: Summarizing large amounts of content (e.g., through retrieval-augmented generation), providing a mechanism for rapid ideation through conversational chatbots, retrieving information from a broad search space.
    Accelerating work through assistance: Summarizing content into more digestible forms, generating new representations of information. Supporting analysts generating SQL queries to support business reporting.
    Extending creativity and adapting to individual needs: Content generation (e.g., creative writing, event planning), exploration of new topics, and personalization of outputs specific to the individual needs or preferences of the user.
  26. Generative AI is here
    More specifically (ex. software engineering)
    Amplifying human cognition: Generating new code based on training. Explaining existing code, explaining the call chain, retrieving relevant context across the entire software development lifecycle, and summarising.
    Accelerating work through assistance: Authoring new code, generating documentation, providing a first code review based on business rules that would have otherwise occupied human time.
    Extending creativity and adapting to individual needs: Translating, content generation (e.g., creative writing, event planning), exploration of new topics, and personalization of outputs specific to the individual needs or preferences of the user.
  31. Adoption and perception of AI tools
    Significant adoption in many fields: 200M+ ChatGPT users, 1M+ GitHub Copilot users.
    76% are using or are planning to use AI tools: Up from 70% last year.
    81% cite increased productivity as biggest benefit: Novice developers cite accelerating learning as biggest benefit.
    Trust in output/accuracy of tools mixed: 43% feel positive, 31% skeptical. Learners trust more than experienced developers.
    70% of developers do not perceive AI as a job threat: The share who *do* is marginally higher for learners (15% vs 12%).
    Source: Stack Overflow
  32. Significant value in many fields
    Models, architectures, tools, and platforms maturing
  33. Generative AI is here
    Models are getting more and more capable
    ~5% month-over-month improvements in the last year (on new benchmarks)
    SWE-Bench Verified
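To get a feel for what "~5% month over month" adds up to, here is a small worked calculation. It assumes the figure is a relative gain that compounds each month; the slide does not say whether it is relative or in percentage points, and a bounded benchmark score cannot compound indefinitely.

```python
monthly_gain = 0.05  # the slide's "~5% month over month", read as a relative gain
months = 12

compounded = (1 + monthly_gain) ** months  # multiplicative growth
linear = 1 + monthly_gain * months         # simple sum, for comparison

print(f"compounded: x{compounded:.2f}, linear: x{linear:.2f}")
# prints "compounded: x1.80, linear: x1.60"
```

Under that assumption, a year of such gains is roughly an 80% relative improvement, against 60% if the gains simply added.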
  34. Generative AI is here
    We’re learning how to use them more effectively
    Reflecting on a task with the LLM: Iterative generation and evaluation of outputs with feedback via self-critiques and/or new information (e.g., running tests and inspecting outputs).
    Tool use / function calling: Models fine-tuned to use tools, extracting parameters from interactions and passing them to tools to augment responses (e.g., search the web, execute code, detect objects).
    Planning and reasoning: Ask the model to build a plan/sequence of actions to take to solve the incoming request/task before executing on them.
    Multi-agent collaboration: Prompt model(s) to take on different roles. LLM operates with a degree of autonomy to complete a task, often managing its own internal processes and sub-tasks to achieve a goal.
    https://www.promptingguide.ai/research/llm-agents
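The "reflecting on a task" pattern (generate, evaluate by running tests, feed the result back) can be sketched as a short loop. This is an illustrative sketch only: `reflect`, `toy_generate`, and `toy_evaluate` are hypothetical names, and the toy functions stand in for a real LLM call and a real test suite.

```python
from typing import Callable


def reflect(generate: Callable[[str, str], str],
            evaluate: Callable[[str], str],
            task: str,
            max_rounds: int = 3) -> tuple[str, int]:
    """Generate, evaluate, and feed the critique back until it passes or rounds run out."""
    feedback = ""
    draft = ""
    for round_no in range(1, max_rounds + 1):
        draft = generate(task, feedback)
        feedback = evaluate(draft)  # an empty string means "no problems found"
        if not feedback:
            return draft, round_no
    return draft, max_rounds


# Hypothetical stand-ins: a real system would call an LLM and run a test suite.
def toy_generate(task: str, feedback: str) -> str:
    if "negative" in feedback:
        return "def absolute(x):\n    return -x if x < 0 else x"
    return "def absolute(x):\n    return x"


def toy_evaluate(code: str) -> str:
    namespace: dict = {}
    exec(code, namespace)  # run the candidate code
    if namespace["absolute"](-3) != 3:
        return "fails for negative inputs"
    return ""


final, rounds = reflect(toy_generate, toy_evaluate, "write absolute(x)")
print(rounds)  # the first draft fails the check, the second passes
```

Capping the number of rounds keeps the loop from running forever when the evaluator never passes a draft.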
  35. Generative AI is here
    Plethora of platforms, tools, and technologies
    https://www.sequoiacap.com/article/llm-stack-perspective/
  36. Generative AI (in science) is here
  37. Generic tools are already useful
    Significant value with ‘off the shelf tools’ for many research activities
    Amplifying human cognition: Summarizing large amounts of content (e.g., through retrieval-augmented generation), providing a mechanism for rapid ideation through conversational chatbots, retrieving information from a broad search space.
    Accelerating work through assistance: Summarizing content into more digestible forms, generating new representations of information. Supporting analysts generating SQL queries to support business reporting.
    Extending creativity and adapting to individual needs: Content generation (e.g., creative writing, event planning), exploration of new topics, and personalization of outputs specific to the individual needs or preferences of the user.
  38. Domain-level models*
    AstroLLaMA: Towards Specialized Foundation Models in Astronomy. arXiv:2309.06126
    Fine-tuned LLaMA-2 variant: Tuned on 300,000+ abstracts on the arXiv.
    Improved context-awareness over LLaMA-2 + GPT-4: ‘Completions’ of abstracts show a deep(er) understanding of astronomical concepts.
    Higher-fidelity embeddings: Capable of facilitating better document retrieval and semantic analysis.
    * Typically fine-tuned
  39. Domain-level specializations
    Scientific Large Language Models: A Survey on Biological & Chemical Domains. arXiv:2401.14656
  40. Generative AI (in science) is here
    Scientific process: Inquiring, Hypothesize, Experiment, Evaluate
  41. Generative AI (in science) is here
    Inquiring
    Knowledge sourced via retrieval or model training: Model knowledge derived from training, retrieved knowledge derived from various levels of retrieval (RAG) complexity.
    Retrieval can be as simple as web search: See ChatGPT and Bing as examples. User question triggers custom web searches.
    Sophisticated agents can exceed human abilities: e.g., PaperQA2 from FutureHouse exceeds subject matter experts on realistic literature research tasks.
    Relatively low cost ($1-3 per query): Prompt model(s) to take on different roles. LLM operates with a degree of autonomy to complete a task, often managing its own internal processes and sub-tasks to achieve a goal.
    Paper shown on slide: Language Agents Achieve Superhuman Synthesis of Scientific Knowledge. Michael D. Skarlinski, Sam Cox, Jon M. Laurent, James D. Braza, Michaela Hinks, Michael J. Hammerling, Manvitha Ponnapati, Samuel G. Rodriques, Andrew D. White (FutureHouse Inc., San Francisco, CA; University of Rochester, Rochester, NY; Francis Crick Institute, London, UK). arXiv:2409.13740v2 [cs.CL], 26 Sep 2024.
    Abstract: Language models are known to “hallucinate” incorrect information, and it is unclear if they are sufficiently accurate and reliable for use in scientific research. We developed a rigorous human-AI comparison methodology to evaluate language model agents on real-world literature search tasks covering information retrieval, summarization, and contradiction detection tasks. We show that PaperQA2, a frontier language model agent optimized for improved factuality, matches or exceeds subject matter expert performance on three realistic literature research tasks without any restrictions on humans (i.e., full access to internet, search tools, and time). PaperQA2 writes cited, Wikipedia-style summaries of scientific topics that are significantly more accurate than existing, human-written Wikipedia articles. We also introduce a hard benchmark for scientific literature research called LitQA2 that guided design of PaperQA2, leading to it exceeding human performance. Finally, we apply PaperQA2 to identify contradictions within the scientific literature, an important scientific task that is challenging for humans. PaperQA2 identifies 2.34 ± 1.99 (mean ± SD, N = 93 papers) contradictions per paper in a random subset of biology papers, of which 70% are validated by human experts. These results demonstrate that language model agents are now capable of exceeding domain experts across meaningful tasks on scientific literature.
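The retrieval step behind "knowledge sourced via retrieval" can be sketched with a deliberately simple ranker. This is not how PaperQA2 works; it is a minimal bag-of-words illustration, and `retrieve`, `build_prompt`, and the three passages are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> Counter:
    """Lowercase word counts; a crude stand-in for learned embeddings."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(question: str, corpus: list[str], k: int = 2) -> list[str]:
    """Return the k passages most similar to the question."""
    q = tokenize(question)
    return sorted(corpus, key=lambda doc: cosine(q, tokenize(doc)), reverse=True)[:k]


def build_prompt(question: str, passages: list[str]) -> str:
    """Ask the model to answer only from numbered, citable passages."""
    context = "\n".join(f"[{i}] {p}" for i, p in enumerate(passages, start=1))
    return f"Answer using only the sources below and cite them.\n{context}\nQuestion: {question}"


corpus = [  # made-up passages for illustration
    "Galaxy mergers can trigger bursts of star formation.",
    "Version control records who changed what and when.",
    "Star formation rates decline in older galaxies.",
]
question = "What affects star formation in galaxies?"
top = retrieve(question, corpus)
print(build_prompt(question, top))
```

Numbering the retrieved passages in the prompt is what lets the generated answer cite its sources, which in turn makes the answer checkable.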
  42. Generative AI (in science) is here
    Hypothesis forming
    Leverage relatively simple RAG architecture: Retrieving information from a subset of the NASA ADS (journal index database).
    Multiple GPT-4 instances working ‘adversarially’: Generation, critiquing, feedback moderation provided by separate models.
    Promise of ‘lowering barriers to realizing value’: Authors suggest a relatively simple ‘human in the loop’ architecture may allow many to realize significant value.
    Paper shown on slide: Harnessing the Power of Adversarial Prompting and Large Language Models for Robust Hypothesis Generation in Astronomy. Ioana Ciucă, Yuan-Sen Ting, Sandor Kruk, Kartheik Iyer. Proceedings of the 40th International Conference on Machine Learning, Honolulu, Hawaii, USA. PMLR 202, 2023. arXiv:2306.11648v1 [astro-ph.IM], 20 Jun 2023.
    Abstract: This study investigates the application of Large Language Models (LLMs), specifically GPT-4, within Astronomy. We employ in-context prompting, supplying the model with up to 1000 papers from the NASA Astrophysics Data System, to explore the extent to which performance can be improved by immersing the model in domain-specific literature. Our findings point towards a substantial boost in hypothesis generation when using in-context prompting, a benefit that is further accentuated by adversarial prompting. We illustrate how adversarial prompting empowers GPT-4 to extract essential details from a vast knowledge base to produce meaningful hypotheses, signaling an innovative step towards employing LLMs for scientific research in Astronomy.
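The "generation, critiquing, feedback moderation provided by separate models" arrangement can be sketched as three roles passing text to one another. This is a hedged illustration, not the paper's implementation: `adversarial_loop` and the stub models are hypothetical, and each stub stands in for a separate LLM call.

```python
from typing import Callable

Model = Callable[[str], str]  # hypothetical: any function mapping a prompt to text


def adversarial_loop(generator: Model, critic: Model, moderator: Model,
                     context: str, rounds: int = 2) -> list[str]:
    """Run generate -> critique -> moderate, feeding moderated feedback to the next draft."""
    feedback = "none"
    hypotheses = []
    for _ in range(rounds):
        hypothesis = generator(f"Context: {context}\nFeedback: {feedback}\nPropose a hypothesis.")
        critique = critic(f"Find weaknesses in: {hypothesis}")
        feedback = moderator(f"Summarize the actionable points in: {critique}")
        hypotheses.append(hypothesis)
    return hypotheses


# Stub models so the sketch runs without any API access.
def stub_generator(prompt: str) -> str:
    return "H1 (first draft)" if "Feedback: none" in prompt else "H2 (revised)"


def stub_critic(prompt: str) -> str:
    return "Too vague; no testable prediction."


def stub_moderator(prompt: str) -> str:
    return "Add a testable prediction."


drafts = adversarial_loop(stub_generator, stub_critic, stub_moderator, "retrieved abstracts")
print(drafts)  # prints ['H1 (first draft)', 'H2 (revised)']
```

Keeping every draft, rather than only the last, preserves a record of how each hypothesis changed in response to critique, which fits the versioned approach the talk argues for.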
  43. Generative AI (in science) is here: Experimentation

    ‘AI Scientist’ going from idea → ‘reviewed’ paper: Generates ideas, writes code, executes experiments, analyzes results, authors paper and simulates reviews. Costs about $15 per outputted paper.

    Experiments currently limited to machine learning: Limited to domains where ‘experiment’ is code/numerical-based (diffusion modeling, learning dynamics, transformer-based language modeling).

    Results can be interesting, but with many caveats: Creates interesting ideas but struggles with implementation, rigor, and accuracy due to limitations in computation, experimental depth, and current model capabilities*.

    * “GPT-4o in particular frequently fails to write LaTeX that compiles.”

    2024-9-4 The AI Scientist: Towards Fully Automated Open-Ended Scientific Discovery

    Chris Lu (1,2,*), Cong Lu (3,4,*), Robert Tjarko Lange (1,*), Jakob Foerster (2,†), Jeff Clune (3,4,5,†) and David Ha (1,†)
    * Equal Contribution, (1) Sakana AI, (2) FLAIR, University of Oxford, (3) University of British Columbia, (4) Vector Institute, (5) Canada CIFAR AI Chair, † Equal Advising

    One of the grand challenges of artificial general intelligence is developing agents capable of conducting scientific research and discovering new knowledge. While frontier models have already been used as aides to human scientists, e.g. for brainstorming ideas, writing code, or prediction tasks, they still conduct only a small part of the scientific process. This paper presents the first comprehensive framework for fully automatic scientific discovery, enabling frontier large language models (LLMs) to perform research independently and communicate their findings. We introduce The AI Scientist, which generates novel research ideas, writes code, executes experiments, visualizes results, describes its findings by writing a full scientific paper, and then runs a simulated review process for evaluation. In principle, this process can be repeated to iteratively develop ideas in an open-ended fashion and add them to a growing archive of knowledge, acting like the human scientific community. We demonstrate the versatility of this approach by applying it to three distinct subfields of machine learning: diffusion modeling, transformer-based language modeling, and learning dynamics. Each idea is implemented and developed into a full paper at a meager cost of less than $15 per paper, illustrating the potential for our framework to democratize research and significantly accelerate scientific progress. To evaluate the generated papers, we design and validate an automated reviewer, which we show achieves near-human performance in evaluating paper scores. The AI Scientist can produce papers that exceed the acceptance threshold at a top machine learning conference as judged by our automated reviewer. This approach signifies the beginning of a new era in scientific discovery in machine learning: bringing the transformative benefits of AI agents to the entire research process of AI itself, and taking us closer to a world where endless affordable creativity and innovation can be unleashed on the world’s most challenging problems. Our code is open-sourced at https://github.com/SakanaAI/AI-Scientist.

    1. Introduction

    The modern scientific method (Chalmers, 2013; Dewey, 1910; Jevons, 1877) is arguably one of the greatest achievements of the Enlightenment. Traditionally, a human researcher collects background knowledge, drafts a set of plausible hypotheses to test, constructs an evaluation procedure, collects evidence for the different hypotheses, and finally assesses and communicates their findings. Afterward, the resulting manuscript undergoes peer review and subsequent iterations of refinement. This procedure has led to countless breakthroughs in science and technology, improving human quality of life. However, this iterative process is inherently limited by human researchers’ ingenuity, background knowledge, and finite time. Attempting to automate general scientific discovery (Langley, 1987, 2024; Waltz and Buchanan, 2009) has been a long ambition of the community since at least the early 70s, with computer-assisted works like the Automated Mathematician (Lenat, 1977; Lenat and Brown, 1984) and DENDRAL (Buchanan and Feigenbaum, 1981). In the field of AI, researchers have envisioned the possibility of automating AI research using AI itself (Ghahramani, 2015; Schmidhuber, 1991, 2010a,b, 2012), leading to “AI-generating algorithms” (Clune, 2019). More recently, foundation models have seen tremendous advances in their general capabilities (Anthropic, 2024; Google DeepMind Gemini Team, 2023; Llama Team, 2024; OpenAI, 2023), but they have only been shown to accelerate individual parts of the research pipeline, e.g. the writing of scientific manuscripts (Altmäe et al., 2023;

    Corresponding author(s): Chris Lu, Cong Lu, and Robert Tjarko Lange
    arXiv:2408.06292v3 [cs.AI] 1 Sep 2024
  44. Generative AI (in science) is here: Evaluation
  45. Making predictions: What might the future look like for science?
  46. Making predictions: (Lots) more science outputs

    AI agents capable of ‘doing science’ will mature: Agents will be able to generate and critique hypotheses, source data to test them, generate results, and evaluate the potential value of the results. What we are seeing today is very, very early.

    AI tools will enable more ‘results’ to be generated by more people: AI tools will make existing scientists more productive and enable more people to participate in the scientific process. This already means up to 50% more code written by developers, and also no/low-code solutions (e.g., CHat Oriented Programming (CHOP*): “coding via iterative prompt refinement”).

    Cost of generating an output that resembles a novel result will trend to zero: The cost of writing an AI-generated paper today is around $15; over time this will trend towards zero. The cost to achieve a given minimum MMLU score (e.g., ~40 or ~80) is decreasing 10x year over year.
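To make the "trend to zero" claim concrete, here is a minimal arithmetic sketch. It assumes, purely for illustration, that the per-paper cost falls at the same 10x-per-year rate quoted for MMLU; the slide does not state that rate for papers, and the function name `projected_cost` is hypothetical.

```python
def projected_cost(start_cost: float, yearly_factor: float, years: int) -> float:
    """Cost after `years` if it is divided by `yearly_factor` every year."""
    return start_cost / (yearly_factor ** years)


START_COST_USD = 15.0  # per-paper cost quoted on the slide
YEARLY_FACTOR = 10.0   # ASSUMPTION: borrowed from the MMLU cost trend, not stated for papers

if __name__ == "__main__":
    for year in range(4):
        cost = projected_cost(START_COST_USD, YEARLY_FACTOR, year)
        print(f"year {year}: ${cost:.4f} per paper")
```

Under that assumption the cost drops below one cent per paper within four years, which is the sense in which generation becomes effectively free while human review does not.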
  47. Making predictions: Possible consequences

    As the pace of publishing accelerates, journals (in their current form) will break: An increasing fraction of scientific outputs will come from AI agents. Humans will not be able to keep up.

    Traditional mechanisms of sharing (and evaluating) science are going to be insufficient: Either we’ll spend all of our time reviewing AI-generated outputs, or we’ll need to find a better way…

    Papers will become an increasingly unsatisfactory way of sharing information: Papers are designed for humans to read, include lots of (arguably) unnecessary information, and aren’t machine readable.
  48. How will we still do science?

    By instilling and enshrining practices of Collaborative, Versioned Science.
  49. CVS as our lodestar: Modular, composable, reusable

    Standardize frameworks: Promote universal frameworks (e.g., FAIR principles for data) to ensure interoperability between AI tools, human researchers, and datasets.

    Build AI model registries: Maintain version-controlled registries of AI models used in research. Each model version should link to specific studies to ensure reproducibility and transparency.

    Invest in composable research pipelines: Lean on existing workflow technologies and place them at the heart of AI-enabled research code generation.
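One way to picture the model registry idea is the in-memory sketch below. It is an illustration under stated assumptions, not an existing tool: the class names `ModelVersion` and `ModelRegistry`, the model name, the hash placeholder, and the study ID are all hypothetical. A real registry would also persist its state in a version-controlled store.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass(frozen=True)
class ModelVersion:
    """One immutable, identifiable version of a model (hypothetical schema)."""
    name: str
    version: str
    weights_sha256: str  # content hash pinning the exact weights


@dataclass
class ModelRegistry:
    """Maps each registered model version to the studies that used it."""
    _studies: Dict[ModelVersion, List[str]] = field(default_factory=dict)

    def register(self, model: ModelVersion) -> None:
        self._studies.setdefault(model, [])

    def link_study(self, model: ModelVersion, study_id: str) -> None:
        if model not in self._studies:
            raise KeyError(f"unregistered model version: {model}")
        if study_id not in self._studies[model]:
            self._studies[model].append(study_id)

    def studies_for(self, model: ModelVersion) -> List[str]:
        return list(self._studies.get(model, []))

    def models_for(self, study_id: str) -> List[ModelVersion]:
        return [m for m, ids in self._studies.items() if study_id in ids]


if __name__ == "__main__":
    registry = ModelRegistry()
    model = ModelVersion("example-llm", "1.2.0", "<sha256 of weights>")  # placeholder values
    registry.register(model)
    registry.link_study(model, "study-0001")
    print(registry.studies_for(model))
    print(registry.models_for("study-0001"))
```

The lookup works in both directions: from a model version to every study that depends on it, and from a study back to the exact model versions needed to reproduce it.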
  50. CVS as our lodestar: Transparency and inclusivity

    Diversify collaboration platforms: With results coming from more diverse participants, the venues in which they convene will need to evolve.

    Automated credit attribution: Mechanisms for automated tracking and recording of contributions will become essential, for both transparency and fairness.

    Open, peer-led governance: To seek out diverse opinions on the direction of AI-driven research studies.
  51. CVS as our lodestar: Process automation

    Automated reproduction checks: Automated validation pipelines that rerun experiments, check for statistical significance, and compare outputs across different environments.

    Versioned outputs: Increased use of versioning technologies and AI agents to track changes in research outputs, maintaining the latest version of outputs.

    Focus AI on ‘lower-order’ scientific work: Leverage AIs for repetitive tasks such as data cleaning, documenting protocols, and literature reviews.
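A reproduction check can be sketched in a few lines. The example below is an illustration only: `run_experiment` is a toy, seeded stand-in for a real experiment, and the function names are hypothetical. It covers only the rerun-and-compare step, not significance testing or cross-environment runs.

```python
import hashlib
import json
import math
import random


def run_experiment(seed: int, n: int = 1000) -> dict:
    """Toy stand-in for an experiment: seeded estimate of the mean of U(0, 1)."""
    rng = random.Random(seed)
    samples = [rng.random() for _ in range(n)]
    return {"n": n, "mean": sum(samples) / n}


def fingerprint(result: dict) -> str:
    """Content hash of a result, usable as a version identifier for the output."""
    canonical = json.dumps(result, sort_keys=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def reproduces(recorded: dict, seed: int, rel_tol: float = 1e-9) -> bool:
    """Rerun the experiment and compare it with the recorded result."""
    rerun = run_experiment(seed, recorded["n"])
    return math.isclose(rerun["mean"], recorded["mean"], rel_tol=rel_tol)


if __name__ == "__main__":
    recorded = run_experiment(seed=42)
    print("fingerprint:", fingerprint(recorded)[:12])
    print("reproduces:", reproduces(recorded, seed=42))
```

Comparing with a tolerance rather than demanding identical hashes is a deliberate choice here: floating-point results can differ slightly between environments, so the hash serves for versioning while the tolerance check serves for validation.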
  52. CVS as our lodestar: Documentation

    Automated documentation generation: Leverage AI to generate comprehensive, standardized documentation of protocols, methods, and results, in multiple forms.

    AI-generated research tutorials/onboarding: Use AI to generate custom tutorials for each study, reducing the time it takes humans to onboard into a research domain.

    Semantic versioning of experiments: As in software, adopt the SemVer concept in science to improve historical traceability.

    Machine-readable experimental metadata: Document the whole research cycle as machine-readable metadata to enable AI-driven discovery and cross-disciplinary reuse.
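The sketch below shows how semantic versioning and machine-readable metadata might fit together for an experiment. The mapping from change type to version bump is an assumption made for illustration (the slide does not define one), and `ExperimentVersion`, `to_metadata`, and the field names are hypothetical.

```python
import json
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class ExperimentVersion:
    major: int
    minor: int
    patch: int

    def bump(self, change: str) -> "ExperimentVersion":
        # ASSUMED mapping, by analogy with software SemVer:
        #   major: method or conclusions change incompatibly
        #   minor: new data or analysis added, earlier results still hold
        #   patch: corrections that do not change results (typos, docs)
        if change == "major":
            return ExperimentVersion(self.major + 1, 0, 0)
        if change == "minor":
            return ExperimentVersion(self.major, self.minor + 1, 0)
        if change == "patch":
            return ExperimentVersion(self.major, self.minor, self.patch + 1)
        raise ValueError(f"unknown change type: {change!r}")

    def __str__(self) -> str:
        return f"{self.major}.{self.minor}.{self.patch}"


def to_metadata(title: str, version: ExperimentVersion, inputs: List[str]) -> str:
    """Serialize a minimal, machine-readable experiment record as JSON."""
    record = {"title": title, "version": str(version), "inputs": inputs}
    return json.dumps(record, sort_keys=True, indent=2)


if __name__ == "__main__":
    v = ExperimentVersion(1, 2, 3)
    print(v.bump("minor"))
    print(to_metadata("example study", v.bump("patch"), ["dataset-a.csv"]))
```

Resetting the lower fields on a major or minor bump follows the software convention, so that a reader comparing two versions can tell at a glance whether earlier conclusions still stand.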
  53. CVS as our lodestar: High-quality review process

    AI-augmented peer review: AI assisting reviewers by flagging methodological errors, inconsistencies, or potential biases while preserving human judgment.

    Automated reproducibility checks: Ahead of publication (or perhaps even submission), AI tooling automatically re-running experiments to validate results.

    Community-enabled peer review: If outputs are versioned and incremental, perhaps review becomes versioned and incremental too?