Openness, Reproducibility, Interactivity: a Biased View on the Relation between Science and Computing
Slides for my keynote at the Michigan State University Cyber-Infrastructure Days. Note that a good part of the talk consisted of interactive demos of various aspects of IPython.
a Biased View on the Relation between Science and Computing
Fernando Pérez
http://fperez.org, @fperez_org
[email protected]
Helen Wills Neuroscience Institute, UC Berkeley
CI Days, BEACON Center
MSU, East Lansing
October 26, 2012
the DNA of science
Much more than “the third branch” of science
An avalanche of experimental quantitative data
- Biology, genetics, neuroscience, astronomy, climate modeling...
- All scientists must now do real computing
Good computing is now a necessary (though not sufficient!) condition for good science.
Computing in science must improve drastically before we can really call it scientific.
FP (UC Berkeley) Openness, Reproducibility, Interactivity 10/26/12 4 / 42
credibility and real issues
The Duke clinical trials scandal - Potti/Nevins
- A compounding of (common and otherwise) data analysis errors.
- No materials allowing validation/reproduction of results.
- Lawsuits, resignations, careers destroyed.
- More importantly: patients were harmed.
- Major policy reviews and changes: NCI, IOM, ...
- More: see K. Baggerly’s "starter set" page.
The Duke situation is more common than we’d like to believe!
Begley & Ellis, Nature, 3/28/12: Drug development: Raise standards for preclinical cancer research.
- 47 out of 53 “landmark papers” could not be replicated.
Nature, Feb 2012, Ince et al: The case for open computer programs
- “The scientific community places more faith in computation than is justified”
- “anything less than the release of actual source code is an indefensible approach for any scientific results that depend on computation”
research practices!
Reproducibility at publication time? It’s already too late.
Learn from a community (open source) where reproducibility is an everyday practice (by necessity).
Reproducibility?
Tracking and recreating every step of your work
In the software world: it’s called Version Control!
Git: an enabling technology. Use version control for everything
- Paper/grant writing (never get paper_v5_john.tex by email again!)
    git clone git@server:/my/grant/repo.git
    cd repo
    make nsf-fastlane
- Everyday research: track your results
- Collaboration: synchronize multi-author work.
- Teaching!
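As a small illustration of "track your results" in everyday research, here is a minimal Python sketch that is not taken from the talk. It stores a result next to a fingerprint of the input data and parameters that produced it, so the record can be committed to version control and checked later. The function names (`fingerprint`, `make_record`), the record layout, and the sample data are all hypothetical.

```python
import hashlib
import json


def fingerprint(data: bytes, params: dict) -> str:
    """Return a SHA-256 hex digest of the input bytes plus the parameters."""
    h = hashlib.sha256()
    h.update(data)
    # sort_keys makes the serialization independent of dict insertion order
    h.update(json.dumps(params, sort_keys=True).encode("utf-8"))
    return h.hexdigest()


def make_record(data: bytes, params: dict, result: float) -> str:
    """Serialize a result with its provenance as diff-friendly JSON text."""
    record = {
        "fingerprint": fingerprint(data, params),
        "params": params,
        "result": result,
    }
    # One key per line and a stable key order keep version-control diffs small
    return json.dumps(record, indent=1, sort_keys=True)


# Hypothetical example: the "analysis" is just a scaled mean of four numbers.
raw = b"1.0 2.0 3.0 4.0"
params = {"scale": 2.0}
values = [float(x) for x in raw.split()]
result = params["scale"] * sum(values) / len(values)

record_text = make_record(raw, params, result)
print(record_text)
```

Committing `record_text` alongside the code means any later change to the data or parameters shows up as a changed fingerprint in the repository history.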
integers
- Rationals
- Interval arithmetic
- Symbolic manipulation
- FORTRAN
- Extended precision floating point
- Text processing
- Databases
- Graphical user interfaces
- Web interfaces
- Hardware control
- Multi-language integration
- Data formats: HDF5, XML, ...
array object.
- Convenient syntax: c = a+b.
- Math library that operates on arrays: y = sin(k*t).
- Basic scientific functionality:
  - Linear algebra
  - FFTs
  - Random number generation
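The features listed on this slide can be sketched with NumPy as follows. This is an illustrative example rather than code from the talk; the array values, the frequency `k`, and the random seed are arbitrary assumptions.

```python
import numpy as np

# Array object with convenient elementwise syntax: c = a + b
a = np.array([1.0, 2.0, 3.0])
b = np.array([10.0, 20.0, 30.0])
c = a + b

# Math functions operate on whole arrays: y = sin(k*t)
k = 2.0 * np.pi  # assumed value: one full cycle over the unit interval
t = np.linspace(0.0, 1.0, 8, endpoint=False)
y = np.sin(k * t)

# Linear algebra: solve the system A x = rhs
A = np.array([[2.0, 0.0], [0.0, 4.0]])
rhs = np.array([2.0, 8.0])
x = np.linalg.solve(A, rhs)

# FFT: find the frequency bin with the largest magnitude
spectrum = np.fft.rfft(y)
peak_bin = int(np.argmax(np.abs(spectrum)))

# Random number generation, seeded so repeated runs give the same draws
rng = np.random.default_rng(0)
samples = rng.normal(size=5)

print(c, x, peak_bin, samples.shape)
```

Because `y` holds exactly one cycle of a sine wave over the sampled interval, the largest spectral magnitude falls in bin 1.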
a Scientific Idea (schematically)
1. Individual exploratory work
2. Collaborative development
3. Production work (HPC, cloud, parallel)
4. Publication (with reproducible results!)
5. Education
6. Goto 1.
The Problem with most tools
Barriers and discontinuities in the workflow between all the steps
Notebook Format
- JSON, but version control-friendly
- Easy for machine processing, fixable by hand if need be.
- Lots of hooks for metadata
- Not Python-specific (R and Ruby notebooks exist, Julia planned)
- Produce Markdown, reST, LaTeX, HTML, etc...
An open format for sharing, publishing and archiving executable computational work
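To make the format properties above concrete, here is a minimal Python sketch using only the standard library. The document structure shown is a simplified stand-in, not the actual IPython notebook schema, and the helper names (`dump_for_version_control`, `to_markdown`) are hypothetical.

```python
import json

# Simplified, illustrative structure: NOT the exact IPython notebook schema.
notebook = {
    "metadata": {"name": "demo"},
    "cells": [
        {"cell_type": "markdown", "source": "# A tiny analysis"},
        {"cell_type": "code", "language": "python", "source": "print(1 + 1)"},
    ],
}


def dump_for_version_control(nb: dict) -> str:
    """Serialize with stable key order and one item per line for clean diffs."""
    return json.dumps(nb, indent=1, sort_keys=True) + "\n"


def to_markdown(nb: dict) -> str:
    """Convert cells to Markdown, rendering code cells as indented blocks."""
    parts = []
    for cell in nb["cells"]:
        if cell["cell_type"] == "code":
            # Four-space indentation marks a code block in Markdown
            lines = cell["source"].splitlines()
            parts.append("\n".join("    " + line for line in lines))
        else:
            parts.append(cell["source"])
    return "\n\n".join(parts) + "\n"


text = dump_for_version_control(notebook)
restored = json.loads(text)  # plain JSON: any language can read it back
print(to_markdown(restored))
```

Because the file is plain JSON with a deterministic layout, it can be edited by hand, processed by other tools, and diffed line by line in version control.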
San Luis Obispo
Min Ragan-Kelley - Nuclear Engineering, UC Berkeley
Matthias Bussonnier - Physics, Institut Curie, Paris
Jonathan March - Enthought
Thomas Kluyver - Biology, U. Sheffield
Jörgen Stenarson - Elect. Engineering, Sweden
Paul Ivanov - Neuroscience, UC Berkeley
Robert Kern - Enthought
Evan Patterson - Physics, Caltech/Enthought
Brad Froehle - Mathematics, UC Berkeley
Stefan van der Walt - UC Berkeley
John Hunter - TradeLink Securities, Chicago
Prabhu Ramachandran - Aerospace Engineering, IIT Bombay
Satra Ghosh - MIT Neuroscience
Gaël Varoquaux - Neurospin (Orsay, France)
Ville Vainio - CS, Tampere University of Technology, Finland
Barry Wark - Neuroscience, U. Washington
Ondrej Certik - Physics, U Nevada Reno
Darren Dale - Cornell
Justin Riley - MIT
Mark Voorhies - UC San Francisco
Nicholas Rougier - INRIA Nancy Grand Est
Thomas Spura - Fedora project
Many more! (~150 commit authors)
Visual Studio integration, Azure (thanks to Shahrokh Mortazavi).
DoD/DRC Inc: funding through Sept. 2012 (thanks to Jose Unpingco and Chris Kees).
NIH: via NiPy grant
NSF: via Sage compmath grant
Google: Summer of Code 2005, 2010.
Tech-X Corp., Boulder, CO: Parallel/notebook (previous versions)