source world Where is this going? The scientific Python ecosystem: open source tools for better computing in science Fernando Pérez http://fperez.org [email protected] Helen Wills Neuroscience Institute, UC Berkeley BioFrontiers, CU Boulder April 2, 2012
source world Where is this going? Outline 1 Changes in Science & Computing 2 Two vignettes 3 Scientific Python 4 IPython 5 Lessons from the open source world 6 Where is this going? FP (UC Berkeley) Python for science 4/2/12 2 / 54
Computing: part of the DNA of science
- Much more than “the third branch” of science
- An avalanche of experimental quantitative data
  - Biology, genetics, neuroscience, astronomy, climate modeling...
  - All scientists must now do real computing
- “Big Data”, “Cloud computing”, etc: lots of buzzwords...
  - They will NOT automatically produce good science
Good computing is now a necessary (though not sufficient!) condition for good science. The rigor, openness, culture of validation, collaboration and other aspects of science must also become part of scientific computing.
A crisis of credibility and real issues
The Duke clinical trials scandal - Potti/Nevin
- A compounding of (common and otherwise) data analysis errors.
- No materials allowing validation/reproduction of results.
- Patients were harmed. Lawsuits, resignations.
- Major policy reviews and changes: NCI, IOM, ...
- More: see K. Baggerly’s "starter set" page.
The Duke situation is more common than we’d like to believe!
- Begley & Ellis, Nature, 3/28/12: Drug development: Raise standards for preclinical cancer research.
  - 47 out of 53 “landmark papers” could not be replicated.
- Nature, Feb 2012, Ince et al: The case for open computer programs
  - “The scientific community places more faith in computation than is justified”
  - “anything less than the release of actual source code is an indefensible approach for any scientific results that depend on computation”
Related changes: Open *
Internet: interactions for humans, code and data
- Open Source Software
  - development akin to scientific culture
  - viable alternatives to proprietary software
  - tools and lessons for improving the scientific process: Github
- Open Access
  - thecostofknowledge.org: Elsevier boycott
  - FRPAA House hearing on March 29th.
- Open Education
  - MIT Open Courseware, Khan Academy...
  - Stanford CS 221 in fall 2011: ~160,000 students.
  - Spring 2012: Sebastian Thrun leaves Stanford: Udacity. Stanford: Coursera. MITx, TED-Ed...
Applied Math, CU Boulder. Fast application of integral kernels. (Partial Differential Equations)
- Implementation went from 1 to 3 dimensions directly (extremely unusual).
- Complex algorithm: beyond pure numerics.
- Very good performance, thanks to NumPy, F2PY and weave.
- Dynamically generated C++ sources: code as a run-time resource.
N_nod = 10, ε = 1.0e-10, N_blocks = 445
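As a rough illustration of "code as a run-time resource", the sketch below builds the source of a loop nest for an arbitrary number of dimensions and compiles it while the program runs. It is only a stand-in under stated assumptions: the work described above generated C++ through weave, whereas this example generates plain Python and compiles it with exec. The names make_grid_sum and grid_sum are hypothetical and do not come from the original project.

```python
def make_grid_sum(ndim):
    """Generate, at run time, a function that sums f over an ndim-dimensional grid.

    Hypothetical helper for illustration: it writes Python source with one
    nested loop per dimension, then compiles that source with exec().
    """
    if ndim < 1:
        raise ValueError("ndim must be at least 1")

    indices = [f"i{d}" for d in range(ndim)]
    lines = ["def grid_sum(f, n):", "    total = 0.0"]

    # One 'for' statement per dimension, each indented one level deeper.
    for depth, name in enumerate(indices):
        lines.append("    " * (depth + 1) + f"for {name} in range(n):")

    # Innermost body: evaluate f at the current grid point.
    lines.append("    " * (ndim + 1) + f"total += f({', '.join(indices)})")
    lines.append("    return total")

    source = "\n".join(lines)
    namespace = {}
    exec(source, namespace)  # compile the generated source into a function
    return namespace["grid_sum"], source


# The same generator serves 1 and 3 dimensions without hand-written variants.
sum_1d, _ = make_grid_sum(1)
sum_3d, source_3d = make_grid_sum(3)
print(source_3d)
```

The design point is that the dimension-specific code is never written by hand: it is produced from a single description, so moving from 1 to 3 dimensions changes one argument rather than the implementation.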
integers
- Rationals
- Interval arithmetic
- Symbolic manipulation
- FORTRAN
- Extended precision floating point
- Text processing
- Databases
- Graphical user interfaces
- Web interfaces
- Hardware control
- Multi-language integration
- Data formats: HDF5, XML, ...
Python in this context
- Open Source, free, highly portable.
- Extremely readable: “executable pseudo-code”.
- Simple: “fits your brain”.
- Rich types and library: “batteries included”
- Easy to wrap C, C++ and FORTRAN.
- NumPy: IDL/Matlab-like arrays.
array object.
- Convenient syntax: c = a+b.
- Math library that operates on arrays: y = sin(k*t).
- Basic scientific functionality:
  - Linear algebra
  - FFTs
  - Random number generation
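The array features listed above can be sketched with a few lines of NumPy. This is a minimal example, not material from the talk: the array values, the 2x2 system and the random seed are arbitrary choices made for illustration.

```python
import numpy as np

# Convenient syntax: arithmetic on arrays is elementwise.
a = np.array([1.0, 2.0, 3.0])
b = np.array([10.0, 20.0, 30.0])
c = a + b  # elementwise sum: [11, 22, 33]

# Math library that operates on whole arrays at once.
k = 2.0 * np.pi
t = np.linspace(0.0, 1.0, 8, endpoint=False)  # one period, 8 samples
y = np.sin(k * t)

# Linear algebra: solve the small system A x = rhs.
A = np.array([[2.0, 1.0],
              [1.0, 3.0]])
rhs = np.array([3.0, 5.0])
x = np.linalg.solve(A, rhs)

# FFTs: the spectrum of a single sine period peaks at frequency index 1.
spectrum = np.fft.fft(y)

# Random number generation with a seeded generator for repeatability.
rng = np.random.default_rng(0)
samples = rng.standard_normal(1000)
```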
IPython: Interactive Scientific Computing
A CU Boulder project
- Started when I was a graduate student in Physics (2001).
- Continued as a postdoc in Applied Mathematics.
- Brian Granger: CU Physics.
In brief
1. A better Python shell
2. Embeddable Kernel and powerful interactive clients
   1. Terminal
   2. Qt console
   3. Web notebook
3. Flexible parallel computing
(previous versions)
- Microsoft: WinHPC support, Visual Studio integration
- NIH: via NiPy grant
- NSF: via Sage compmath grant
- Google: Summer of Code 2005, 2010.
- DoD/HPTi.
Luis Obispo Physics
Min Ragan-Kelley - UC Berkeley Nuclear engineering.
Thomas Kluyver - U. Sheffield Plant biology
Jörgen Stenarson - SP Technical Research Institute of Sweden
Paul Ivanov - UC Berkeley neuroscience
Robert Kern - Enthought
Evan Patterson - Caltech Physics/Enthought
Stefan van der Walt - UC Berkeley
John Hunter - TradeLink Securities, Chicago.
Prabhu Ramachandran - Aerospace Engineering, IIT Bombay
Satra Ghosh - MIT Neuroscience
Gaël Varoquaux - Neurospin (Orsay, France)
Ville Vainio - CS, Tampere University of Technology, Finland
Barry Wark - Neuroscience, U. Washington.
Ondrej Certik - Physics, U Nevada Reno
Darren Dale - Cornell
Justin Riley - MIT
Mark Voorhies - UC San Francisco
Nicholas Rougier - INRIA Nancy Grand Est
Thomas Spura - Fedora project
Julian Taylor - Debian/Ubuntu
Many more! (~140 commit authors)
research practices! Reproducibility at publication time? It’s already too late. Learn from a community (open source) where reproducibility is an everyday practice (by necessity)
Reproducibility? Tracking and recreating every step of your work
- In the software world: it’s called Version Control!
Git: an enabling technology. Use version control for everything
- Paper/grant writing (never get paper_v5_john.tex by email again!)
- Everyday research: track your results
- Teaching (never accept an emailed homework assignment again!)
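One possible way to "track your results" in everyday research is to store, next to each result, the Git commit of the code that produced it. The sketch below is an assumption about how that could look, not a workflow described in the talk. The helper names current_commit and tag_result are hypothetical; the only external command used is `git rev-parse HEAD`.

```python
import json
import subprocess
from datetime import datetime, timezone


def current_commit(repo_dir="."):
    """Return the current Git commit hash, or None if it cannot be determined."""
    try:
        out = subprocess.run(
            ["git", "rev-parse", "HEAD"],
            cwd=repo_dir,
            capture_output=True,
            text=True,
            check=True,
        )
    except (OSError, subprocess.CalledProcessError):
        # Git is not installed, or repo_dir is not a repository with commits.
        return None
    return out.stdout.strip()


def tag_result(value, params, repo_dir="."):
    """Bundle a result with its parameters and the code version that made it."""
    return {
        "value": value,
        "params": params,
        "commit": current_commit(repo_dir),
        "timestamp": datetime.now(timezone.utc).isoformat(),
    }


# Hypothetical usage: the value and parameters here are placeholders.
record = tag_result(3.14, {"n_points": 10})
print(json.dumps(record, indent=2))
```

With a record like this, a number in a notebook or draft can be traced back to the exact state of the code, which can then be checked out again to recreate the step.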
Distributed backup: the dog cannot eat their homework! They can work from any computer. Easy downloading of all class materials without a million clicks. The end of the email attachment madness. Version control as a natural tool, as common as email.
IPython and the lifecycle of scientific ideas
- Individual exploration
- Collaboration
  - “Google docs with a brain”
- Large-scale parallel production work
  - IPython notebook on Amazon EC2: MIT’s StarCluster
- Publication
  - Generation of HTML/PDF/EPub...
  - “Executable papers”
- Education
  - Workshops and bootcamps (UC Berkeley, elsewhere)
The executable paper: Titus Brown (MSU), 3/21/12 http://arxiv.org/abs/1203.4802
US publisher)
- A full book on brain imaging and statistics (JB Poline - Neurospin).
- DoD - classic HPC environments.
- Notebook: a format beyond Python (R, matlab, etc...)
- UK: Python in education and the Raspberry Pi.
- Numfocus.org: a foundation
  - interface with industry.
  - support open source scientific Python
  - produce educational materials
- Github.com: collaborations on ‘versioned science’.
science... So we must also change:
- Improve our computational praxis
- Better educate our students
- Acknowledge computational work alongside other metrics of academic work.