1997 to 1999 wanting a complete data analysis environment: Paul Barrett, Joe Harrington, Perry Greenfield, Paul Dubois, Konrad Hinsen, and others. Activity in 1998 led to increased interest in 1999.

In response, on 15 Jan 1999, I posted to matrix-sig a list of routines I felt needed to be present and began wrapping/writing in earnest. On 6 April 1999, I announced I would be creating this uber-package, which eventually became SciPy.

• 5 Jan 1999: Gaussian quadrature
• 30 Jan 1999: cephes 1.0
• 23 Feb 1999: sigtools 0.40
• March 1999: Numeric docs
• 9 Mar 1999: cephes 1.1
• 13 Apr 1999: multipack 0.3
• 14 Apr 1999: Helper routines
• 29 Apr 1999: multipack 0.6 (leastsq, ode, fsolve, quad)
• 30 May 1999: sparse plan described
• 14 Jun 1999: multipack 0.7
• 5 Nov 1999: SparsePy 0.1
• 29 Dec 1999: cephes 1.2 (vectorize)
• Plotting?? (Gist, XPLOT, DISLIN, Gnuplot)
• Helping with f2py
Virtanen • Robert Kern • Warren Weckesser • Ralf Gommers • Mark Wiebe • Nathaniel Smith • Nathan Bell • Stefan van der Walt • Matthew Brett • Josef Perktold …
• Often lonely — initially nobody believes in your idea more than you do. Others need some “proof” before they join you. • The more complicated what you are doing is, the lonelier it will be initially. • Examples: • I procrastinated my PhD at least 1 year to create the beginnings of SciPy (don’t tell my wife). • Pearu Peterson put in tremendous work to create f2py and scipy.linalg. • I spent 18 months not publishing papers to write NumPy (despite many people telling me it was foolish).
is everything (sometimes you are the right person for the job) • Having an urgency (it won’t wait) • Striving for excellence Give the best you have…and it will never be enough. Give your best anyway. — Mother Teresa
need help to achieve your goals. • This means other people. This will require sacrificing some of your ego to really listen! • Someone will point out how you suck (listen to them, you probably do). • Nurture empathy. • Treat other people like they matter to you — the only successful way to do that is to actually care. This exposes you to getting hurt — care about people anyway! • Much more could be said on this topic…
than scattered • contiguous is better than strided • descriptive is better than imperative • array-oriented is better than object-oriented • broadcasting is a great idea • vectorized is better than an explicit loop • unless it’s too complicated --- then use Cython/Numba • think in higher dimensions
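A minimal sketch of two of these maxims in practice — vectorization over an explicit loop, and broadcasting; the array values are invented for the example:

```python
import numpy as np

# "vectorized is better than an explicit loop":
# center each column of a matrix without writing any loop.
x = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0]])

col_means = x.mean(axis=0)  # shape (3,)

# "broadcasting is a great idea": the (3,) vector of means is
# stretched across both rows of the (2, 3) array automatically.
centered = x - col_means
```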
• Originated by Ken Iverson • Direct descendants (J, K, Matlab) are still used heavily, and people pay a lot of money for them • NumPy is a descendant. [Family-tree diagram: APL at the root; J, K, Matlab, Numeric, NumPy, Q as descendants]
GPUs) were made for array-oriented computing. • The software stack has just not caught up --- unfortunate, because APL came out in 1963. • There is a reason Fortran remains popular.
naturally array-oriented (easy to vectorize) • Algorithms can be expressed at a high level • These algorithms can be parallelized more simply (quite often much information is lost in the translation to typical “compiled” languages) • Array-oriented algorithms map to modern hardware caches and pipelines.
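One hedged sketch of what “expressed at a high level” means: a three-point moving average written array-wise instead of element-wise (the signal values are invented):

```python
import numpy as np

signal = np.array([1.0, 2.0, 4.0, 8.0, 16.0])

# Element-wise thinking: loop over i, average each point's neighbors.
# Array-oriented thinking: shift-and-add whole arrays. The intent
# stays visible, and a runtime can see the whole data flow at once,
# which is what makes parallelization and pipelining simpler.
smooth = (signal[:-2] + signal[1:-1] + signal[2:]) / 3.0
```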
object with some user-friendly features. • NumPy “arrays of structures” can be used to handle arbitrary data. http://people.rit.edu/blbgse/pythonNotes/numpy.html

First Name | Last Name | Score
Dave       | Thomas    | 89.4
Tasha      | Hen       | 76.6
Cool       | Python    | 100
Stack      | Overflow  | 95.32
Py         | Py        | 75
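The table above can be reproduced as a NumPy structured array (“arrays of structures”); the field names and dtypes are chosen for the example:

```python
import numpy as np

scores = np.array(
    [("Dave", "Thomas", 89.4),
     ("Tasha", "Hen", 76.6),
     ("Cool", "Python", 100.0),
     ("Stack", "Overflow", 95.32),
     ("Py", "Py", 75.0)],
    dtype=[("first", "U10"), ("last", "U10"), ("score", "f8")],
)

# Whole-column operations still work on the record data.
best = scores[scores["score"].argmax()]
```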
(indexes) • Easy manipulation of new columns • Missing Value handling • Time Series handling • General Split-Apply-Combine • Merge and Join • Integrated Plotting • Chained method calls the norm • Familiar to R users — more user-friendly features!
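A small sketch of the split-apply-combine idea in pandas (the data here is invented):

```python
import pandas as pd

df = pd.DataFrame({
    "city":  ["NYC", "NYC", "SF", "SF"],
    "sales": [10.0, 20.0, 5.0, 15.0],
})

# Split by city, apply a mean, combine back into one result.
per_city = df.groupby("city")["sales"].mean()
```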
of your way) • White space preserves “visual real-estate” • Over-loadable operators • Complex numbers built-in early • Just enough language support for arrays • “Occasional” programmers can grok it • Packaging with conda is awesome!
(functional, object-oriented, scripts, etc.) • Experienced programmers can also use it effectively (classes, meta-programming techniques) • Has a simple, extensible implementation (so that C-extensions can exist) • General-purpose language --- can build a system • Critical mass! • Allows for community
Some syntax warts (1:10:2 outside [ ] please) • The CPython run-time is aged and needs an overhaul (GIL, global variables, lack of dynamic compilation support) • No approach to language extension except for abuse of meta-programming and “import hooks” (lightweight DSL need) • The distraction of multiple run-times... • Array-oriented and NumPy not really understood by many Python devs (but thank you for ‘@‘)
System (including structures) • C-API • Simple to understand data-structure • Memory mapping • Syntax support from Python • Large community of users • Broadcasting • Easy to interface C/C++/Fortran code • PEP 3118
to extend • Immediate mode creates huge temporaries (spawning Numexpr) • “Almost” an in-memory database comparable to SQLite (missing indexes) • Integration with sparse arrays • Lots of un-optimized parts • Minimal support for multi-core / GPU • Code-base is organic and hard to extend (already inherited from Numeric)
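What “immediate mode creates huge temporaries” means in practice: each operator allocates a full-size intermediate array before the result is bound. A sketch of the issue and the manual workaround (Numexpr automates this fusion):

```python
import numpy as np

a = np.ones(1_000_000)
b = np.ones(1_000_000)
c = np.ones(1_000_000)

# Immediate mode: (a * b) allocates a million-element temporary,
# then (+ c) allocates another one for the final result.
result = a * b + c

# Manual fusion: reuse one output buffer, no extra temporaries.
out = np.multiply(a, b)
np.add(out, c, out=out)
```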
Basically the “structure” of NumPy arrays as a protocol in Python itself, to establish a memory-sharing standard between objects. It makes possible a heterogeneous world of powerful array-like objects outside of NumPy that can communicate. Falls short in not defining a general data-description language (DDL). http://python.org/dev/peps/pep-3118/
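The memory-sharing standard can be seen with the standard library alone: a memoryview borrows the buffer of an array.array via the buffer protocol instead of copying it (a minimal sketch):

```python
from array import array

data = array("d", [1.0, 2.0, 3.0])

# memoryview uses the PEP 3118 protocol to share the buffer:
# no copy is made, so a write through the view is visible in
# the original object.
view = memoryview(data)
view[0] = 99.0
```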
provide a different view on encapsulation (the dual of the typical encapsulation approach, for those initiated to linear algebra) • Rather than attach methods to data and hide the data, you describe data completely and declaratively. You can then later apply arbitrary code to this data at run-time. • It makes it much easier to move code to data at all levels.
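A sketch of “describe data completely and declaratively, then apply code later”: raw bytes are given meaning at run-time by a dtype description rather than by wrapping them in an object (the record layout here is invented):

```python
import numpy as np

# Two little-endian records: (int32 id, float64 value).
layout = [("id", "<i4"), ("value", "<f8")]
raw = np.array([(1, 2.5), (2, 4.5)], dtype=layout).tobytes()

# Later, the same bytes are re-interpreted by applying the
# declarative description to the data: code moves to the data.
records = np.frombuffer(raw, dtype=layout)
```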
be heterogeneous
• There will be many projects:
  - Biggus
  - DistArray
  - SciDB
  - Elemental
  - Spartan
• Not counting integration with the other projects:
  - Spark
  - Impala
  - Disco
allows you to have multiple versions of packages installed on a system • Easy app deployment • Taming open source • Users love it! Free for all users. Enterprise support available!
Data Pain
- Hundreds of data formats
- Basic programs expect all data to fit in memory
- Data analysis pipelines constantly changing from one form to another
- Sharing analysis involves significant overhead to configure systems
- Parallelizing analysis requires an expert in a particular distributed-computing stack
[Pipeline diagram: Select NYC, Find Tech Selloff, Plot]
• Lazy computation to minimize data movement
• Simple DAG for compilation to:
  - parallel application
  - distributed memory
  - static optimizations
C++ library for multi-dimensional arrays • datashape — general data-description language (what PEP 3118 was missing) • Blaze - Data (adapt data from many different silos) - Compute (interpreters to run expressions on backends) - Expr (Symbolic expressions) - Interfaces (Focusing on Tables) • BLZ — experimental storage format (unsupported)
```python
M, N = image.shape
m, n = filt.shape
for i in range(m//2, M - m//2):
    for j in range(n//2, N - n//2):
        result = 0.0
        for k in range(m):
            for l in range(n):
                result += image[i + k - m//2, j + l - n//2] * filt[k, l]
        output[i, j] = result
```

~1500x speed-up
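The ~1500x figure presumably comes from compiling the loop nest above (e.g., with Numba). Even without a compiler, the same filter can be written array-wise, looping only over the small kernel; this is a sketch assuming an odd-sized kernel, not the talk's actual benchmark code:

```python
import numpy as np

def filter2d(image, filt):
    # Correlate image with an odd-sized kernel, zero borders.
    M, N = image.shape
    m, n = filt.shape
    output = np.zeros_like(image)
    # Accumulate shifted-and-scaled views of the whole image:
    # the Python loops run m*n times instead of ~M*N*m*n times.
    inner = np.zeros((M - m + 1, N - n + 1))
    for k in range(m):
        for l in range(n):
            inner += filt[k, l] * image[k:M - m + 1 + k, l:N - n + 1 + l]
    output[m//2:M - m//2, n//2:N - n//2] = inner
    return output
```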
the first libraries I wrote • extended the “umath” module by adding new “universal functions” to compute many scientific functions by wrapping C and Fortran libs. • Bessel functions are solutions to a differential equation:

x^2 \frac{d^2 y}{dx^2} + x \frac{dy}{dx} + \left(x^2 - \alpha^2\right) y = 0, \qquad y = J_\alpha(x)

J_n(x) = \frac{1}{\pi} \int_0^\pi \cos\left(n\tau - x \sin\tau\right)\, d\tau
10000 loops, best of 3: 75 us per loop

In [7]: from scipy.special import j0

In [8]: %timeit j0(x)
10000 loops, best of 3: 75.3 us per loop

But! Now code is in Python and can be experimented with more easily (and moved to the GPU / accelerator more easily)!
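The point can be checked numerically: J0 evaluated directly from its integral definition with plain NumPy (the grid size is an arbitrary choice; scipy.special.j0 is the compiled version being timed above):

```python
import numpy as np

def j0_integral(x, num=2001):
    # J_0(x) = (1/pi) * integral from 0 to pi of cos(x * sin(tau)) d tau
    tau = np.linspace(0.0, np.pi, num)
    vals = np.cos(x * np.sin(tau))
    # Trapezoidal rule written out by hand for version portability.
    h = np.pi / (num - 1)
    return (vals.sum() - 0.5 * (vals[0] + vals[-1])) * h / np.pi
```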