Intro to PyData

An Introduction to the PyData World
Jake VanderPlas (@jakevdp)
Index Conf 2018

$ whoami
jakevdp

Blog: http://jakevdp.github.io

History: how Python led to PyData
Tools: getting to know the landscape

Python is not a data science language.

Python was created in the 1980s as a teaching language, and to “bridge the gap between the shell and C”.
- Guido van Rossum, The Making of Python

“I thought we'd write small Python programs, maybe 10 lines, maybe 50, maybe 500 lines, that would be a big one”
- Guido van Rossum, The Making of Python

How did Python become a data science powerhouse?

1990s: The Scripting Era
Motto: “Python as Alternative to Bash”
* yes, this is overly simplified . . .

1990s: The Scripting Era
“Scientists... work with a wide variety of systems ranging from simulation codes, data analysis packages, databases, visualization tools, and home-grown software, each of which presents the user with a different set of interfaces and file formats. As a result, a scientist may spend a considerable amount of time simply trying to get all of these components to work together in some manner...”
- David Beazley, Scientific Computing with Python (ACM vol. 216, 2000)

1990s: The Scripting Era
“Simplified Wrapper and Interface Generator” (SWIG)
http://www.swig.org/

1990s: The Scripting Era
2000s: The SciPy Era
Motto: “Python as Alternative to MatLab”
* yes, this is overly simplified . . .

2000s: The SciPy Era
“I had a hodge-podge of work processes. I would have Perl scripts that called C++ numerical routines that would dump data files, and I would load them up into MatLab to plot them. After a while I got tired of the MatLab dependency… so I started loading them up in GnuPlot.”
- John Hunter, creator of Matplotlib (SciPy 2012 Keynote)

2000s: The SciPy Era
“Prior to Python, I used Perl (for a year) and then Matlab and shell scripts & Fortran & C/C++ libraries. When I discovered Python, I really liked the language... But, it was very nascent and lacked a lot of libraries. I felt like I could add value to the world by connecting low-level libraries to high-level usage in Python.”
- Travis Oliphant, creator of NumPy & SciPy (via email, 2015)

2000s: The SciPy Era
“I remember looking at my desk, and seeing all the books on languages I had. I literally had a stack with books on C, C++, Unix utilities (awk/sed/sh/etc), Perl, IDL manuals, the Mathematica book, Make printouts, etc. I realized I was probably spending more time switching between languages than getting anything done.”
- Fernando Perez, creator of IPython (via email, 2015)

2000s: The SciPy Era
Key Software Development:
[Figure: three project logos, labeled “Released circa 2002”, “Released circa 2000”, and “Released circa 2001”]
[Figure: early array libraries Numeric and Numarray on a timeline marked 1995 and 2002]

2000s: The SciPy Era
Originally, the three projects each had much wider scope:
[Figure labels: Computation, Visualization, Shell; Array Manipulation: Numeric, Numarray]

2000s: The SciPy Era
With time, the projects narrowed their focus:
[Figure labels: Shell, Computation, Visualization; Unified Array Library Underneath]

1990s: The Scripting Era
2000s: The SciPy Era
2010s: The PyData Era
Motto: “Python as Alternative to R”
* yes, this is overly simplified . . .

2010s: The PyData Era
“I had a distinct set of requirements that were not well-addressed by any single tool at my disposal:
- Data structures with labeled axes . . .
- Integrated time series functionality . . .
- Arithmetic operations and reductions . . .
- Flexible handling of missing data
- Merge and other relational operations . . .
I wanted to be able to do all these things in one place, preferably in a language well-suited to general purpose software development”
- Wes McKinney, creator of Pandas (in Python for Data Analysis)

2010s: The PyData Era
Key Software Development:
- 2011: Labeled data
- 2010: Machine Learning
- 2012: Packaging
- 2012: Compute Environment
- 2015: Polyglot notebook

1990s: The Scripting Era. Motto: “Python as Alternative to Bash”
2000s: The SciPy Era. Motto: “Python as Alternative to MatLab”
2010s: The PyData Era. Motto: “Python as Alternative to R”
* yes, this is all overly simplified . . .

People want to use Python because of its intuitiveness, beauty, philosophy, and readability.
So people build Python packages that incorporate lessons learned in other tools & communities.

A Quick Tour of the PyData World . . .

Installation
Conda is a cross-platform package and dependency manager, focused on Python for scientific and data-intensive computing. It comes in two flavors:
- Miniconda is a minimal install of the conda command-line tool
- Anaconda is Miniconda plus hundreds of common packages
I recommend Miniconda.
http://conda.pydata.org/

Installation
Anaconda and Miniconda are both available for a wide range of operating systems.
http://conda.pydata.org/

Installation
Miniconda is a lightweight installation (~25MB) that gives you access to the conda package management tool. It creates a sandboxed Python installation, entirely disconnected from your system Python.

    $ bash ~/Downloads/Miniconda3-latest-MacOSX-x86_64.sh
    Welcome to Miniconda3 4.3.21 (by Continuum Analytics, Inc.)
    In order to continue the installation process, please review the license agreement.
    Please, press ENTER to continue
    >>>

http://conda.pydata.org/

Installation
Both conda and python now point to the executables installed by Miniconda.

    $ which conda
    /Users/jakevdp/anaconda/bin/conda
    $ which python
    /Users/jakevdp/anaconda/bin/python
    $ python
    Python 3.5.1 |Continuum Analytics, Inc.| (default ...
    Type "help", "copyright", "credits" or "license" ...
    >>> print("hello world")
    hello world

http://conda.pydata.org/

Installation
Installation of new packages can be done seamlessly with conda install:

    $ conda install numpy scipy pandas matplotlib jupyter
    Fetching package metadata .........
    Solving package specifications: .
    Package plan for installation in environment /Users/jakevdp/anaconda/:
    The following NEW packages will be INSTALLED:
        appnope:   0.1.0-py36_0
        bleach:    1.5.0-py36_0
        cycler:    0.10.0-py36_0
        decorator: 4.0.11-py36_0

http://conda.pydata.org/

Installation
New sandboxed environments can be created with specific versions of Python and its packages. Here we create an environment named py2.7 with Python 2.7:

    $ conda create -n py2.7 python=2.7 numpy=1.13 scipy
    Fetching package metadata .........
    Solving package specifications: .
    Package plan for installation in environment /Users/jakevdp/anaconda/envs/py2.7:
    The following NEW packages will be INSTALLED:
        mkl:     2017.0.3-0
        numpy:   1.13.0-py27_0
        openssl: 1.0.2l-0
        pip:     9.0.1-py27_1

http://conda.pydata.org/

Installation
By “activating” the environment, we can now use this different Python version with a different set of packages. You can create as many of these environments as you’d like.

    $ conda activate py2.7
    (py2.7) $ which python
    /Users/jakevdp/anaconda/envs/py2.7/bin/python
    (py2.7) $ python --version
    Python 2.7.11 :: Continuum Analytics, Inc.

http://conda.pydata.org/
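
To confirm from within Python which environment is active, one option is to inspect the running interpreter. This is a minimal sketch using only the standard library; the helper name describe_interpreter is hypothetical and not part of conda.

```python
import sys


def describe_interpreter():
    """Return the running interpreter's path, install prefix, and version."""
    return {
        "executable": sys.executable,  # path of the interpreter that is running
        "prefix": sys.prefix,          # root directory of its installation or environment
        "version": "%d.%d" % (sys.version_info[0], sys.version_info[1]),
    }


info = describe_interpreter()
for key in sorted(info):
    print(key, "->", info[key])
```

Inside an activated conda environment, the prefix is expected to point at that environment's directory rather than at the system Python.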

Installation
I tend to use conda envs for just about everything, particularly when testing development versions of projects I contribute to.

    $ conda env list
    # conda environments:
    #
    astropy-dev    /Users/jakevdp/anaconda/envs/astropy-dev
    jupyterlab     /Users/jakevdp/anaconda/envs/jupyterlab
    python2.7      /Users/jakevdp/anaconda/envs/python2.7
    python3.3      /Users/jakevdp/anaconda/envs/python3.3
    python3.4      /Users/jakevdp/anaconda/envs/python3.4
    python3.5      /Users/jakevdp/anaconda/envs/python3.5
    python3.6      /Users/jakevdp/anaconda/envs/python3.6
    scipy-dev      /Users/jakevdp/anaconda/envs/scipy-dev
    sklearn-dev    /Users/jakevdp/anaconda/envs/sklearn-dev
    vega-dev       /Users/jakevdp/anaconda/envs/vega-dev
    root           /Users/jakevdp/anaconda

http://conda.pydata.org/

Installation
So… what about pip?
In brief: “pip installs python packages within any environment; conda installs any package within conda environments”
For many more details on the distinctions, see my blog post, Conda: Myths and Misconceptions:
https://jakevdp.github.io/blog/2016/08/25/conda-myths-and-misconceptions/

Coding Environment:

    $ conda install jupyter notebook

http://jupyter.org/

Coding Environment:

    $ jupyter notebook
    [I 06:32:22.641 NotebookApp] Serving notebooks from local directory: /Users/jakevdp
    [I 06:32:22.641 NotebookApp] 0 active kernels
    [I 06:32:22.641 NotebookApp] The IPython Notebook is running at: http://localhost:8888/
    [I 06:32:22.642 NotebookApp] Use Control-C to stop this server and shut down all kernels (twice to skip confirmation).

http://jupyter.org/

Coding Environment:
JupyterLab has recently been released, making the notebook one component of a full-featured IDE.
http://jupyter.org/

Numerical Computation:

    $ conda install numpy

http://www.numpy.org/

Numerical Computation:
NumPy provides the ndarray object, which is useful for storing and manipulating numerical data arrays.

    import numpy as np
    x = np.arange(10)
    print(x)
    [0 1 2 3 4 5 6 7 8 9]

Arithmetic and other operations are performed element-wise on these arrays:

    print(x * 2 + 1)
    [ 1  3  5  7  9 11 13 15 17 19]

http://www.numpy.org/
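
A short sketch of what "element-wise" means in practice: the array expression gives the same values as an explicit loop over the entries. The boolean-mask line at the end is an extra illustration that is not on the slide.

```python
import numpy as np

x = np.arange(10)        # integers 0 through 9, as on the slide
y = x * 2 + 1            # element-wise arithmetic, no explicit loop

# The same values computed one entry at a time, for comparison
y_loop = np.array([2 * xi + 1 for xi in x])

print(y)
print(np.array_equal(y, y_loop))   # True: both approaches agree

# Comparisons are element-wise too, so they can be used to select entries
evens = x[x % 2 == 0]
print(evens)
```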

Numerical Computation:
NumPy also provides essential tools like pseudo-random numbers, linear algebra, Fast Fourier Transforms, etc.

    M = np.random.rand(5, 10)  # 5x10 random matrix
    u, s, v = np.linalg.svd(M)
    print(s)
    [ 4.22083 1.091050 0.892570 0.55553 0.392541]

    x = np.random.randn(100)  # 100 std normal values
    X = np.fft.fft(x)
    print(X[:4])  # first four entries
    [ -7.932434 +0.j -16.683935 -3.997685j 3.229016+16.658718j 2.366788-11.863747j]

http://www.numpy.org/
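
One way to sanity-check these routines is a round trip: rebuild the matrix from its SVD factors, and invert the FFT. This is a sketch under my own assumptions: the fixed seed and the full_matrices=False option are my choices and do not appear on the slide.

```python
import numpy as np

rng = np.random.RandomState(0)     # fixed seed so the sketch is repeatable
M = rng.rand(5, 10)                # 5x10 random matrix, as on the slide

# With full_matrices=False the shapes are u: (5, 5), s: (5,), vt: (5, 10).
# Note that NumPy returns the transposed right factor, hence the name vt.
u, s, vt = np.linalg.svd(M, full_matrices=False)
M_rebuilt = np.dot(u * s, vt)      # scale the columns of u by s, then multiply
print(np.allclose(M, M_rebuilt))   # True

x = rng.randn(100)                 # 100 standard-normal values
X = np.fft.fft(x)
x_back = np.fft.ifft(X)            # the inverse transform recovers x
print(np.allclose(x, x_back.real)) # True
```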

Numerical Computation:
Key to using NumPy (and general numerical code in Python) is vectorization. If you write Python like C, you’ll have a bad time:

    x = np.random.rand(10000000)

    %%timeit
    y = np.empty(x.shape)
    for i in range(len(x)):
        y[i] = 2 * x[i] + 1
    1 loop, best of 3: 6.4 s per loop

http://www.numpy.org/

Numerical Computation:
Use vectorization for readability and speed:

    x = np.random.rand(10000000)

    %%timeit
    y = 2 * x + 1
    10 loops, best of 3: 58.6 ms per loop

~ 100x speedup!

For a more complete intro to vectorization in NumPy, see Losing Your Loops: Fast Numerical Computation in Python (my talk at PyCon 2015):
https://www.youtube.com/watch?v=EEUXKG97YRw
https://speakerdeck.com/jakevdp/losing-your-loops-fast-numerical-computing-with-numpy-pycon-2015
http://www.numpy.org/
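
The %%timeit cell magic only works inside IPython or Jupyter. Outside a notebook, a similar comparison can be sketched with the standard-library timeit module. The function names and the smaller array size below are my own choices, and the measured times will vary by machine.

```python
import timeit

import numpy as np

x = np.random.rand(100000)   # smaller than the slide's array, to keep the run short


def loop_version(values):
    """C-style loop that fills the output one element at a time."""
    out = np.empty(values.shape)
    for i in range(len(values)):
        out[i] = 2 * values[i] + 1
    return out


def vectorized_version(values):
    """The same computation as a single array expression."""
    return 2 * values + 1


# Both versions should produce the same numbers
same = np.allclose(loop_version(x), vectorized_version(x))

# Total wall-clock time for three calls of each version
t_loop = timeit.timeit(lambda: loop_version(x), number=3)
t_vec = timeit.timeit(lambda: vectorized_version(x), number=3)
print("results agree:", same)
print("loop: %.4f s, vectorized: %.4f s" % (t_loop, t_vec))
```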

Labeled Data:

    $ conda install pandas

http://pandas.pydata.org

Labeled Data:
Pandas provides a DataFrame object, which is like a NumPy array but has labeled rows and columns:

    import pandas as pd
    df = pd.DataFrame({'x': [1, 2, 3],
                       'y': [4, 5, 6]})
    print(df)
       x  y
    0  1  4
    1  2  5
    2  3  6

http://pandas.pydata.org

Labeled Data:
Like NumPy, arithmetic is element-wise, but you can access and augment the data using column names:

    df['x+2y'] = df['x'] + 2 * df['y']
    print(df)
       x  y  x+2y
    0  1  4     9
    1  2  5    12
    2  3  6    15

http://pandas.pydata.org
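
Labels also drive selection. A brief sketch that continues the same small table: the row filter and the .loc lookup are illustrations I have added, not part of the slide.

```python
import pandas as pd

df = pd.DataFrame({'x': [1, 2, 3], 'y': [4, 5, 6]})
df['x+2y'] = df['x'] + 2 * df['y']     # new column built from existing ones

# Keep only the rows where the new column exceeds 10
big = df[df['x+2y'] > 10]
print(big)

# Look up a single value by row label and column name
value = df.loc[1, 'y']
print(value)
```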

Labeled Data:
Pandas excels in reading data from disk in a variety of formats. Start here to read virtually any data format!

    # contents of data.csv
    name, id
    peter, 321
    paul, 605
    mary, 444

    df = pd.read_csv('data.csv')
    print(df)
        name   id
    0  peter  321
    1   paul  605
    2   mary  444

http://pandas.pydata.org
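
A self-contained sketch of the same read, using an in-memory string as a stand-in for data.csv so that no file is needed. Because the sample file has a space after each comma, I pass skipinitialspace=True (my addition) so that the second column is named 'id' rather than ' id'.

```python
import io

import pandas as pd

# Stand-in for the contents of data.csv shown on the slide
csv_text = "name, id\npeter, 321\npaul, 605\nmary, 444\n"

# skipinitialspace=True drops the blank that follows each comma
df = pd.read_csv(io.StringIO(csv_text), skipinitialspace=True)
print(df)
print(df.columns.tolist())
```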

Labeled Data:
Pandas also provides fast SQL-like grouping & aggregation:

    df = pd.DataFrame({'id': ['A', 'B', 'A', 'B'],
                       'val': [1, 2, 3, 4]})
    print(df)
      id  val
    0  A    1
    1  B    2
    2  A    3
    3  B    4

    grouped = df.groupby('id').sum()
    print(grouped)
        val
    id
    A     4
    B     6

http://pandas.pydata.org
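
The same grouping can return several aggregates at once, much like a SQL GROUP BY with multiple aggregate columns. This is a small sketch; the choice of sum, mean, and count is mine.

```python
import pandas as pd

df = pd.DataFrame({'id': ['A', 'B', 'A', 'B'],
                   'val': [1, 2, 3, 4]})

# One row per group, one column per aggregate
summary = df.groupby('id')['val'].agg(['sum', 'mean', 'count'])
print(summary)
```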

Visualization:

    $ conda install matplotlib

http://www.matplotlib.org/

Visualization:
Matplotlib was developed as a Pythonic replacement for MatLab; thus MatLab users should find it quite familiar:

    import numpy as np
    import matplotlib.pyplot as plt

    x = np.linspace(0, 10, 1000)
    plt.plot(x, np.sin(x))
    plt.plot(x, np.cos(x))

http://www.matplotlib.org/

Visualization Beyond Matplotlib . . .
Pandas offers a simplified Matplotlib interface:

    data = pd.read_csv('iris.csv')
    data.plot.scatter('petalLength', 'petalWidth')

http://pandas.pydata.org

Visualization Beyond Matplotlib . . .
PdVega gives a similar interface to Vega-Lite:

    import pdvega  # import makes vgplot attribute available
    data.vgplot.scatter('petalLength', 'petalWidth')

http://jakevdp.github.io/pdvega/

Visualization Beyond Matplotlib . . .
Seaborn is a package for statistical data visualization:

    seaborn.pairplot(data, hue='species')

http://seaborn.pydata.org/

Visualization Beyond Matplotlib . . .
Bokeh: interactive visualization in the browser.
http://bokeh.pydata.org/

Visualization Beyond Matplotlib . . .
Plotly: “modern platform for data science”
http://plotly.com/

Visualization Beyond Matplotlib . . .
plotnine: grammar of graphics in Python

    (ggplot(mtcars, aes('wt', 'mpg', color='factor(gear)'))
     + geom_point()
     + stat_smooth(method='lm')
     + facet_wrap('~gear'))

http://plotnine.readthedocs.io/

Visualization Beyond Matplotlib . . .
Viz in Python is a huge and rapidly-developing space. See my PyCon 2017 talk, Python’s Visualization Landscape:
https://speakerdeck.com/jakevdp/pythons-visualization-landscape-pycon-2017
https://www.youtube.com/watch?v=FytuB8nFHPQ

Numerical Algorithms: SciPy

    $ conda install scipy

http://www.scipy.org/

Numerical Algorithms: SciPy
SciPy contains almost too many routines to demonstrate, e.g.:

    scipy.sparse       sparse matrix operations
    scipy.interpolate  interpolation routines
    scipy.integrate    numerical integration
    scipy.spatial      spatial metrics & distances
    scipy.stats        statistical functions
    scipy.optimize     minimization & optimization
    scipy.linalg       linear algebra
    scipy.special      special mathematical functions
    scipy.fftpack      Fourier & related transforms

Most functionality comes from wrapping Netlib & related Fortran libraries, meaning it is blazing fast.
http://www.scipy.org/

Numerical Algorithms: SciPy

    import matplotlib.pyplot as plt
    import numpy as np
    from scipy import special, optimize

    x = np.linspace(0, 10, 1000)
    opt = optimize.minimize(special.j1, x0=3)
    plt.plot(x, special.j1(x))
    plt.plot(opt.x, special.j1(opt.x), marker='o', color='red')

http://www.scipy.org/
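
A plot-free variant of the example above, for checking the numbers. Note that I have swapped in optimize.minimize_scalar with explicit bounds instead of the slide's optimize.minimize call. The interval (3, 7) is my assumption, chosen to contain a single minimum of the Bessel function J1.

```python
from scipy import optimize, special

# Bounded scalar minimization of J1 on an interval with one local minimum
res = optimize.minimize_scalar(special.j1, bounds=(3, 7), method='bounded')

print(res.x)               # location of the minimum
print(special.j1(res.x))   # value of J1 at the minimum

# At a minimum the first derivative of J1 should be close to zero
slope = special.jvp(1, res.x)
print(slope)
```

The minimum found this way should sit where the red marker appears in the plotted version.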

Machine Learning:
Scikit-learn features a well-defined, extensible API for the most popular machine learning algorithms.

    $ conda install scikit-learn

http://scikit-learn.org/

Machine Learning with scikit-learn
Make some noisy 1D data for which we can fit a model:

    x = 10 * np.random.rand(100)
    y = np.sin(x) + 0.1 * np.random.randn(100)
    plt.plot(x, y, '.k')

http://scikit-learn.org/

Machine Learning with scikit-learn
Fit a random forest regression:

    from sklearn.ensemble import RandomForestRegressor
    model = RandomForestRegressor()
    model.fit(x[:, np.newaxis], y)

    xfit = np.linspace(-1, 11, 1000)
    yfit = model.predict(xfit[:, np.newaxis])

    plt.plot(x, y, '.k')
    plt.plot(xfit, yfit)

http://scikit-learn.org/

Machine Learning with scikit-learn
Fit a support vector regression:

    from sklearn.svm import SVR
    model = SVR()
    model.fit(x[:, np.newaxis], y)

    xfit = np.linspace(-1, 11, 1000)
    yfit = model.predict(xfit[:, np.newaxis])

    plt.plot(x, y, '.k')
    plt.plot(xfit, yfit)

Scikit-learn’s strength: it provides a uniform API for the most common machine learning methods.
http://scikit-learn.org/
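
The uniform API means different estimators can be swapped inside the same loop. A sketch under my own assumptions: the fixed seeds, the n_estimators value, and the use of score (R^2 on the training data) are additions for illustration, not part of the slides.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.svm import SVR

rng = np.random.RandomState(42)          # fixed seed so the sketch is repeatable
x = 10 * rng.rand(100)
y = np.sin(x) + 0.1 * rng.randn(100)     # noisy 1D data, as on the slides
X = x[:, np.newaxis]                     # scikit-learn expects a 2D feature array

models = {
    'random forest': RandomForestRegressor(n_estimators=50, random_state=0),
    'support vector': SVR(),
}

scores = {}
for name, model in models.items():
    model.fit(X, y)                      # identical calls for every estimator
    scores[name] = model.score(X, y)     # R^2 measured on the training data
    print(name, round(scores[name], 3))
```

Training-set scores say nothing about generalization; they are used here only to show that both models run through the same three calls.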

Parallel Computation:
Dask is a lightweight tool for creating task graphs that can be executed on a variety of backends.

    $ conda install dask

http://dask.pydata.org/

Parallel Computation:
Typical data manipulation with NumPy:

    import numpy as np
    a = np.random.randn(1000)
    b = a * 4
    b_min = b.min()
    print(b_min)
    -13.2982888603

http://dask.pydata.org/

Parallel Computation:
Same operation with dask:

    import dask.array as da
    a2 = da.from_array(a, chunks=200)
    b2 = a2 * 4
    b2_min = b2.min()
    print(b2_min)
    dask.array<amin-aggregate, shape=(), dtype=float64, chunksize=()>

The result is a “Task Graph”: nothing is computed until it is requested.

    b2_min.compute()
    -13.298288860312757

http://dask.pydata.org/

Code Optimization: Numba
Numba is a bytecode compiler that can convert Python code to fast LLVM code targeting a CPU or GPU.

    $ conda install numba

http://numba.pydata.org/

Code Optimization: Numba
Simple iterative functions tend to be slow in Python:

    def fib(n):
        a, b = 0, 1
        for i in range(n):
            a, b = b, a + b
        return a

    %timeit fib(10000)  # ipython “timeit magic”
    100 loops, best of 3: 2.73 ms per loop

http://numba.pydata.org/

Code Optimization: Numba
With a quick decorator, code can be ~500x as fast!

    import numba

    @numba.jit
    def fib(n):
        a, b = 0, 1
        for i in range(n):
            a, b = b, a + b
        return a

    %timeit fib(10000)  # ipython “timeit magic”
    100000 loops, best of 3: 6.06 µs per loop

~ 500x speedup!

Numba achieves this by just-in-time (JIT) compilation of the Python function to LLVM byte-code.
http://numba.pydata.org/

Code Optimization: Cython
Cython is a superset of the Python language that can be compiled to fast C code.

    $ conda install cython

http://www.cython.org/

Code Optimization: Cython
Again, returning to our fib function:

    # python code
    def fib(n):
        a, b = 0, 1
        for i in range(n):
            a, b = b, a + b
        return a

    %timeit fib(10000)
    100 loops, best of 3: 2.73 ms per loop

http://www.cython.org/

Code Optimization: Cython
Cython compiles the code to C, giving marginal speedups without even changing the code:

    %%cython
    def fib(n):
        a, b = 0, 1
        for i in range(n):
            a, b = b, a + b
        return a

    %timeit fib(10000)
    100 loops, best of 3: 2.42 ms per loop

~ 10% speedup!
http://www.cython.org/

Code Optimization: Cython
Using Cython’s syntactic sugar to specify types for the compiler leads to much better performance:

    %%cython
    def fib(int n):
        cdef int a = 0, b = 1
        for i in range(n):
            a, b = b, a + b
        return a

    %timeit fib(10000)
    100000 loops, best of 3: 5.93 µs per loop

~ 500x speedup!
http://www.cython.org/
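
One caveat when reading the typed benchmark: the exact value of fib(10000) is far too large for any fixed-width machine integer. A version that uses such types can time the loop but cannot return the same number as the pure-Python function. A small pure-Python check; the name INT64_MAX is my own illustrative constant.

```python
def fib(n):
    """Plain-Python Fibonacci from the slides; Python integers never overflow."""
    a, b = 0, 1
    for i in range(n):
        a, b = b, a + b
    return a


INT64_MAX = 2**63 - 1   # largest value of a signed 64-bit integer

# Find the largest n whose Fibonacci number still fits in 64 bits
n = 0
while fib(n + 1) <= INT64_MAX:
    n += 1

digits = len(str(fib(10000)))
print(n)        # 92
print(digits)   # 2090
```

For benchmarks where the returned value matters, keeping n at or below this limit makes the typed and untyped versions directly comparable.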

Powered by Cython:
The PyData stack is largely powered by Cython: SciPy . . . and many more.
http://www.cython.org/

Python is not a data science language.
And this may be its greatest strength.

Thank You!
Twitter: @jakevdp
Github: jakevdp
Web: http://vanderplas.com/
Blog: http://jakevdp.github.io/
