Data scientists come with different skills and backgrounds
[Diagram: the data scientist (DS) sits at the intersection of Machine Learning, Big Data, Visualization, Analytics, HPC, and CS / Programming. Backgrounds shown: Statistician / Analyst, Research / Computational Scientist, Developer / Engineer.]
Data science teams are formed by members with very diverse backgrounds:
• both in terms of knowledge (CS, Statistics, Viz, ML…)
• and technology stacks (R, SAS, Python…)
How can companies organize efficiently in this environment?
Open Data Science
movement that makes open source tools for data science (data, analytics, and computation) easily work together as a connected ecosystem
Vibrant and Growing Community
• Python Community: 30M+
• Anaconda Downloads*: 3M+
• Packages in Anaconda: 720+
• R Community: 16M+
• Spark Python Usage: 50%+
* As of Dec 2015. Another 2.7M downloads YTD
Data scientists are not the only players on the data science team
• Roles: Biz Analyst, Data Engineer, Developer, DevOps
• Activities: Explore & Analyze, Collaborate & Publish, Deploy & Operate
[Diagram: data sources (Web Services, Data Warehouse, HDFS, Streaming Data, Flat Files, NoSQL); stages (Model Building, Integrate, Deploy, Operate); deployment targets (Cloud Computing, Web Services, On-Premise, Internal Cluster).]
• Distribute, share and publish data science assets
• Get diverse data scientists (languages, tools, data models, assets…) to collaborate effectively
• Enable data scientists to easily leverage Big Data technologies
• Deploy data science assets into production applications
• Share insights with decision makers
• Enable business analysts and managers to leverage data science
Distribution
• Anaconda
Community Innovation
• Jupyter, JupyterLab and extensions
• Bokeh for interactive data visualizations
• Datashader for large scale visualizations
• Dask for parallel computing
• Numba for high performance computing

• Anaconda Enterprise
[Diagram: the Anaconda distribution bundles Python, conda, and 150+ packages (IPython, Numba, Matplotlib, Spyder, Numexpr, Cython, Theano, Scikit-image, NLTK, NetworkX and more); Miniconda bundles just Python and conda.]
• Anaconda distribution: Python distribution that includes 150+ packages for data science (in the installer)
• Miniconda: lightweight version of Anaconda, with just Python and conda
• Anaconda Cloud: cloud service to host and share public (free) and private data science assets
• Anaconda Navigator: Anaconda distribution UI to manage environments, launch applications and learn about what’s happening in the community
manager
• conda-forge: a community-led collection of recipes, build infrastructure and distributions for the conda package manager
• conda environments: custom isolated sandboxes to easily reproduce and share data science projects
• conda kapsel: reproducible, executable project directories
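As a rough illustration of what an isolated conda environment looks like from inside Python, the sketch below reports which interpreter and environment a script is running in. The helper name `describe_environment` is hypothetical and not part of conda; `CONDA_DEFAULT_ENV` is the variable conda sets when an environment is activated, and it may be absent otherwise.

```python
import os
import sys


def describe_environment():
    """Return basic facts about the interpreter running this script.

    Hypothetical helper for illustration only; not part of conda.
    """
    return {
        # Root directory of the active Python installation or environment.
        "prefix": sys.prefix,
        # Full path of the interpreter binary in use.
        "executable": sys.executable,
        # Name set by `conda activate`; None when no conda env is active.
        "conda_env": os.environ.get("CONDA_DEFAULT_ENV"),
        "python_version": "{}.{}.{}".format(*sys.version_info[:3]),
    }


if __name__ == "__main__":
    for key, value in describe_environment().items():
        print("{}: {}".format(key, value))
```

Running the same script from two different environments should show different prefixes, which is the isolation the bullet describes.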
[Diagram: the Anaconda distribution bundles Python, conda, and 150+ packages (Pandas, Scikit-learn, Jupyter / IPython, Numba, Matplotlib, Spyder, Numexpr, Cython, Theano, Scikit-image, NLTK, NetworkX and more); Miniconda bundles just Python and conda.]
• Easy to install on all platforms
• Language agnostic: Python, R, Scala…
• Trusted by industry leaders
• Trusted by the community, with a large user base: 3M+ downloads
• BSD license
• Extensible: easily build, share and install proprietary libraries with Anaconda Cloud
• Allows isolated custom sandboxes with different versions of packages: conda environments
• Allows for easy encapsulation and deployment of data science assets: conda kapsel
• Bokeh: web interactive data visualizations (no JS)
• Datashader: graphics pipeline system for creating meaningful representations of large amounts of data
• Dask: parallel computing framework
• JupyterLab: next-generation data science IDE
• nb_condakernel: dropdown inside the notebook UI to switch between conda envs
• nb_conda: helps manage conda envs from inside the file viewer of the Jupyter notebook
for presentation
• No JavaScript
• Python, R, Scala and Lua bindings
• Easy to embed in web applications
• Server apps: data can be updated, and UI and selection events can be processed to trigger more visual updates
http://bokeh.pydata.org/en/latest/
Datashader: graphics pipeline system for creating meaningful representations of large amounts of data
• Provides automatic, nearly parameter-free visualization of datasets
• Allows extensive customization of each step in the data-processing pipeline
• Supports automatic downsampling and re-rendering with Bokeh and the Jupyter notebook
• Works well with Dask and Numba to handle very large datasets in and out of core (with examples using billions of data points)
https://github.com/bokeh/datashader
[Figure: NYC census data by race]
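The downsampling idea can be sketched without Datashader itself: bin every point into a fixed grid of pixels and keep one count per pixel, so the output size depends on the canvas and not on the number of points. The code below is a pure-Python stand-in for that aggregation step only. `rasterize_counts` and its parameters are hypothetical names, not Datashader's API, and the points are synthetic.

```python
import random


def rasterize_counts(points, x_range, y_range, width, height):
    """Count how many (x, y) points fall into each cell of a width x height grid.

    Hypothetical illustration of aggregation onto a raster; not Datashader's API.
    """
    x_min, x_max = x_range
    y_min, y_max = y_range
    grid = [[0] * width for _ in range(height)]
    for x, y in points:
        # Skip points outside the requested viewport.
        if not (x_min <= x <= x_max and y_min <= y <= y_max):
            continue
        # Map coordinates to pixel indices; clamp the upper edge into the last cell.
        col = min(int((x - x_min) / (x_max - x_min) * width), width - 1)
        row = min(int((y - y_min) / (y_max - y_min) * height), height - 1)
        grid[row][col] += 1
    return grid


if __name__ == "__main__":
    rng = random.Random(0)  # synthetic data, seeded for repeatability
    points = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(100000)]
    grid = rasterize_counts(points, (-3, 3), (-3, 3), width=8, height=4)
    for row in grid:
        print(" ".join("{:6d}".format(count) for count in row))
```

Calling the function again with a narrower `x_range` and `y_range` roughly corresponds to re-rendering after a zoom: the grid stays the same size, and only the points inside the new viewport are counted.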
deliver information about the current state of the network. This helps to track progress, identify performance issues, and debug failures over a normal web page in real time.
• Anaconda Distribution: Open Data Science Core
• Anaconda Repository: Open Data Science Repository
• Anaconda Accelerate: High Performance Computing
• Anaconda Scale: Distributed Computing
• Anaconda Enterprise Notebooks: Data Science Collaboration
• Anaconda Mosaic: Heterogeneous Data Exploration
• Anaconda Fusion: Excel Data Science
• Distribute data science assets
• Get diverse data scientists (languages, tools, data models, deliverables…) to collaborate effectively
• Enable data scientists to easily leverage Big Data technologies
• Deploy data science assets into production applications
• Share insights with decision makers