Statistics, Data Mining, and Machine Learning (mostly don't work) in Astronomy

Statistics, Data Mining, and Machine Learning In Astronomy Jake VanderPlas
@jakevdp ACAT 2017

Straightforward application of common techniques often fails. In Astronomy:
- Data are often quite noisy
- Pre-labeled objects are often biased toward easy-to-observe (bright and/or nearby) objects
- Data are fundamentally image-based, which doesn’t play well with tabular database architectures

We make progress in Astronomy by adapting and extending methods
developed in other fields

Case Studies: Statistics, Data Mining, and Machine Learning

Case 1: Generalizing the Lomb-Scargle Periodogram*
* J. VanderPlas et al. 2015

Periodic Analysis
Robust detection of periodic variability is important in many areas of Astronomy.
- Large-Scale Structure: Sesar et al. 2010
- Exoplanets: European Space Agency

Lomb-Scargle Periodogram
cf. Lomb (1976), Scargle (1982)
Figure: VanderPlas & Ivezic 2015
- Generalization of a Fourier Spectrogram
- Effectively assumes a sinusoidal model
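The sinusoidal model behind the periodogram can be made concrete with a short sketch. The code below is a minimal pure-Python version of the classical (Scargle 1982) form, normalized so the power is the fraction of variance explained by the best-fit sinusoid at each frequency. It is an illustration, not the code used for the talk; the function name `lomb_scargle_power` and the synthetic light curve (frequency, noise level, sampling) are assumptions chosen for the example.

```python
import math
import random


def lomb_scargle_power(t, y, frequency):
    """Normalized classical Lomb-Scargle power at one frequency (cycles per unit time)."""
    omega = 2.0 * math.pi * frequency
    mean_y = sum(y) / len(y)
    yc = [yi - mean_y for yi in y]  # the classical form assumes zero-mean data

    # The time offset tau makes the sine and cosine terms orthogonal.
    tau = math.atan2(sum(math.sin(2 * omega * ti) for ti in t),
                     sum(math.cos(2 * omega * ti) for ti in t)) / (2 * omega)

    cos_t = [math.cos(omega * (ti - tau)) for ti in t]
    sin_t = [math.sin(omega * (ti - tau)) for ti in t]
    yc_cos = sum(v * c for v, c in zip(yc, cos_t))
    yc_sin = sum(v * s for v, s in zip(yc, sin_t))
    cc = sum(c * c for c in cos_t)
    ss = sum(s * s for s in sin_t)

    # Fraction of the data variance explained by the best-fit sinusoid (0 to 1).
    return (yc_cos ** 2 / cc + yc_sin ** 2 / ss) / sum(v * v for v in yc)


# Synthetic, irregularly sampled light curve (all values here are made up).
rng = random.Random(0)
true_frequency = 0.4
times = sorted(rng.uniform(0.0, 100.0) for _ in range(200))
values = [math.sin(2 * math.pi * true_frequency * ti) + rng.gauss(0.0, 0.1)
          for ti in times]

# Scan a frequency grid and report the location of the highest peak.
frequencies = [0.05 + 0.001 * k for k in range(951)]
powers = [lomb_scargle_power(times, values, f) for f in frequencies]
best_frequency = frequencies[powers.index(max(powers))]
print(f"peak at frequency {best_frequency:.3f}")
```

Because each frequency is just a two-parameter least-squares fit, the method works on irregular sampling, but it has no notion of which band a point was observed in.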

Problem: Lomb-Scargle is not designed for heterogeneous data. For example,
stars observed in multiple bands (i.e. wavelength regions)

Two Naive Multiband Approaches (figures: 5 bands/night and 1 band/night)
1. Ignore band distinction and fit a single periodogram to all bands. (Model is highly biased: under-fits the data.)
2. Fit an independent periodogram within each band; combine the χ² of all K bands. (Model is too flexible: over-fits the data.)

Idea: Generalize the model (“Multiband Periodogram”)
- define a base component which contributes equally to all bands.
- for each band, add a band component to describe deviation from base model.

Putting it all together: The Multiband Periodogram
Base component + band component = multiband model.
Regularize the band component to drive common variation to the base model.
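One way to read the base-plus-band construction is as a ridge-regularized linear least-squares fit at each trial frequency. The sketch below is a simplified illustration under stated assumptions, not the published implementation: each band is centered on its own mean instead of fitting offset terms, every component is a single sinusoid, and the penalty strength `lam` is a hypothetical parameter. The names `solve_linear` and `multiband_power` are made up for this example.

```python
import math
import random


def solve_linear(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [bi] for row, bi in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x


def multiband_power(t, y, band, frequency, n_bands, lam=1.0):
    """Power of a base sinusoid plus regularized per-band sinusoids (0 to 1)."""
    omega = 2.0 * math.pi * frequency
    # Assumption: center each band on its own mean instead of fitting offsets.
    sums, counts = [0.0] * n_bands, [0] * n_bands
    for yi, bi in zip(y, band):
        sums[bi] += yi
        counts[bi] += 1
    yc = [yi - sums[bi] / counts[bi] for yi, bi in zip(y, band)]

    # Design matrix: columns 0-1 are the base component (shared by all bands),
    # columns 2+2k and 3+2k are the deviation component for band k.
    n_par = 2 + 2 * n_bands
    rows = []
    for ti, bi in zip(t, band):
        s, c = math.sin(omega * ti), math.cos(omega * ti)
        row = [0.0] * n_par
        row[0], row[1] = s, c
        row[2 + 2 * bi], row[3 + 2 * bi] = s, c
        rows.append(row)

    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(n_par)]
           for i in range(n_par)]
    xty = [sum(r[i] * v for r, v in zip(rows, yc)) for i in range(n_par)]
    # Ridge penalty on the band terms only: shared variation is cheaper to put
    # in the base component, and the otherwise degenerate system becomes solvable.
    for i in range(2, n_par):
        xtx[i][i] += lam
    theta = solve_linear(xtx, xty)

    resid = sum((v - sum(ri * th for ri, th in zip(r, theta))) ** 2
                for r, v in zip(rows, yc))
    return 1.0 - resid / sum(v * v for v in yc)


# Synthetic data: 3 bands, one band per observation (all values are made up).
rng = random.Random(1)
true_frequency = 0.5
amplitudes, offsets = [1.0, 0.8, 1.2], [0.0, 0.5, -0.3]
times, values, bands = [], [], []
for k in range(60):
    b = k % 3
    ti = rng.uniform(0.0, 40.0)
    times.append(ti)
    bands.append(b)
    values.append(offsets[b]
                  + amplitudes[b] * math.sin(2 * math.pi * true_frequency * ti)
                  + rng.gauss(0.0, 0.1))

frequencies = [0.2 + 0.005 * k for k in range(161)]
powers = [multiband_power(times, values, bands, f, n_bands=3) for f in frequencies]
best_frequency = frequencies[powers.index(max(powers))]
print(f"peak at frequency {best_frequency:.3f}")
```

A very large `lam` suppresses the band terms and approaches the single shared periodogram; a very small `lam` leaves each band nearly free, which mirrors the two naive approaches at the extremes.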

Multiband Periodogram on realistic survey data
... Detects period with high significance when single-band approaches fail!

Statistics: We make progress by “opening the black box” and specializing or extending standard statistical methods.

A Database for Images*
* P. Mehta et al. 2017

Standard data-mining tools are not built for typical scientific data:
- Typical databases are optimized for tabular data.
- Typical astronomy data consists of arrays of pixels.
Key question: can scientific image analysis be done at scale on existing systems?

We Explored Five Systems:
- SciDB: database architecture purpose-built for computation on multi-dim arrays.
- Dask: Python package aimed at parallelization of scientific workflows.
- Myria: shared-nothing DBMS developed by members of our UW team.
- Spark: popular in-memory big data system with wide adoption & Python interface.
- Tensorflow: system optimized for operations on N-dimensional tensors.

Key Takeaways (comparison table: Dask, Myria, SciDB, Spark, Tensorflow rated against the criteria below)
- Sufficient Primitives: Scientific pipelines are complex enough that they rarely map onto built-in primitives for existing big data systems.
- Python UDF Support: In the meantime, seamless support for user-defined functions (UDFs) is absolutely essential for scientific use-cases.
- Flexible data formats: Support for flexible domain-specific data formats in pipelines is very important for any nontrivial computational task.
- Automatic tuning: Ideally, parallel computations & memory usage should be tuned automatically by the systems. None of the explored systems do this particularly well.
- Streamlined Installation: Installation headaches are the easiest way to drive frustration. Streamlined installation, particularly on the cloud, is a must.
- Large User Community: A large and active user & developer community makes solving problems & getting questions answered much easier.
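To make the UDF point concrete, here is a minimal sketch of the pattern a scientific pipeline needs: cut an image into tiles and apply an arbitrary Python function to each tile in parallel. It uses only the standard library (`concurrent.futures`) and does not represent the API of any of the five systems; the helper names `split_into_tiles`, `tile_flux`, and `map_udf`, and the toy image, are assumptions for illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_tiles(image, tile_size):
    """Cut a 2-D image (list of rows) into tiles of at most tile_size x tile_size."""
    tiles = []
    for r0 in range(0, len(image), tile_size):
        for c0 in range(0, len(image[0]), tile_size):
            tiles.append([row[c0:c0 + tile_size]
                          for row in image[r0:r0 + tile_size]])
    return tiles


def tile_flux(tile):
    """Example UDF: total pixel value in one tile. Any Python callable could go here."""
    return sum(sum(row) for row in tile)


def map_udf(udf, tiles, max_workers=4):
    """Apply a user-defined function to every tile; results keep tile order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(udf, tiles))


# Toy 6 x 8 "image" with made-up pixel values.
image = [[(r * 8 + c) % 5 for c in range(8)] for r in range(6)]
tiles = split_into_tiles(image, tile_size=4)
fluxes = map_udf(tile_flux, tiles)
print(len(tiles), "tiles, total flux", sum(fluxes))
```

The pipeline logic lives in an ordinary Python function rather than in a fixed set of built-in operators, so swapping `tile_flux` for a source-detection or background-estimation routine requires no change to the parallel driver.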

See our paper for more detailed quantitative breakdown & discussion
https://arxiv.org/abs/1612.02485

Data Mining: There is room for research in development of databases purpose-built for analysis of scientific imagery.

The Cannon*
* M. Ness et al. 2015

Challenge: given spectra, determine labels (e.g. temperature, surface gravity, metal content, etc.)
Image: APOGEE project

Textbook Machine Learning
Training Data -> Model
Model + Unknown data -> Predictions

Reality: ML (often) doesn’t work in Astronomy
- Most algorithms don’t suitably handle noise or measurement errors
- Unlabeled data is often statistically distinct from training data (e.g. fainter)

The Cannon: Turning ML Around
Hard: Observed Data (with noise) -> ML Model -> Multiple Labels (no noise)
Easier: Multiple Labels (no noise) -> ML Model -> Observed Data (with noise)
Key insight: predict data from labels to create a data-driven generative model & treat label prediction as a least squares inference problem.
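The "turned around" workflow can be sketched in a few lines. This is a deliberately reduced illustration, not The Cannon itself: it assumes a single label and a model that is linear in that label at every pixel. Training fits flux as a function of label; inference then finds the label that best reproduces a noisy spectrum, weighting each pixel by its measurement error. The function names and all synthetic numbers are assumptions.

```python
import random


def train_pixel_models(train_labels, train_spectra):
    """Fit flux = a + b * label independently at every pixel (ordinary least squares)."""
    n = len(train_labels)
    mean_l = sum(train_labels) / n
    var_l = sum((l - mean_l) ** 2 for l in train_labels)
    coeffs = []
    for j in range(len(train_spectra[0])):
        fluxes = [spec[j] for spec in train_spectra]
        mean_f = sum(fluxes) / n
        b = sum((l - mean_l) * (f - mean_f)
                for l, f in zip(train_labels, fluxes)) / var_l
        coeffs.append((mean_f - b * mean_l, b))
    return coeffs


def infer_label(spectrum, sigma, coeffs):
    """Weighted least-squares label for one spectrum with per-pixel noise sigma."""
    num = sum(b * (f - a) / s ** 2 for f, s, (a, b) in zip(spectrum, sigma, coeffs))
    den = sum(b * b / s ** 2 for s, (a, b) in zip(sigma, coeffs))
    return num / den


# Synthetic "spectra": 50 pixels whose flux depends linearly on one label.
rng = random.Random(2)
n_pixels = 50
true_a = [rng.uniform(0.8, 1.0) for _ in range(n_pixels)]
true_b = [rng.uniform(-0.2, 0.2) for _ in range(n_pixels)]

# High-quality training set: labels known, spectra nearly noise-free.
train_labels = [rng.uniform(-1.0, 1.0) for _ in range(30)]
train_spectra = [[a + b * l + rng.gauss(0.0, 0.001)
                  for a, b in zip(true_a, true_b)] for l in train_labels]
coeffs = train_pixel_models(train_labels, train_spectra)

# Noisier survey spectrum with known per-pixel uncertainties.
true_label = 0.3
sigma = [rng.uniform(0.01, 0.05) for _ in range(n_pixels)]
observed = [a + b * true_label + rng.gauss(0.0, s)
            for a, b, s in zip(true_a, true_b, sigma)]
estimated_label = infer_label(observed, sigma, coeffs)
print(f"true label {true_label}, estimated {estimated_label:.3f}")
```

Because the noise enters only at inference time, through the weights `1 / sigma**2`, a model trained on bright, well-measured stars can still be applied to fainter ones with larger uncertainties.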

Stellar spectra generated from 6 spectra-derived labels (temperature, surface gravity, metallicity, etc.)
Figure legend: true spectra, model spectra

Results: much more accurate labels, even for much fainter objects.

Machine Learning: We make progress by thinking outside the box to adapt existing ML methods to new classes of data.

- Statistics: Generalizing Lomb-Scargle
- Data Mining: Image-specific databases
- Machine Learning: Turning ML around with The Cannon

Statistics, Data Mining, and Machine Learning methods, applied naively, are often not well-suited to astronomy. But with some tweaks and some new insights, they can be!

Twitter: @jakevdp
Github: jakevdp
Web: http://vanderplas.com/
Blog: http://jakevdp.github.io/
Thank You!
