Statistics, Data Mining, and Machine Learning (mostly don't work) in Astronomy

Plenary talk at the 18th International Workshop on Advanced Computing and Analysis Techniques in Physics Research (ACAT 2017; https://indico.cern.ch/event/567550/)

Jake VanderPlas

August 23, 2017

Transcript

  1. Statistics, Data Mining, and
    Machine Learning
    In Astronomy
    Jake VanderPlas @jakevdp
    ACAT 2017

  5. Straightforward application of common
    techniques often fails. In Astronomy:
    - Data are often quite noisy
    - Pre-labeled objects are often biased
    toward easy to observe (bright and/or
    nearby) objects
    - Data are fundamentally image-based,
    which doesn’t play well with tabular
    database architectures

  6. We make progress in Astronomy by
    adapting and extending
    methods developed in other fields

  8. Case Studies:
    Statistics, Data Mining, and
    Machine Learning
    Case 1: Generalizing the
    Lomb-Scargle Periodogram*
    * J. VanderPlas et al 2015

  9. Periodic Analysis
    Robust detection of periodic variability is
    important in many areas of Astronomy.
    Large-Scale Structure (figure: Sesar et al. 2010)
    Exoplanets (image: European Space Agency)

  10. Lomb-Scargle Periodogram
    cf. Lomb (1976), Scargle (1982)
    Figure: VanderPlas & Ivezic 2015
    - Generalization of a Fourier Spectrogram
    - Effectively assumes a sinusoidal model:
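
A minimal sketch of the sinusoidal model mentioned on this slide, not the code behind the talk. At each trial frequency f it fits y(t) = c + a sin(2πft) + b cos(2πft) by weighted least squares and reports how much the fit improves χ² over a constant. The function name `sinusoid_periodogram`, the synthetic data, and the frequency grid are assumptions made for illustration.

```python
import numpy as np


def sinusoid_periodogram(t, y, dy, frequencies):
    """Power in [0, 1] at each trial frequency from a weighted sinusoid fit."""
    t, y, dy = (np.asarray(a, dtype=float) for a in (t, y, dy))
    w = 1.0 / dy  # dividing by the errors whitens the residuals

    # Reference model: a constant equal to the weighted mean.
    y_mean = np.sum(y * w**2) / np.sum(w**2)
    chi2_ref = np.sum(((y - y_mean) * w) ** 2)

    power = np.empty(len(frequencies))
    for i, f in enumerate(frequencies):
        omega_t = 2 * np.pi * f * t
        # Design matrix for y(t) = c + a sin(wt) + b cos(wt)
        X = np.column_stack([np.ones_like(t), np.sin(omega_t), np.cos(omega_t)])
        theta, *_ = np.linalg.lstsq(X * w[:, None], y * w, rcond=None)
        chi2 = np.sum(((y - X @ theta) * w) ** 2)
        power[i] = 1.0 - chi2 / chi2_ref
    return power


# Synthetic, irregularly sampled light curve (illustrative values only).
rng = np.random.default_rng(0)
t = np.sort(rng.uniform(0, 100, 60))
true_period = 2.5
dy = np.full_like(t, 0.1)
y = 10 + np.sin(2 * np.pi * t / true_period) + dy * rng.standard_normal(t.size)

frequencies = np.linspace(0.01, 1.0, 5000)
power = sinusoid_periodogram(t, y, dy, frequencies)
best_period = 1.0 / frequencies[np.argmax(power)]
print(f"best period: {best_period:.3f}")
```

For real work, `astropy.timeseries.LombScargle` offers a tested implementation of the same idea.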

  11. Problem:
    Lomb-Scargle is not designed
    for heterogeneous data.
    For example, stars observed in multiple
    bands (i.e. wavelength regions)

  12. Two Naive Multiband Approaches
    5 bands/night
    1. Ignore band distinction and fit a single periodogram to
    all bands.
    (model is highly biased: under-fits the data)
    2. Fit an independent periodogram within each band;
    combine the χ² of all K bands
    (model is too flexible: over-fits the data)

  13. Two Naive Multiband Approaches
    1 band/night
    1. Ignore band distinction and fit a single periodogram to
    all bands.
    (model is highly biased: under-fits the data)
    2. Fit an independent periodogram within each band;
    combine the χ² of all K bands
    (model is too flexible: over-fits the data)

  15. Idea: Generalize the model
    “Multiband Periodogram”
    - define a base
    component which
    contributes equally
    to all bands.

  16. Idea: Generalize the model
    “Multiband Periodogram”
    - for each band, add a
    band component to
    describe deviation
    from base model

  17. Putting it all together:
    The Multiband Periodogram
    base component + band component = multiband model
    Regularize the band component to drive
    common variation to the base model.
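
The base-plus-band construction can be sketched as one regularized linear fit per trial frequency. This is an illustrative reading of the slides, not the published implementation. The design matrix holds a shared sinusoid (base) plus one sinusoid per band, and a small ridge penalty on the band columns breaks the degeneracy between them so that variation common to all bands is assigned to the base. The function name `multiband_periodogram`, the penalty strength `reg`, and the synthetic data are assumptions.

```python
import numpy as np


def multiband_periodogram(t, y, dy, band, frequencies, reg=1e-3):
    """Power in [0, 1] from a base + per-band sinusoid fit with a ridge penalty."""
    t, y, dy = (np.asarray(a, dtype=float) for a in (t, y, dy))
    band = np.asarray(band)
    bands = np.unique(band)
    w = 1.0 / dy

    # Reference model: a separate weighted mean for each band.
    yc = y.copy()
    for b in bands:
        m = band == b
        yc[m] -= np.sum(y[m] * w[m] ** 2) / np.sum(w[m] ** 2)
    chi2_ref = np.sum((yc * w) ** 2)

    # No penalty on the 3 base columns, `reg` on every band column.
    penalty = np.full(3 + 3 * len(bands), reg)
    penalty[:3] = 0.0

    power = np.empty(len(frequencies))
    for i, f in enumerate(frequencies):
        omega_t = 2 * np.pi * f * t
        base = np.column_stack([np.ones_like(t), np.sin(omega_t), np.cos(omega_t)])
        # Band columns are the base columns masked to that band's observations.
        cols = [base] + [base * (band == b)[:, None] for b in bands]
        X = np.hstack(cols) * w[:, None]
        theta = np.linalg.solve(X.T @ X + np.diag(penalty), X.T @ (yc * w))
        chi2 = np.sum((yc * w - X @ theta) ** 2)
        power[i] = 1.0 - chi2 / chi2_ref
    return power


# Synthetic survey: one randomly chosen band per observation (illustrative values).
rng = np.random.default_rng(1)
n_obs = 75
t = np.sort(rng.uniform(0, 100, n_obs))
band = rng.choice(list("ugriz"), size=n_obs)
true_period = 2.5
amplitude = dict(u=1.0, g=0.8, r=0.6, i=0.9, z=0.7)
offset = dict(u=15.0, g=14.5, r=14.0, i=13.8, z=13.5)
dy = np.full(n_obs, 0.1)
y = np.array([offset[b] + amplitude[b] * np.sin(2 * np.pi * ti / true_period)
              for ti, b in zip(t, band)])
y += dy * rng.standard_normal(n_obs)

frequencies = np.linspace(0.01, 1.0, 3000)
power = multiband_periodogram(t, y, dy, band, frequencies)
best_period = 1.0 / frequencies[np.argmax(power)]
print(f"best period: {best_period:.3f}")
```

Subtracting the per-band means first keeps the all-zero coefficient vector a valid penalty-free solution, so the regularized χ² can never exceed the reference χ² and the power stays within [0, 1].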

  18. Multiband Periodogram
    on realistic survey data...
    Detects period with high significance
    when single-band approaches fail!

  19. Statistics:
    We make progress by
    “opening the black box”
    and specializing or extending
    standard statistical methods

  20. Case Studies:
    Statistics, Data Mining, and
    Machine Learning
    Case 2: A Database for Images*
    * P. Mehta et al 2017

  22. Key question: can scientific image analysis be
    done at scale on existing systems?
    Standard data-mining tools are not built
    for typical scientific data:
    - Typical databases are optimized for tabular data.
    - Typical astronomy data consists of arrays of pixels.

  23. We Explored Five Systems:
    - SciDB: database architecture purpose-built
    for computation on multi-dim arrays.
    - Dask: Python package aimed at
    parallelization of scientific workflows.
    - Myria: shared-nothing DBMS developed by
    members of our UW team.
    - Spark: popular in-memory big data system
    with wide adoption & Python interface.
    - Tensorflow: system optimized for operations on
    N-dimensional tensors.

  25. Key Takeaways:
    Scientific pipelines are complex enough that they
    rarely map onto built-in primitives for existing big data
    systems.
    [Table: Dask, Myria, SciDB, Spark, Tensorflow rated on
    Sufficient Primitives]

  26. Key Takeaways:
    In the meantime, seamless support for user-defined
    functions (UDFs) is absolutely essential for scientific
    use-cases.
    [Table: Dask, Myria, SciDB, Spark, Tensorflow rated on
    Sufficient Primitives, Python UDF Support]

  27. Key Takeaways:
    Support for flexible domain-specific data formats in
    pipelines is very important for any nontrivial
    computational task.
    [Table: Dask, Myria, SciDB, Spark, Tensorflow rated on
    Sufficient Primitives, Python UDF Support,
    Flexible data formats]

  28. Key Takeaways:
    Ideally, parallel computations & memory usage
    should be tuned automatically by the systems. None
    of the explored systems do this particularly well.
    [Table: Dask, Myria, SciDB, Spark, Tensorflow rated on
    Sufficient Primitives, Python UDF Support,
    Flexible data formats, Automatic tuning]

  29. Key Takeaways:
    Installation headaches are the easiest way to drive
    frustration. Streamlined installation, particularly on the
    cloud, is a must.
    [Table: Dask, Myria, SciDB, Spark, Tensorflow rated on
    Sufficient Primitives, Python UDF Support,
    Flexible data formats, Streamlined Installation,
    Automatic tuning]

  30. Key Takeaways:
    A large and active user & developer community
    makes solving problems & getting questions
    answered much easier.
    [Table: Dask, Myria, SciDB, Spark, Tensorflow rated on
    Sufficient Primitives, Python UDF Support,
    Flexible data formats, Streamlined Installation,
    Large User Community, Automatic tuning]

  31. See our paper for more detailed quantitative
    breakdown & discussion
    https://arxiv.org/abs/1612.02485

  32. Data Mining:
    There is room for research
    in development of databases
    purpose-built for analysis of
    scientific imagery.

  33. Case Studies:
    Statistics, Data Mining, and
    Machine Learning
    Case 3: The Cannon*
    * M. Ness et al., 2015

  34. Challenge: given spectra, determine labels
    (e.g. temperature, surface gravity, metal content, etc.)
    Image: APOGEE project

  38. Textbook Machine Learning
    Training Data → Model
    Model + Unknown data → Predictions

  40. Reality: ML (often) doesn’t work in Astronomy
    - Most algorithms don’t
    suitably handle noise or
    measurement errors
    - Unlabeled data is often
    statistically distinct from
    training data (e.g. fainter)

  41. The Cannon: Turning ML Around
    Hard: Observed Data (with noise) → ML Model → Multiple Labels (no noise)
    Easier: Multiple Labels (no noise) → ML Model → Observed Data (with noise)
    Key insight: predict data from labels to create a
    data-driven generative model & treat label
    prediction as a least squares inference problem.
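
A toy sketch of the key insight, under stated assumptions rather than as the method from the paper. Training fits each pixel's flux as a function of the labels, and inference then solves a noise-weighted least-squares problem for the labels of a new, noisier spectrum. The linear-in-labels model, the synthetic spectra, and all array names are assumptions chosen so that the inference step is a single linear solve.

```python
import numpy as np

rng = np.random.default_rng(42)
n_train, n_pix, n_labels = 200, 300, 3

# Synthetic "truth": every pixel's flux is a linear function of the labels.
true_coeffs = rng.normal(size=(n_labels + 1, n_pix))


def design(labels):
    """Prepend a constant column so each pixel gets its own offset."""
    labels = np.atleast_2d(labels)
    return np.column_stack([np.ones(len(labels)), labels])


# Training step: bright, well-measured stars with known labels.
train_labels = rng.normal(size=(n_train, n_labels))
train_flux = design(train_labels) @ true_coeffs
train_flux += 0.01 * rng.standard_normal(train_flux.shape)

# Generative model: predict the data (flux) from the labels, pixel by pixel.
coeffs, *_ = np.linalg.lstsq(design(train_labels), train_flux, rcond=None)

# Test step: a much noisier spectrum with unknown labels.
test_labels = np.array([0.3, -1.2, 0.8])
sigma = np.full(n_pix, 0.2)  # per-pixel measurement error
test_flux = (design(test_labels) @ true_coeffs)[0]
test_flux += sigma * rng.standard_normal(n_pix)

# Label inference as weighted least squares: (flux - offset) / sigma = A @ labels
A = (coeffs[1:] / sigma).T
b = (test_flux - coeffs[0]) / sigma
inferred, *_ = np.linalg.lstsq(A, b, rcond=None)
print("true labels:    ", test_labels)
print("inferred labels:", np.round(inferred, 3))
```

Because the noise enters through `sigma` in the inference step, each pixel of the test spectrum is weighted by its own measurement error, which is what the forward (labels to data) direction makes straightforward.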

  42. Stellar spectra generated from 6 spectra-derived
    labels (temperature, surface gravity, metallicity, etc.)
    [Figure: true spectra compared with model spectra]

  43. Results: much more accurate labels, even
    for much fainter objects.

  44. Machine Learning:
    We make progress by
    thinking outside the box to
    adapt existing ML methods
    to new classes of data

  45. Statistics: Generalizing Lomb-Scargle
    Data Mining: Image-specific databases
    Machine Learning: Turning ML around with
    The Cannon

  46. Statistics, Data Mining, and Machine Learning
    methods, applied naively, are often not
    well-suited to astronomy.
    But with some tweaks and
    some new insights, they can be!

  47. Email: [email protected]
    Twitter: @jakevdp
    Github: jakevdp
    Web: http://vanderplas.com/
    Blog: http://jakevdp.github.io/
    Thank You!
