
Image Analysis At Scale


(Presented 2017/07/14 at #SciPy2017)

Scientific discoveries are increasingly driven by the analysis of large volumes of image data, and many tools and systems have emerged to support distributed data storage and scalable computation. It is not always immediately clear, however, how well these systems support real-world scientific use cases. Our team set out to evaluate the performance and ease-of-use of five such systems (SciDB, Myria, Spark, Dask, and TensorFlow), as applied to real-world image analysis pipelines drawn from astronomy and neuroscience. We find that each tool has distinct advantages and shortcomings, which point the way to new research opportunities in making large-scale scientific image analysis both efficient and easy to use.

Jake VanderPlas

July 14, 2017


Transcript

  1. Image Analysis At Scale:
    A Comparison of Five Systems
    Jake VanderPlas @jakevdp
    SciPy 2017, July 13 2017
    Slides at:
    http://speakerdeck.com/jakevdp/image-analysis-at-scale/

  2. Image Analysis At Scale:
    A Comparison of Five Systems
    Jake VanderPlas @jakevdp
    SciPy 2017, July 13 2017
    Parmita Mehta, Sven Dorkenwald, Dongfang Zhao, Tomer Kaftan, Alvin Cheung, Magdalena
    Balazinska, Ariel Rokem, Andrew Connolly, Jake VanderPlas, Yusra AlSayyad
    Department of Astronomy

  3. Preprint available at https://arxiv.org/abs/1612.02485
    The full technical report will be presented
    this summer at the VLDB conference.

  4. How to Write a CS Paper . . .

  5. How to Write a CS Paper . . .
    1. Find a well-defined computing problem
    “Efficient generation of Fibonacci numbers is a perennial problem in
    Computer Science, and no agreed-upon standard solution yet exists.”

  6. How to Write a CS Paper . . .
    1. Find a well-defined computing problem
    “Efficient generation of Fibonacci numbers is a perennial problem in
    Computer Science, and no agreed-upon standard solution yet exists.”
    2. Design a tool that solves that problem efficiently
    “We present FibDB, the first ever relational database specifically designed for
    the generation and storage of numbers in the Fibonacci sequence.”

  7. How to Write a CS Paper . . .
    1. Find a well-defined computing problem
    “Efficient generation of Fibonacci numbers is a perennial problem in
    Computer Science, and no agreed-upon standard solution yet exists.”
    2. Design a tool that solves that problem efficiently
    “We present FibDB, the first ever relational database specifically designed for
    the generation and storage of numbers in the Fibonacci sequence.”
    3. Show that it’s 1000x faster than Hadoop.

  8. How to Write a CS Paper . . .
    1. Find a well-defined computing problem
    “Efficient generation of Fibonacci numbers is a perennial problem in
    Computer Science, and no agreed-upon standard solution yet exists.”
    2. Design a tool that solves that problem efficiently
    “We present FibDB, the first ever relational database specifically designed for
    the generation and storage of numbers in the Fibonacci sequence.”
    3. Show that it’s 1000x faster than Hadoop.
    Use a bar chart. With log scales.

  9. How to Write a CS Paper . . .
    1. Find a well-defined computing problem
    “Efficient generation of Fibonacci numbers is a perennial problem in
    Computer Science, and no agreed-upon standard solution yet exists.”
    2. Design a tool that solves that problem efficiently
    “We present FibDB, the first ever relational database specifically designed for
    the generation and storage of numbers in the Fibonacci sequence.”
    3. Show that it’s 1000x faster than Hadoop.
    Use a bar chart. With log scales.
    4. Repeat until tenured.

  10. ( I’m so sorry )

  11. Preprint available at https://arxiv.org/abs/1612.02485
    Paper Goal: evaluate existing Big Data systems on real-world
    scientific image analysis workflows & point the
    way forward for database & systems researchers.

  12. Goals of This Talk:
    Distill lessons learned for the SciPy audience,
    which is largely made up of scientific practitioners.
    Give a general idea of the strengths and weaknesses
    of each system, and what you might expect if
    applying it to your own research task.

  13. Challenges for Scaling Scientific Image Analysis
    1. Individual images are BIG, and typical
    databases aren’t optimized for very
    large data units.
    2. Images are generally stored in domain-specific
    formats (FITS, NIfTI-1, etc.); see the loading
    sketch after this list.
    3. Requires specialized operations (e.g.
    filtering, aggregations, slicing,
    stencils, spatial joins)
    4. Requires specialized analytics (e.g.
    background estimation, source
    detection, model fitting)
    Case Study: NeuroImaging
    Case Study: Astronomy
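
The point about domain-specific formats is easy to make concrete: the standard readers are astropy for FITS and nibabel for NIfTI-1. A minimal loading sketch, with hypothetical file names, might look like this:

    from astropy.io import fits   # FITS reader (astronomy)
    import nibabel as nib         # NIfTI-1 reader (neuroimaging)

    with fits.open("visit_001.fits") as hdul:
        pixels = hdul[0].data     # 2D NumPy array of pixel values

    nifti = nib.load("subject_001.nii.gz")
    volume = nifti.get_data()     # 4D NumPy array (x, y, z, volume)

A system that cannot ingest such arrays directly forces an extra conversion step before any analysis begins.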

  14. Neuroscience Case Study
    Step 1: Segmentation
    Separate foreground from
    background using Otsu
    segmentation algorithm

  15. Neuroscience Case Study
    Step 1: Segmentation
    Separate foreground from
    background using Otsu
    segmentation algorithm
    Step 2: Denoising
    Use a local means filter to
    remove noise from images

  16. Neuroscience Case Study
    Step 1: Segmentation
    Separate foreground from
    background using Otsu
    segmentation algorithm
    Step 2: Denoising
    Use a local means filter to
    remove noise from images
    Step 3: Model Fitting
    Fit a tensor model to
    describe diffusion within
    each voxel
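
These three steps map onto standard routines in dipy. The following is a minimal single-node sketch, not the paper's distributed implementation, and it assumes the 4D dMRI array data and its gradient table gtab are already loaded:

    import numpy as np
    from dipy.segment.mask import median_otsu               # Step 1: Otsu-based mask
    from dipy.denoise.nlmeans import nlmeans                 # Step 2: denoising
    from dipy.denoise.noise_estimate import estimate_sigma
    from dipy.reconst.dti import TensorModel                 # Step 3: model fitting

    # Step 1: separate foreground from background using the b0 volumes
    masked, mask = median_otsu(data, vol_idx=np.where(gtab.b0s_mask)[0])

    # Step 2: remove noise with a means-type filter (non-local means here)
    sigma = estimate_sigma(data)
    denoised = nlmeans(data, sigma=sigma, mask=mask)

    # Step 3: fit a diffusion tensor model within each voxel
    fit = TensorModel(gtab).fit(denoised, mask=mask)
    fa = fit.fa                                              # e.g. fractional anisotropy map

Scaling this per-subject pipeline across hundreds of subjects is where the five systems below come in.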

  17. Five Systems:
    SciDB: database architecture purpose-built
    for computation on multi-dim arrays.
    Dask: Python package aimed at
    parallelization of scientific workflows
    Myria: shared-nothing DBMS developed by
    members of our UW team
    Spark: popular in-memory big data system
    with wide adoption & Python interface
    TensorFlow: system optimized for operations on
    N-dimensional tensors.

  18. from scidbpy import connect
    sdb = connect(url="...")
    data_sdb = sdb.from_array(data)
    data_filtered = data_sdb.compress(
        sdb.from_array(gtab.b0s_mask),
        axis=3)                                    # Filter
    mean_b0_sdb = data_filtered.mean(index=3)      # Mean
    Language: AQL/AFL or NumPy-like Syntax
    UDFs*: Python UDF support via stream() interface
    Data: Ingested as CSV, passed around pipelines as TSV
    Neuroscience Filter & Mean Operation
    *UDF = “User Defined Function”

  19. Advantages:
    - Efficient native support for dense arrays & common
    operations (windows, joins, etc.)
    - Python UDFs supported via stream() interface
    Challenges:
    - Data passed to UDFs in TSV format, leading to
    significant data transformation overhead in the pipeline
    - Difficult installation process, no good support for cloud
    deployment
    - Integration with external packages (e.g. LSST stack) is
    quite difficult
    - stream() I/O is passed through stdin/stdout only, which
    breaks if the UDF uses those streams for other purposes

  20. modelsRDD = (imgRDD
        .map(lambda x: denoise(x, mask))
        .flatMap(lambda x: repart(x, mask))
        .groupBy(lambda x: (x[0][0], x[0][1]))
        .map(regroup)
        .map(fitmodel))
    Language: functional programming API
    UDFs: Built-in support for Python UDFs
    Data: Spark-specific RDDs (Resilient Distributed Datasets)
    Neuroscience Denoising & Model
    Fitting Operations

  21. Advantages:
    - Arbitrary Python objects as keys & straightforward
    Python UDFs streamlined the implementation
    - Succinct functional programming interface written in
    Python
    - Large user community and extensive documentation
    Challenges:
    - Caching of intermediate results is not automatic, which
    can lead to silent repeated computation (see sketch below)
    - Initial implementation easy, but required extensive
    tuning to attain computational efficiency
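
The caching challenge deserves a concrete example: RDD transformations are lazy, so an intermediate result consumed by two actions is silently recomputed unless it is explicitly persisted. A minimal sketch, reusing the hypothetical denoise and imgRDD names from the previous slide:

    denoisedRDD = imgRDD.map(lambda x: denoise(x, mask))
    denoisedRDD.cache()              # or .persist(); without this, denoise()
                                     # runs once per action below

    n_images = denoisedRDD.count()   # first action triggers the computation
    sample = denoisedRDD.take(5)     # second action reuses the cached results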

  22. conn = MyriaConnection(url="...")
    conn.create_function("Denoise", Denoise)
    query = MyriaQuery.submit("""
        T1 = SCAN(Images);
        T2 = SCAN(Mask);
        Joined = [SELECT T1.subjId, T1.imgId, T1.img, T2.mask
                  FROM T1, T2
                  WHERE T1.subjId = T2.subjId];
        Denoised = [FROM Joined EMIT
                    PYUDF(Denoise, T1.img, T1.mask) as img,
                    T1.subjId, T1.imgId];
    """)
    Language: MyriaL hybrid declarative/imperative language
    UDFs: Built-in support for Python UDFs
    Data: Flexible BLOB format (here: NumPy arrays)
    Neuroscience Denoising Operation

  23. Advantages:
    - Can directly leverage existing Python implementations
    - Declarative/imperative MyriaL syntax is more flexible
    than typical DB languages (e.g. easily supports iteration)
    Challenges:
    - Greatest efficiency attained by reimplementation of key
    pieces of the algorithm
    - Initial implementation easy, but required extensive
    tuning to attain computational efficiency

  24. for id in subjectIds:
        data[id].vols = delayed(downloadAndFilter)(id)
    for id in subjectIds:  # barrier
        data[id].numVols = len(data[id].vols.result())
    for id in subjectIds:
        means = [delayed(mean)(block) for block in
                 partitionVoxels(data[id].vols)]
        means = delayed(reassemble)(means)
        mask = delayed(median_otsu)(means)
    Language: Pure Python
    UDFs: Supported via delayed(xxx)
    Data: anything Python can handle
    Neuroscience Filter & Mean Operation

  25. Advantages:
    - Simplest installation & deployment
    - Python from the ground-up with familiar interfaces
    - Built-in Python UDFs: required little re-implementation
    of algorithms
    Challenges:
    - User must reason about when to insert evaluation
    barriers in graphs (see the sketch after this list)
    - User must choose manually how data should be
    partitioned across nodes
    - Options like futures and delayed make Dask flexible,
    but somewhat harder to use.
    - Difficult to debug: failed tasks go to a no-worker queue
    & can cause deadlock
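
To illustrate the evaluation-barrier challenge, here is a minimal sketch using the same hypothetical downloadAndFilter function as the previous slide: delayed objects build a task graph lazily, so the user must decide where to force computation before any logic that depends on concrete values can run.

    from dask import delayed, compute

    tasks = [delayed(downloadAndFilter)(sid) for sid in subjectIds]

    # Explicit barrier: block here until every download/filter task finishes,
    # because the next step needs the actual number of volumes per subject.
    volumes = compute(*tasks)

    numVols = {sid: len(vols) for sid, vols in zip(subjectIds, volumes)}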

  26. pl_inputs = []
    work = []
    for i_worker in range(len(steps[0])):
        with tf.device(steps[0][i_worker]):
            pl_inputs.append(tf.placeholder(tf.float32, shape=sh))
            work.append(tf.reduce_mean(pl_inputs[-1]))
    mean_data = []
    Language: Python used to manually set up workers
    UDFs: Not supported
    Data: TF-specific data structures, must be loaded on
    master node & distributed manually
    Neuroscience Filter & Mean Setup
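
For context, the placeholders built above still have to be fed from the master process and the partial results gathered back. A hedged sketch of that pattern in the TF 1.x API (the gRPC address and the chunks list are placeholders, not the paper's actual code):

    import tensorflow as tf  # TF 1.x API

    # 'pl_inputs' and 'work' come from the setup above; 'chunks' is a list of
    # NumPy blocks, one per worker, partitioned manually on the master node.
    with tf.Session("grpc://master:2222") as sess:
        feed = {pl: chunk for pl, chunk in zip(pl_inputs, chunks)}
        partial_means = sess.run(work, feed_dict=feed)   # one mean per worker
    mean_data.append(partial_means)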

  27. Advantages:
    -
    Challenges:
    - Limited support for distributed computation – user must
    manually map data & computation to workers
    - 2GB serialized graph size limit means pipeline had to be
    manually broken into smaller steps
    - Lack of Python UDFs requires complete re-implementation
    of the algorithm using tensorflow primitives
    - Limited set of built-in operations (e.g. does not support
    element-wise data assignment)
    (It’s clear that we are attempting to push tensorflow well beyond its
    design goals. It’s still an excellent tool for what it was designed for,
    namely deep learning workflows)

  28. Neuroscience: End-to-end Pipeline
    - Dask/Myria/Spark: similar performance, as they are all essentially
    distributing the same Python UDFs
    - SciDB: slower primarily due to conversion of data to/from TSV at the
    input/output of each Python UDF
    - Tensorflow: slower due to many limitations previously discussed

  29. See our paper for more detailed quantitative
    breakdown & discussion
    https://arxiv.org/abs/1612.02485

  30. Key Takeaways:
    Dask
    Myria
    SciDB
    Spark
    Tensorflow

  31. Key Takeaways:
    Scientific pipelines are complex enough that they
    rarely map onto built-in primitives for existing big data
    systems.
    Sufficient Primitives
    Dask
    Myria
    SciDB
    Spark
    Tensorflow
    N/A

  32. Key Takeaways:
    In the meantime, seamless support for user-defined
    functions (UDFs) is absolutely essential for scientific
    use-cases
    Sufficient Primitives
    Python UDF Support
    Dask
    Myria
    SciDB
    Spark
    Tensorflow
    N/A

  33. Key Takeaways:
    Sufficient Primitives
    Support for flexible domain-specific data formats in
    pipelines is very important for any nontrivial
    computational task
    Python UDF Support
    Flexible data formats
    Dask
    Myria
    SciDB
    Spark
    Tensorflow
    N/A

  34. Key Takeaways:
    Sufficient Primitives
    Ideally, parallel computations & memory usage
    should be tuned automatically by the systems. None
    of the explored systems do this particularly well.
    Python UDF Support
    Flexible data formats
    Automatic tuning
    Dask
    Myria
    SciDB
    Spark
    Tensorflow
    N/A

  35. Key Takeaways:
    Sufficient Primitives
    Installation headaches are the easiest way to drive
    frustration. Streamlined installation, particularly on the
    cloud, is a must
    Python UDF Support
    Flexible data formats
    Streamlined Installation
    Automatic tuning
    Dask
    Myria
    SciDB
    Spark
    Tensorflow
    N/A

  36. Dask
    Myria
    SciDB
    Spark
    Tensorflow
    Key Takeaways:
    Sufficient Primitives
    A large and active user & developer community
    makes solving problems & getting questions
    answered much easier.
    Python UDF Support
    Flexible data formats
    Streamlined Installation
    Large User Community
    Automatic tuning
    N/A

  37. Who wins?
    Lack of primitives means each is an exercise in
    sending Python UDFs to data on distributed nodes.
    This is an ancillary mode of computation for most
    systems, and skips many of their efficiencies.
    The exception is Dask, which is specifically designed
    for this mode of computation.
    Bottom Line: Use Dask unless you know your
    use-case is covered by other systems’ primitives.

  38. Email: [email protected]
    Twitter: @jakevdp
    Github: jakevdp
    Web: http://vanderplas.com/
    Blog: http://jakevdp.github.io/
    Thank You!
    Paper preprint: https://arxiv.org/abs/1612.02485
    Slides: http://speakerdeck.com/jakevdp/image-analysis-at-scale/
    Associated code is in a private GitLab repository and will be
    released after VLDB in August

  40. Key Takeaway:
    Existing big data systems have many
    potential areas of improvement
    for supporting scientific workflows.
    We hope our paper will point
    the way for researchers
    developing these systems

  41. Extra Slides

  42. Case Studies:
    Neuro-Imaging
    Human Connectome Project
    900 subjects x 288 3D dMRI
    “images”, 145 x 145 x 174 voxels
    Total size: 105GB, NIfTI-1 format
    Tasks:
    - Segmentation & Masking
    - Denoising
    - Model Fitting

  43. Case Studies:
    Neuro-Imaging
    Astronomy
    High Cadence Transient Survey
    24 Visits x 60 2D Images + noise
    estimates, 4000 x 4072 pixels
    Total size: 115GB, FITS format
    Tasks:
    - Pre-processing & Cleaning
    - Patch creation
    - Co-addition
    - Source detection
    Human Connectome Project
    900 subjects x 288 3D dMRI
    “images”, 145 x 145 x 174 voxels
    Total size: 105GB, NIfTI-1 format
    Tasks:
    - Segmentation & Masking
    - Denoising
    - Model Fitting

  44. Evaluation:
    Qualitative:
    - How easy is it to implement scientific pipelines?
    - Can existing pipelines run on the system?
    - How much effort is required to implement?
    - How much technical expertise is required to
    optimize the system?
    Quantitative:
    - What is the memory consumption?
    - What is the end-to-end runtime?
    - What is the runtime for each implemented step?

  45. Neuroscience: Data Ingest
    SciDB 1: data ingest via NumPy array
    SciDB 2: data ingest direct from CSV

  46. Neuroscience: Filter and Mean

  47. Neuroscience: Denoise and Model Fit

  49. Astro Pipeline

  50. Lessons Learned for Developers
    Scientific image analytics requires:
    - Easy manipulation of multidimensional array data
    - Processing with sophisticated UDFs and UDAs
    More generally:
    - Make systems easy to deploy and easy to debug
    - Automatically tune degree of parallelism and other
    configuration parameters
    - Gracefully spill to disk: out-of-memory errors remain
    too common
    - Read existing scientific file formats

  51. Lessons Learned for Users
    Key Decision: Reuse or Rewrite
    - Rewriting code can yield higher performance
    - Reusing saves time and avoids new bugs
    Turning a serial computation into a parallel
    computation remains challenging

  52. Lessons Learned for Researchers
    Need to efficiently support pipelines with UDFs
    Image analytics is memory intensive
    - Need to efficiently manage memory
    - Individual records are large
    Self-tuning & robust systems are a must.
