
Image Analysis At Scale

(Presented 2017/07/14 at #SciPy2017)

Scientific discoveries are increasingly driven by the analysis of large volumes of image data, and many tools and systems have emerged to support distributed data storage and scalable computation. It is not always immediately clear, however, how well these systems support real-world scientific use cases. Our team set out to evaluate the performance and ease-of-use of five such systems (SciDB, Myria, Spark, Dask, and TensorFlow), as applied to real-world image analysis pipelines drawn from astronomy and neuroscience. We find that each tool has distinct advantages and shortcomings, which point the way to new research opportunities in making large-scale scientific image analysis both efficient and easy to use.

Jake VanderPlas

July 14, 2017


Transcript

  1. Image Analysis At Scale: A Comparison of Five Systems Jake

    VanderPlas @jakevdp SciPy 2017, July 13 2017 Slides at: http://speakerdeck.com/jakevdp/image-analysis-at-scale/
  2. Image Analysis At Scale: A Comparison of Five Systems Jake

    VanderPlas @jakevdp SciPy 2017, July 13 2017 Parmita Mehta, Sven Dorkenwald, Dongfang Zhao, Tomer Kaftan, Alvin Cheung, Magdalena Balazinska, Ariel Rokem, Andrew Connolly, Jake VanderPlas, Yusra AlSayyad Department of Astronomy
  3. How to Write a CS Paper . . . 1.

    Find a well-defined computing problem “Efficient generation of Fibonacci numbers is a perennial problem in Computer Science, and no agreed-upon standard solution yet exists.”
  4. How to Write a CS Paper . . . 1.

    Find a well-defined computing problem “Efficient generation of Fibonacci numbers is a perennial problem in Computer Science, and no agreed-upon standard solution yet exists.” 2. Design a tool that solves that problem efficiently “We present FibDB, the first ever relational database specifically designed for the generation and storage of numbers in the Fibonacci sequence.”
  5. How to Write a CS Paper . . . 1.

    Find a well-defined computing problem “Efficient generation of Fibonacci numbers is a perennial problem in Computer Science, and no agreed-upon standard solution yet exists.” 2. Design a tool that solves that problem efficiently “We present FibDB, the first ever relational database specifically designed for the generation and storage of numbers in the Fibonacci sequence.” 3. Show that it’s 1000x faster than Hadoop.
  6. How to Write a CS Paper . . . 1.

    Find a well-defined computing problem “Efficient generation of Fibonacci numbers is a perennial problem in Computer Science, and no agreed-upon standard solution yet exists.” 2. Design a tool that solves that problem efficiently “We present FibDB, the first ever relational database specifically designed for the generation and storage of numbers in the Fibonacci sequence.” 3. Show that it’s 1000x faster than Hadoop. Use a bar chart. With log scales.
  7. How to Write a CS Paper . . . 1.

    Find a well-defined computing problem “Efficient generation of Fibonacci numbers is a perennial problem in Computer Science, and no agreed-upon standard solution yet exists.” 2. Design a tool that solves that problem efficiently “We present FibDB, the first ever relational database specifically designed for the generation and storage of numbers in the Fibonacci sequence.” 3. Show that it’s 1000x faster than Hadoop. Use a bar chart. With log scales. 4. Repeat until tenured.
  8. Preprint available at https://arxiv.org/abs/1612.02485 Paper Goal: evaluate existing Big Data

    systems on real-world scientific image analysis workflows & point the way forward for database & systems researchers.
  9. Goals of This Talk: Distill lessons learned for the SciPy

    audience, which is largely made up of scientific practitioners. Give a general idea of the strengths and weaknesses of each system, and what you might expect if applying it to your own research task.
  10. Challenges for Scaling Scientific Image Analysis
      1. Individual images are BIG, and typical databases aren't optimized for very large data units.
      2. Images are generally stored in domain-specific formats (FITS, NIfTI-1, etc.; a minimal loading sketch follows below).
      3. Requires specialized operations (e.g. filtering, aggregations, slicing, stencils, spatial joins).
      4. Requires specialized analytics (e.g. background estimation, source detection, model fitting).
      Case Study: NeuroImaging / Case Study: Astronomy
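    To make the format point concrete, here is a minimal sketch (file paths are placeholders) of loading the two formats named above with the standard Python readers, nibabel for NIfTI-1 and astropy for FITS; the systems compared here generally need the data converted or read inside UDFs rather than ingesting these formats natively.

      import nibabel as nib            # NIfTI-1 reader used in neuroimaging
      from astropy.io import fits      # FITS reader used in astronomy

      # NIfTI-1: one subject's diffusion MRI volume (placeholder path)
      dmri = nib.load("subject_dmri.nii.gz")
      dmri_data = dmri.get_data()      # 4D NumPy array (x, y, z, volume)

      # FITS: one visit's image plus its header metadata (placeholder path)
      with fits.open("visit_image.fits") as hdul:
          image = hdul[0].data         # 2D NumPy array of pixel values
          header = hdul[0].header      # exposure/telescope metadata

      print(dmri_data.shape, image.shape)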
  11. Neuroscience Case Study Step 1: Segmentation Separate foreground from background

    using Otsu segmentation algorithm Step 2: Denoising Use a local means filter to remove noise from images
  12. Neuroscience Case Study
      Step 1: Segmentation: separate foreground from background using the Otsu segmentation algorithm
      Step 2: Denoising: use a local means filter to remove noise from the images
      Step 3: Model Fitting: fit a tensor model to describe diffusion within each voxel
      (A serial sketch of these three steps follows below.)
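    A minimal serial sketch of the three steps above, written here with DIPY as an assumption on my part; `data` (a 4D dMRI array) and `gtab` (its gradient table) are assumed to be loaded already, and DIPY's non-local means routine stands in for the means filter mentioned in the slide.

      from dipy.segment.mask import median_otsu             # Step 1: Otsu-based segmentation
      from dipy.denoise.nlmeans import nlmeans               # Step 2: means-filter denoising (stand-in)
      from dipy.denoise.noise_estimate import estimate_sigma
      from dipy.reconst.dti import TensorModel               # Step 3: per-voxel diffusion tensor model

      # `data` (4D array) and `gtab` (gradient table) are assumed already loaded.
      # Step 1: separate foreground from background
      masked_data, mask = median_otsu(data, vol_idx=list(range(data.shape[-1])))

      # Step 2: denoise within the brain mask
      sigma = estimate_sigma(data)
      denoised = nlmeans(data, sigma=sigma, mask=mask)

      # Step 3: fit a tensor model describing diffusion in each voxel
      model = TensorModel(gtab)
      fit = model.fit(denoised, mask=mask)
      fa = fit.fa    # e.g. a fractional anisotropy map derived from the fit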
  13. Five Systems: SciDB: database architecture purpose-built for computation on multi-dimensional arrays. Dask: Python package aimed at parallelization of scientific workflows. Myria: shared-nothing DBMS developed by members of our UW team. Spark: popular in-memory big data system with wide adoption and a Python interface. TensorFlow: system optimized for operations on N-dimensional tensors.
  14. Neuroscience Filter & Mean Operation (SciDB)

      from scidbpy import connect
      sdb = connect(url="...")
      data_sdb = sdb.from_array(data)
      data_filtered = data_sdb.compress(
          sdb.from_array(gtab.b0s_mask), axis=3)   # Filter
      mean_b0_sdb = data_filtered.mean(index=3)    # Mean

      Language: AQL/AFL or NumPy-like Syntax
      UDFs*: Python UDF support via stream() interface
      Data: Ingested as CSV, passed around pipelines as TSV
      *UDF = "User Defined Function"
  15. Advantages: - Efficient native support for dense arrays & common operations (windows, joins, etc.) - Python UDFs supported via stream() interface Challenges: - Data passed to UDFs in TSV format, leading to significant data transformation overhead in the pipeline (a schematic sketch of this round trip follows below) - Difficult installation process, no good support for cloud deployment - Integration with external packages (e.g. LSST stack) is quite difficult - stream() I/O read through stdin/stdout only, which breaks if the UDF uses this for other purposes
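    To illustrate the TSV overhead called out above, here is a schematic sketch (not SciDB's exact stream() protocol; the denoise() body is a stand-in) of what a streamed Python UDF ends up doing: every chunk arrives as text on stdin, is parsed back into arrays, and is re-serialized to text on stdout.

      import sys
      import numpy as np

      def denoise(values):
          # stand-in for the real per-chunk computation
          return values - values.mean()

      # chunk contents arrive as TSV text on stdin
      for line in sys.stdin:
          values = np.array([float(v) for v in line.rstrip("\n").split("\t")])
          result = denoise(values)
          # ...and go back out as TSV text on stdout, so every chunk pays a
          # parse/serialize round trip in addition to the actual computation
          sys.stdout.write("\t".join(repr(v) for v in result) + "\n")
      sys.stdout.flush()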
  16. Neuroscience Denoising & Model Fitting Operations (Spark)

      modelsRDD = (imgRDD
          .map(lambda x: denoise(x, mask))
          .flatMap(lambda x: repart(x, mask))
          .groupBy(lambda x: (x[0][0], x[0][1]))
          .map(regroup)
          .map(fitmodel))

      Language: functional programming API
      UDFs: Built-in support for Python UDFs
      Data: Spark-specific RDDs (Resilient Distributed Datasets)
  17. Advantages: - Arbitrary Python objects as keys & straightforward Python UDFs streamlined implementation - Succinct functional programming interface written in Python - Large user community and extensive documentation Challenges: - Caching of intermediate results is not automatic, which can lead to silent repeated computation (see the persist() sketch below) - Initial implementation easy, but required extensive tuning to attain computational efficiency
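    A minimal PySpark sketch of the caching pitfall; denoise(), fitmodel(), and the toy data here are placeholders standing in for the real UDFs in the pipeline above. Without an explicit cache()/persist(), each action silently recomputes the full lineage.

      import numpy as np
      from pyspark import SparkContext, StorageLevel

      def denoise(img, mask):          # stand-in for the real denoising UDF
          return np.where(mask, img, 0.0)

      def fitmodel(img):               # stand-in for the real model-fitting UDF
          return img.mean()

      sc = SparkContext(appName="caching-example")   # assumes local or cluster Spark
      mask = np.ones((4, 4), dtype=bool)
      imgRDD = sc.parallelize([np.random.rand(4, 4) for _ in range(8)])

      denoised = imgRDD.map(lambda x: denoise(x, mask))

      # Without an explicit persist()/cache(), each action below would silently
      # recompute the whole denoise() lineage from scratch.
      denoised.persist(StorageLevel.MEMORY_AND_DISK)

      n_images = denoised.count()                    # first action: computes and caches
      models = denoised.map(fitmodel).collect()      # second action: reuses the cached RDD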
  18. Neuroscience Denoising Operation (Myria)

      conn = MyriaConnection(url="...")
      conn.create_function("Denoise", Denoise)
      query = MyriaQuery.submit("""
          T1 = SCAN(Images);
          T2 = SCAN(Mask);
          Joined = [SELECT T1.subjId, T1.imgId, T1.img, T2.mask
                    FROM T1, T2
                    WHERE T1.subjId = T2.subjId];
          Denoised = [FROM Joined
                      EMIT PYUDF(Denoise, T1.img, T1.mask) as img,
                           T1.subjId, T1.imgId];
      """)

      Language: MyriaL hybrid declarative/imperative language
      UDFs: Built-in support for Python UDFs
      Data: Flexible BLOB format (here: NumPy arrays)
  19. Advantages: - Can directly leverage existing Python implementations - Declarative/Imperative MyriaL syntax is more flexible than typical DB languages (e.g. easily supports iteration) Challenges: - Greatest efficiency attained by reimplementation of key pieces of the algorithm - Initial implementation easy, but required extensive tuning to attain computational efficiency
  20. Neuroscience Filter & Mean Operation (Dask)

      for id in subjectIds:
          data[id].vols = delayed(downloadAndFilter)(id)

      for id in subjectIds:  # barrier
          data[id].numVols = len(data[id].vols.result())

      for id in subjectIds:
          means = [delayed(mean)(block)
                   for block in partitionVoxels(data[id].vols)]
          means = delayed(reassemble)(means)
          mask = delayed(median_otsu)(means)

      Language: Pure Python
      UDFs: Supported via delayed(xxx)
      Data: anything Python can handle
  21. Advantages: - Simplest installation & deployment - Python from the ground-up with familiar interfaces - Built-in Python UDFs: required little re-implementation of algorithms Challenges: - User must reason about when to insert evaluation barriers in graphs (see the barrier sketch below) - User must choose manually how data should be partitioned across nodes - Options like futures and delayed make Dask flexible, but somewhat harder to use - Difficult to debug: failed tasks go to a no-worker queue & can cause deadlock
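    A minimal dask.delayed sketch of the evaluation-barrier point; clean() and combine() are hypothetical UDFs, not functions from the paper. Nothing runs until compute() is called, so the user decides where the barriers go.

      import numpy as np
      from dask import delayed, compute

      def clean(img):                      # hypothetical per-image UDF
          return img - img.mean()

      def combine(blocks):                 # hypothetical aggregation step
          return np.mean(blocks, axis=0)

      images = [np.random.rand(64, 64) for _ in range(10)]

      cleaned = [delayed(clean)(img) for img in images]   # only builds the task graph
      summary = delayed(combine)(cleaned)                 # still nothing has run

      # Explicit evaluation barrier: the whole graph executes here, and
      # everything upstream is materialized at once.
      (result,) = compute(summary)
      print(result.shape)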
  22. Neuroscience Filter & Mean Setup (TensorFlow)

      import tensorflow as tf

      pl_inputs = []
      work = []
      for i_worker in range(len(steps[0])):
          with tf.device(steps[0][i_worker]):
              # dtype is required by tf.placeholder
              pl_inputs.append(tf.placeholder(tf.float32, shape=sh))
              work.append(tf.reduce_mean(pl_inputs[-1]))
      mean_data = []

      Language: Python used to manually set up workers
      UDFs: Not supported
      Data: TF-specific data structures, must be loaded on master node & distributed manually
      (A sketch of feeding data through this setup follows below.)
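    A hedged, self-contained sketch of driving a setup like the one above with the TF 1.x session API; the device list, shape, and random chunks are placeholders standing in for steps[0], sh, and the real image data. The point it illustrates: the master process has to slice and feed the data itself.

      import numpy as np
      import tensorflow as tf   # TF 1.x API, current at the time of this talk

      sh = (145, 145, 174)                   # per-chunk shape; placeholder value
      devices = ["/cpu:0", "/cpu:0"]         # stand-ins for the worker devices in steps[0]

      pl_inputs, work = [], []
      for dev in devices:
          with tf.device(dev):
              pl_inputs.append(tf.placeholder(tf.float32, shape=sh))
              work.append(tf.reduce_mean(pl_inputs[-1]))

      # The master process must slice the data and feed one chunk per placeholder;
      # TensorFlow will not partition or ship the NumPy arrays on its own.
      chunks = [np.random.rand(*sh).astype(np.float32) for _ in devices]

      with tf.Session() as sess:
          mean_data = sess.run(work, feed_dict=dict(zip(pl_inputs, chunks)))

      overall_mean = float(np.mean(mean_data))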
  23. Advantages: - Challenges: - Limited support for distributed computation – user must manually map data & computation to workers - 2GB serialized graph size limit means pipeline had to be manually broken into smaller steps - Lack of Python UDFs requires complete re-implementation of algorithm using TensorFlow primitives - Limited set of built-in operations (e.g. does not support element-wise data assignment) (It's clear that we are attempting to push TensorFlow well beyond its design goals. It's still an excellent tool for what it was designed for, namely deep learning workflows)
  24. Neuroscience: End-to-end Pipeline - Dask/Myria/Spark: similar performance, as they are all essentially distributing the same Python UDFs - SciDB: slower primarily due to conversion of data to/from TSV at the input/output of each Python UDF - TensorFlow: slower due to many limitations previously discussed
  25. Key Takeaways: Scientific pipelines are complex enough that they rarely map onto built-in primitives for existing big data systems. [Comparison table: row "Sufficient Primitives" rated for Dask, Myria, SciDB, Spark, TensorFlow]
  26. Key Takeaways: In the meantime, seamless support for user-defined functions (UDFs) is absolutely essential for scientific use-cases. [Comparison table adds row: Python UDF Support]
  27. Key Takeaways: Support for flexible domain-specific data formats in pipelines is very important for any nontrivial computational task. [Comparison table adds row: Flexible data formats]
  28. Key Takeaways: Ideally, parallel computations & memory usage should be tuned automatically by the systems. None of the explored systems do this particularly well. [Comparison table adds row: Automatic tuning]
  29. Key Takeaways: Installation headaches are the easiest way to drive frustration. Streamlined installation, particularly on the cloud, is a must. [Comparison table adds row: Streamlined Installation]
  30. Key Takeaways: A large and active user & developer community makes solving problems & getting questions answered much easier. [Comparison table adds row: Large User Community]
  31. Who wins? Lack of primitives means each is an exercise in sending Python UDFs to data on distributed nodes. This is an ancillary mode of computation for most systems, and skips many of their efficiencies. The exception is Dask, which is specifically designed for this mode of computation. Bottom Line: Use Dask unless you know your use-case is covered by other systems' primitives.
  32. Email: [email protected] Twitter: @jakevdp Github: jakevdp Web: http://vanderplas.com/ Blog: http://jakevdp.github.io/

    Thank You! Paper preprint: https://arxiv.org/abs/1612.02485 Slides: http://speakerdeck.com/jakevdp/image-analysis-at-scale/ Associated code is in a private GitLab repository and will be released after VLDB in August
  33. Key Takeaway: Existing big data systems have many potential areas

    of improvement for supporting scientific workflows. We hope our paper will point the way for researchers developing these systems
  34. Case Studies: Neuro-Imaging Human Connectome Project 900 subjects x 288

    3D dMRI “images”, 145 x 145 x 174 voxels Total size: 105GB, NIfTI-1 format Tasks: - Segmentation & Masking - Denoising - Model Fitting
  35. Case Studies: Astronomy (High Cadence Transient Survey): 24 visits x 60 2D images + noise estimates, 4000 x 4072 pixels; total size: 115GB, FITS format. Tasks: - Pre-processing & Cleaning - Patch creation - Co-addition - Source detection. Neuro-Imaging (Human Connectome Project): 900 subjects x 288 3D dMRI "images", 145 x 145 x 174 voxels; total size: 105GB, NIfTI-1 format. Tasks: - Segmentation & Masking - Denoising - Model Fitting
  36. Evaluation: Qualitative: - How easy is it to implement scientific

    pipelines? - Can existing pipelines run on the system? - How much effort is required to implement? - How much technical expertise is required to optimize the system? Quantitative: - What is the memory consumption? - What is the end-to-end runtime? - What is the runtime for each implemented step?
  37. Lessons Learned for Developers Scientific image analytics requires: - Easy manipulation of multidimensional array data - Processing with sophisticated UDFs and UDAs More generally: - Make systems easy to deploy and easy to debug - Automatically tune degree of parallelism and other configuration parameters - Gracefully spill to disk: out-of-memory errors remain too common - Read existing scientific file formats
  38. Lessons Learned for Users Key Decision: Reuse or Rewrite -

    Rewriting code can yield higher performance - Reusing saves time and avoids new bugs Turning a serial computation into a parallel computation remains challenging
  39. Lessons Learned for Researchers Need to efficiently support pipelines with

    UDFs Image analytics is memory intensive - Need to efficiently manage memory - Individual records are large Self-tuning & robust systems are a must.