Improve portability of bioinformatics software across HPC and cloud infrastructures

6th International IBM Cloud Academy Conference 2018
25 May 2018 @ The Institute of Statistical Mathematics
Tazro Ohta, Database Center for Life Science
Tomoya Tanjo, National Institute of Informatics
Osamu Ogasawara, National Institute of Genetics

Acknowledgement
Intercloud CREST group
Staff of DDBJ and DBCLS
Galaxy Community Japan
Community members of Galaxy, CWL, and OBF

Background "Genomics as a big data science"

Many samples for many purposes
Unlike other areas, bioinformatics data analysis comes with:
Data explosion
  100 MB-100 GB per sample
  10-100,000 samples per research project
  Large-scale int'l collaborations / many individual projects
Thousands of software tools
  Open source command line tools from individual developers
  "Workflow" to connect tools and iterate over samples
"Routinely unique"

#tools > 2,000

Routinely unique
46 data analysis projects in 18 months required 34 different types of analysis
"Many projects required customized techniques that used only once"
Chang J. (2015) Core services: Reward bioinformaticians. Nature

Routines breakdown
Set up servers
Install and configure tools and workflows
Transfer data
Fetch reference data from the public databases
Manage preprocessing jobs
Set up interactive environment for statistics (e.g. Jupyter)
Collect and keep result data
Repeat on reviewer's demand
Can clouds help us reduce the cost of these routines?

Packaging bioinformatics tools and workflows
Getting out of dependency hell

Packaging tools
Efforts for years to containerize tools
Containers are now provided for most popular tools
  biocontainers
  bioboxes
Ongoing trials of other containers for HPC
  udocker
  singularity

Benchmark: native vs Docker
Native vs Docker execution benchmark comparison
"the observed standard deviation is smaller when running with Docker"
Di Tommaso P, et al. (2015) The impact of Docker containers on the performance of genomic pipelines. PeerJ

Packaging workflows
Tools are often used as components of a workflow
Everyone has their favorite workflow job management software: Galaxy, Taverna, Airflow, Nextflow, Cromwell, and... shell
Sharing workflows across different environments is hard

The age of Common Workflow Language
"A specification for describing analysis workflows and tools"
A new open-source community standard since 2014
  for all data analysis tasks, not only bioinformatics
Describes the structure of tools and workflows in YAML
  base container image, base command, input/output

CWL: how it works
Requirements
  tool and workflow definition files (.cwl)
  job configuration file (.yaml or .json)
  workflow engine that supports CWL execution

CWL in action - tool definition
    cwlVersion: v1.0
    class: CommandLineTool
    hints:
      DockerRequirement:
        dockerPull: inutano/rsem:0.1.0
    baseCommand: ["rsem-calculate-expression"]
    inputs:
      threads:
        type: int
        inputBinding:
          prefix: -p
      fastq:
        type: File
        inputBinding:
          position: 1
    outputs:
      readsPerGene:
        type: File
        outputBinding:
          glob: "*ReadsPerGene.out.tab"

CWL in action - workflow definition
    cwlVersion: v1.0
    class: Workflow
    inputs:
      runThreadN: int
      dataURL: string
    outputs:
      readsPerGene:
        type: File
        outputSource: rsem/readsPerGene
    steps:
      download_data:
        run: download_data.cwl
        in:
          dataURL: dataURL
        out: [dataFiles]
      rsem:
        run: rsem.cwl
        in:
          threads: runThreadN
          fastq: download_data/dataFiles
        out: [readsPerGene]

CWL in action - job configuration
    runThreadN: 8
    dataURL: ftp.ddbj.nig.ac.jp/pathto/example.fastq
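
Because a workflow is usually repeated over many samples, the job configuration can be generated rather than written by hand. The following is a minimal sketch, not part of the original deck: it writes one JSON job file per input URL (the deck notes that job files may be .yaml or .json) and builds the matching cwltool command line without running it. The function names, the output file naming scheme, and the example URLs are hypothetical; only the input keys runThreadN and dataURL and the workflow file name rsem_workflow.cwl come from the deck.

```python
import json
from pathlib import Path


def write_job_configs(data_urls, threads, out_dir):
    """Write one JSON job configuration per input URL and return the paths."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    for index, url in enumerate(data_urls):
        # Keys mirror the workflow inputs in the deck: runThreadN and dataURL.
        job = {"runThreadN": threads, "dataURL": url}
        # Hypothetical naming scheme: job_0000.json, job_0001.json, ...
        path = out_dir / f"job_{index:04d}.json"
        path.write_text(json.dumps(job, indent=2))
        paths.append(path)
    return paths


def cwltool_command(workflow_file, job_file):
    """Build (but do not run) the cwltool command line for one job."""
    return ["cwltool", str(workflow_file), str(job_file)]


if __name__ == "__main__":
    # Placeholder URLs for illustration only.
    urls = ["ftp.example.org/sample_a.fastq", "ftp.example.org/sample_b.fastq"]
    for job_path in write_job_configs(urls, threads=8, out_dir="jobs"):
        print(" ".join(cwltool_command("rsem_workflow.cwl", job_path)))
```

Keeping the command as a list of arguments makes it straightforward to hand over to whichever scheduler or engine the execution environment provides.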

CWL in action - execution
Using the CWL reference implementation (cwltool):
    $ cwltool rsem_workflow.cwl rsem_workflow_jobconf.yml

Basic idea
CWL focuses on the "What" of a workflow
  the structure of the workflow, including input, action, and output
"How" should be determined by the execution environment
  job scheduling and management depend on the engine

Implementations supporting CWL
Software       Platform support
cwltool        Linux, OS X, Windows, local execution only
Arvados        AWS, GCP, Azure, Slurm
Toil           AWS, Azure, GCP, Grid Engine, LSF, Mesos, OpenStack, Slurm, PBS/Torque
Rabix Bunny    Linux, OS X, GA4GH TES (experimental)
CWL-Airflow    Linux, OS X
REANA          Kubernetes, CERN OpenStack (OpenStack Magnum)
Cromwell       local, HPC, Google, HTCondor
CWLEXEC        IBM Spectrum LSF 10.1.0.3+

OK now everything's portable...
So where should I run my workflows?

Select the best instance for a given workflow
An ideal system for optimizing cloud instance selection will require resource usage data from past workflow executions
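
One way such a selection step could work is sketched below, purely as an illustration and not as the method used by the authors. Given the peak memory observed in past runs and the thread count a workflow requests, it returns the cheapest instance type that fits. The InstanceType fields, the 20% memory headroom, and the catalog names and prices in the usage example are all invented for this sketch.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class InstanceType:
    """A hypothetical catalog entry; fields and prices are illustrative."""
    name: str
    vcpus: int
    memory_gb: float
    usd_per_hour: float


def pick_instance(instances, peak_memory_gb, threads, headroom=1.2):
    """Return the cheapest instance that fits the observed usage, or None.

    peak_memory_gb is the largest memory footprint seen in past executions;
    headroom is an assumed safety factor applied on top of it.
    """
    needed_memory = peak_memory_gb * headroom
    candidates = [
        inst for inst in instances
        if inst.memory_gb >= needed_memory and inst.vcpus >= threads
    ]
    return min(candidates, key=lambda inst: inst.usd_per_hour, default=None)


if __name__ == "__main__":
    catalog = [
        InstanceType("small", 2, 8, 0.10),
        InstanceType("medium", 8, 32, 0.40),
        InstanceType("large", 16, 64, 0.80),
    ]
    print(pick_instance(catalog, peak_memory_gb=20, threads=8))
```

Hourly price alone ignores how long the workflow runs on each instance type, so a fuller version would also need the recorded execution time per instance.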

Collecting resource usage of data analysis workflows

CWL-metrics
github.com/inutano/cwl-metrics
Collects container resource usage, including
  total CPU usage
  max memory usage
  total disk IO
  exec time
Collects metadata of tools and workflows via cwltool
Easy to install, runs (almost) everywhere

CWL-metrics: How it works
Collect metrics data via influxdata/telegraf
Collect WF metadata via cwltool
Store in Elasticsearch, output summary data
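
To illustrate how periodic samples could be reduced to the four summary values listed for CWL-metrics (total CPU usage, max memory usage, total disk IO, exec time), here is a minimal sketch. It does not reproduce the actual CWL-metrics code or its Elasticsearch schema: the sample layout, with cumulative cpu_seconds and disk_io_bytes counters and an instantaneous memory_bytes reading per timestamp, is an assumption made for this example.

```python
def summarize_container(samples):
    """Reduce time-series samples of one container to a summary record.

    Each sample is a dict with the hypothetical keys:
      timestamp      seconds since some epoch
      cpu_seconds    cumulative CPU time used so far
      memory_bytes   memory in use at sampling time
      disk_io_bytes  cumulative bytes read and written so far
    """
    if not samples:
        raise ValueError("at least one sample is required")
    ordered = sorted(samples, key=lambda s: s["timestamp"])
    first, last = ordered[0], ordered[-1]
    return {
        # Cumulative counters: usage is the difference between last and first.
        "total_cpu_seconds": last["cpu_seconds"] - first["cpu_seconds"],
        "total_disk_io_bytes": last["disk_io_bytes"] - first["disk_io_bytes"],
        # Memory is a gauge, so the peak is the maximum over all samples.
        "max_memory_bytes": max(s["memory_bytes"] for s in ordered),
        "exec_time_seconds": last["timestamp"] - first["timestamp"],
    }


if __name__ == "__main__":
    demo = [
        {"timestamp": 0, "cpu_seconds": 0.0, "memory_bytes": 100, "disk_io_bytes": 0},
        {"timestamp": 5, "cpu_seconds": 4.0, "memory_bytes": 300, "disk_io_bytes": 1000},
        {"timestamp": 10, "cpu_seconds": 9.5, "memory_bytes": 200, "disk_io_bytes": 1500},
    ]
    print(summarize_container(demo))
```

The sampling interval bounds the accuracy of the peak memory value, since a short spike between two samples would not be recorded.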

Example: workflows on different instance types

Future work
A compact summary file format taken with CWL file
Support more workflow engines
Support multihost environment
Support containers other than Docker

Summary
Genomics needs more machines and easier-to-use clouds
Packaging tools and workflows for easy migration to the clouds
Collecting data for environment selection optimization
Need more metrics data from a wider variety of workflows
