Dataflow-oriented bioinformatics pipelines with Nextflow
I introduce the Dataflow programming model, show how it can be used to deal with the increasing complexity of bioinformatics pipelines, and explain how the Nextflow framework addresses many problems common to other approaches.
• … manipulation
• No dependencies on external servers
• It enables fast prototyping and makes it easy to experiment with multiple alternatives
• Linux is the integration layer in the Bioinformatics domain

• … on the target platform
• As the script gets bigger, it becomes fragile, unreadable and hard to modify
• In a large script, task inputs/outputs tend not to be clearly defined
• … scale properly with big data
• Very different parallelization strategies and implementations:
  • Shared memory (processes, threads)
  • Message passing (MPI, actors)
  • Distributed computation
  • Hardware parallelization (GPU)
• In general they provide too low-level an abstraction and require specific API skills
• They introduce hard dependencies on a specific platform/framework

• Originates in research on reactive systems to monitor and control industrial processes [1]
• All tasks are parallel and form a process network
• Tasks communicate through channels (non-blocking unidirectional FIFO queues)
• A task is executed when all its inputs are bound
• Synchronization is implicitly defined by the tasks' input/output declarations

1. G. Kahn, "The Semantics of a Simple Language for Parallel Programming," Proc. of the IFIP Congress 74, North-Holland Publishing Co., 1974
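As a minimal sketch of how this model surfaces in Nextflow (modern DSL2 syntax; process names and commands are illustrative): two tasks form a process network connected only by a channel, and the downstream task starts as soon as the upstream one binds its output.

process makeGreeting {
    output:
    path 'greeting.txt'

    script:
    """
    echo 'hello dataflow' > greeting.txt
    """
}

process shout {
    input:
    path greeting

    output:
    stdout

    script:
    """
    tr '[:lower:]' '[:upper:]' < $greeting
    """
}

workflow {
    // shout() runs as soon as makeGreeting() emits its output on the channel
    makeGreeting() | shout | view
}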
• It is executed as soon as its declared inputs are available
• It is inherently parallel
• Each task is executed in its own private directory
• Produced outputs trigger the execution of downstream tasks
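For example, a channel that emits several FASTA files (the glob pattern below is made up) launches one task instance per file, each in its own work directory, and each result flows downstream as soon as it is produced:

process countSeqs {
    input:
    path fasta

    output:
    path 'count.txt'

    script:
    """
    grep -c '^>' $fasta > count.txt
    """
}

workflow {
    // one parallel task per matching file, each run in a private work dir
    Channel.fromPath('data/*.fa') | countSeqs | view
}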
• The same pipeline can be executed on different platforms through a simple definition in the configuration file
• Local processes (Java threads)
• Resource managers (SGE, LSF, SLURM, etc.)
• Cloud (Amazon EC2, Google, etc.)
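A sketch of such a configuration (the queue name and resource values are hypothetical); the pipeline script itself stays unchanged when switching executors:

// nextflow.config
process {
    executor = 'slurm'      // or 'local', 'sge', 'lsf', ...
    queue    = 'long'       // hypothetical cluster queue
    cpus     = 4
    memory   = '8 GB'
}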
• When a task fails, the pipeline stops gracefully and reports the error cause
• Easy debugging: each task can be executed separately
• Once fixed, the execution can be resumed from the failure point
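The failure handling can be tuned per process or globally in the configuration (the values below are just one possible choice), and re-running with the -resume flag reuses the cached results of all tasks that had already completed:

// nextflow.config
process {
    errorStrategy = 'retry'   // re-submit a failed task instead of stopping
    maxRetries    = 2
}

// after fixing the cause of the failure:
//   nextflow run pipeline.nf -resume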
• … mapping of long non-coding RNAs
• ~10'000 lines of Perl code
• Runs on a single computer (single process) or on a cluster through SGE
• ~70% of the code deals with parameter handling, file splitting, job parallelization, synchronization, etc.
• Very inefficient parallelization

• Parallelization depends on the number of genomes (with 1 genome there is no parallel execution)
• A specific resource manager technology (qsub) is hard-coded into the pipeline
• If it crashes, you lose days of computation

• … lines of code vs. the 10'000-line legacy version
• Much easier to write, test and maintain
• Platform-agnostic implementation
• Parallelized splitting by query and genome (sketched below)
• Fine-grained control over parallelization
• Greatly improved task "interleaving"
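The chunk-based splitting corresponds to a one-line channel operation; the file name and chunk size below are illustrative:

workflow {
    // split the query FASTA into chunks of 100 sequences;
    // each chunk can then feed an independent, parallel alignment task
    chunks = Channel.fromPath('query.fa').splitFasta(by: 100, file: true)
    chunks.view()
}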
[Benchmark chart: execution time (30–180 mins) on 22 genomes vs. number of cluster nodes, comparing the legacy pipeline* against Nextflow runs with no split and with chunk sizes of 200, 100, 50 and 25; up to 6x faster. *Partial: not including the T-Coffee alignment step.]
• Enhance the syntax to make it more expressive
• Advanced profiling (Paraver)
• Improve grid support (SLURM, LSF, DRMAA, etc.)
• Integrate the cloud (Amazon, DNAnexus)
• Add health monitoring and automatic fail-over

• … run
• Pipeline functional logic is decoupled from the actual execution layer (local/grid/cloud/?)
• Reuse existing code/scripts
• Parallelism is defined implicitly by the tasks' input/output declarations
• On task error, it stops gracefully, reports the error cause, and can resume from the failure point