
Introducing Nextflow

Paolo Di Tommaso
February 04, 2016


Introduction to Nextflow pipeline framework given at CNAG, Barcelona


Transcript

  1. CHALLENGES
     • Optimise computation by taking advantage of distributed cluster / cloud resources
     • Simplify the deployment of complex pipelines
  2. COMPLEXITY
     • Dozens of dependencies (binary tools, compilers, libraries, system tools, etc.)
     • The experimental nature of academic software makes it difficult to install, configure and deploy
     • Heterogeneous execution platforms and system architectures (laptop → supercomputer)
  3. UNIX PIPE MODEL
     cat seqs | blastp -query - | head -n 10 | t_coffee > result
  4. WHAT WE NEED
     Compose Linux commands and scripts as usual, plus:
     • Handle multiple inputs/outputs
     • Portable across multiple platforms
     • Fault tolerance
  5. NEXTFLOW
     • Fast application prototypes
     • High-level parallelisation model
     • Portable across multiple execution platforms
     • Enable pipeline reproducibility
  6. • A pipeline script is written by composition, putting together several processes
     • A process can execute any script or tool
     • It allows reusing any existing piece of code
  7. PROCESS DEFINITION
     process foo {
         input:
         val str from 'Hello'
         output:
         file 'my_file' into result
         script:
         """
         echo $str world! > my_file
         """
     }
  8. WHAT A SCRIPT LOOKS LIKE
     sequences = Channel.fromPath("/data/sample.fasta")

     process blast {
         input:
         file 'in.fasta' from sequences
         output:
         file 'out.txt' into blast_result

         """
         blastp -query in.fasta -outfmt 6 | cut -f 2 | \
             blastdbcmd -entry_batch - > out.txt
         """
     }

     process align {
         input:
         file all_seqs from blast_result
         output:
         file 'align.txt' into align_result

         """
         t_coffee $all_seqs 2>&- | tee align.txt
         """
     }

     align_result.collectFile(name: 'final_alignment')
  9. IMPLICIT PARALLELISM
     sequences = Channel.fromPath("/data/*.fasta")

     process blast {
         input:
         file 'in.fasta' from sequences
         output:
         file 'out.txt' into blast_result

         """
         blastp -query in.fasta -outfmt 6 | cut -f 2 | \
             blastdbcmd -entry_batch - > out.txt
         """
     }

     process align {
         input:
         file all_seqs from blast_result
         output:
         file 'align.txt' into align_result

         """
         t_coffee $all_seqs 2>&- | tee align.txt
         """
     }

     align_result.collectFile(name: 'final_alignment')
  10. DATAFLOW
      • Declarative computational model for concurrent processes
      • Processes wait for data; when an input set is ready, the process is executed
      • They communicate through dataflow variables, i.e. async streams of data called channels
      • Parallelisation and task dependencies are implicitly defined by the process input/output declarations
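The dataflow model above can be sketched in a few lines (process and channel names are illustrative, using the 2016-era Nextflow syntax shown in the earlier slides): each value emitted on the channel triggers one independent task, and results are consumed as they arrive.

```groovy
// Minimal dataflow sketch: one 'square' task runs per value
// emitted on the 'numbers' channel, with no explicit scheduling code.
numbers = Channel.from(1, 2, 3)

process square {
    input:
    val x from numbers
    output:
    stdout into results

    "echo \$(( $x * $x ))"
}

// Print each result as soon as the corresponding task completes
results.subscribe { println it.trim() }
```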
  11. • The executor abstraction layer allows you to run the same script on different platforms:
      • Local (default)
      • Cluster (SGE, LSF, SLURM, Torque/PBS)
      • HPC (beta)
      • Cloud (beta)
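In practice the executor abstraction means the pipeline script stays untouched and only the configuration selects the platform; a minimal sketch (the queue name is a placeholder, not from the slides):

```groovy
// nextflow.config — sketch: switch execution platform without
// changing a line of the pipeline script
process.executor = 'slurm'   // or 'local', 'sge', 'lsf', 'pbs'
process.queue    = 'long'    // placeholder queue name
```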
  12. CLUSTER EXECUTOR
      [diagram] Nextflow runs on the login node and submits tasks through the batch scheduler to the cluster nodes, which share an NFS/GPFS file system.
  13. CONFIGURATION FILE
      process {
          executor = 'sge'
          queue = 'cn-el6'
          memory = '10GB'
          cpus = 8
          time = '2h'
      }
  14. HPC EXECUTOR
      [diagram] The login node submits a single job request spanning several cluster nodes; within that allocation, a Nextflow driver coordinates Nextflow workers.
      Job wrapper:
      #!/bin/bash
      #$ -q <queue>
      #$ -pe ompi <nodes>
      #$ -l virtual_free=<mem>
      mpirun nextflow run <your-pipeline> -with-mpi
  15. BENEFITS
      • Smaller images (~100MB)
      • Fast instantiation time (<1 sec)
      • Almost native performance
      • Easy to build, publish, share and deploy
      • Enable tool versioning and archiving
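These container benefits apply when a pipeline is pinned to an image via configuration; a minimal sketch assuming Docker-based execution (the image name is a placeholder, not from the slides):

```groovy
// nextflow.config — sketch: run every task inside a Docker container
process.container = 'my-org/my-pipeline:1.0'   // placeholder image name
docker.enabled = true
```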
  16. PROS
      • Dead-easy deployment procedure
      • Self-contained and precisely controlled runtime
      • Rapidly reproduce any former configuration
      • Consistent results over time and across different platforms
  17. SHIFTER
      • Alternative container implementation developed by NERSC (Berkeley Lab)
      • HPC friendly; does not require special permissions
      • Compatible with Docker images
      • Integrated with the SLURM scheduler
  18. • Stop on failure / fix / resume executions
      • Automatically re-execute failing tasks, increasing the requested resources (memory, disk, etc.)
      • Ignore task errors
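The failure-handling strategies above map to per-process directives; a sketch with illustrative values (the process body and retry limits are placeholders):

```groovy
// Sketch: re-execute a failing task, requesting more memory each time
process align {
    errorStrategy 'retry'            // or 'ignore' to skip task errors
    maxRetries 3                     // placeholder retry limit
    memory { 2.GB * task.attempt }   // scale the request per attempt

    """
    t_coffee sample.fasta > align.txt
    """
}
```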
  19. WHO IS USING NEXTFLOW?
      • Campagne Lab, Weill Medical College of Cornell University
      • Center for Biotechnology, Bielefeld University
      • Genetic Cancer group, International Agency for Research on Cancer
      • Guigo Lab, Center for Genomic Regulation
      • Medical genetics diagnostics, Oslo University Hospital
      • National Marrow Donor Program
      • Joint Genome Institute
      • Parasite Genomics, Sanger Institute
  20. FUTURE WORK
      Short term
      • Built-in support for Shifter
      • Enhance the scheduling capability of the HPC execution mode
      • Version 1.0 (second half of 2016)
      Long term
      • Web user interface
      • Enhanced support for cloud (Google Compute Engine)
  21. CONCLUSION
      • Nextflow is a streaming-oriented framework for computational workflows
      • It is not meant to replace your favourite tools
      • It provides a parallel and scalable environment for your scripts
      • It enables reproducible pipeline deployment