Introducing Nextflow

Paolo Di Tommaso
February 04, 2016

Introduction to Nextflow pipeline framework given at CNAG, Barcelona


Transcript

  1. CHALLENGES
    • Optimise computation taking advantage of distributed cluster / cloud
    • Simplify deployment of complex pipelines
  2. COMPLEXITY
    • Dozens of dependencies (binary tools, compilers, libraries, system tools, etc.)
    • Academic software, experimental by nature, tends to be difficult to install, configure and deploy
    • Heterogeneous execution platforms and system architectures (laptop → supercomputer)
  3. UNIX PIPE MODEL
    cat seqs | blastp -query - | head -n 10 | t_coffee > result
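The pipe model on this slide can be sketched in Python with generators: each stage consumes records from the previous one as they become available. This is only an illustration of the composition idea; the stage names `cat`, `grep` and `head` are stand-ins chosen for this example, and the sample records are made up.

```python
from itertools import islice

def cat(lines):
    """Source stage: emit records one at a time, like `cat`."""
    for line in lines:
        yield line

def grep(pattern, records):
    """Filter stage: pass through only records containing `pattern`."""
    return (r for r in records if pattern in r)

def head(n, records):
    """Keep only the first `n` records, like `head -n`."""
    return islice(records, n)

# Made-up FASTA-like records, for illustration only.
seqs = [">seq1", "MKV", ">seq2", "MLA", ">seq3", "MTT"]

# Equivalent in spirit to: cat seqs | grep '>' | head -n 2
headers = list(head(2, grep(">", cat(seqs))))
print(headers)
```

Each stage has a single input and a single output, which is the limitation the next slide addresses.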
  4. WHAT WE NEED
    Compose Linux commands and scripts as usual
    +
    • Handle multiple inputs/outputs
    • Portable across multiple platforms
    • Fault tolerance
  5. NEXTFLOW
    • Fast application prototypes
    • High-level parallelisation model
    • Portable across multiple execution platforms
    • Enable pipeline reproducibility
  6. • A pipeline script is written by composing several processes
    • A process can execute any script or tool
    • This allows you to reuse any existing piece of code
  7. PROCESS DEFINITION
    process foo {
      input:
      val str from 'Hello'

      output:
      file 'my_file' into result

      script:
      """
      echo $str world! > my_file
      """
    }
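In the process definition, the input value `str` is substituted into the script text before the command runs. A rough Python analogy, using the standard library's `string.Template` (which happens to share the `$name` placeholder syntax), is sketched below. It only mimics the substitution step and is not how Nextflow itself is implemented.

```python
from string import Template

# The script block from the slide, with `$str` as a placeholder.
script = Template("echo $str world! > my_file")

# Binding the input value 'Hello' to `str` yields the command to run.
command = script.substitute(str="Hello")
print(command)
```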
  8. WHAT A SCRIPT LOOKS LIKE
    sequences = Channel.fromPath("/data/sample.fasta")

    process blast {
      input:
      file 'in.fasta' from sequences

      output:
      file 'out.txt' into blast_result

      """
      blastp -query in.fasta -outfmt 6 | cut -f 2 | \
      blastdbcmd -entry_batch - > out.txt
      """
    }

    process align {
      input:
      file all_seqs from blast_result

      output:
      file 'align.txt' into align_result

      """
      t_coffee $all_seqs 2>&- | tee align.txt
      """
    }

    align_result.collectFile(name: 'final_alignment')
  9. IMPLICIT PARALLELISM
    sequences = Channel.fromPath("/data/*.fasta")

    process blast {
      input:
      file 'in.fasta' from sequences

      output:
      file 'out.txt' into blast_result

      """
      blastp -query in.fasta -outfmt 6 | cut -f 2 | \
      blastdbcmd -entry_batch - > out.txt
      """
    }

    process align {
      input:
      file all_seqs from blast_result

      output:
      file 'align.txt' into align_result

      """
      t_coffee $all_seqs 2>&- | tee align.txt
      """
    }

    align_result.collectFile(name: 'final_alignment')
  10. DATAFLOW
    • Declarative computational model for concurrent processes
    • Processes wait for data; when an input set is ready, the process is executed
    • They communicate through dataflow variables, i.e. asynchronous streams of data called channels
    • Parallelisation and task dependencies are implicitly defined by the process in/out declarations
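The dataflow idea can be sketched in Python with threads standing in for processes and `queue.Queue` objects standing in for channels. This is a minimal illustration under those assumptions, not Nextflow's implementation: the `process` helper, the `DONE` sentinel and the toy tasks (`str.upper`, appending `"!"`) are invented for the example. The execution order is never stated explicitly; it follows from which queue each stage reads and writes.

```python
import queue
import threading

DONE = object()  # sentinel marking the end of a channel

def process(task, inbox, outbox):
    """Run `task` on every item arriving on `inbox`; emit results on `outbox`."""
    while True:
        item = inbox.get()        # block until an input is ready
        if item is DONE:
            outbox.put(DONE)      # propagate end-of-stream downstream
            return
        outbox.put(task(item))

# Channels connecting the stages (names borrowed from the slides).
sequences = queue.Queue()
blast_result = queue.Queue()
align_result = queue.Queue()

# Two toy "processes"; their dependency is implied by the shared channel.
workers = [
    threading.Thread(target=process, args=(str.upper, sequences, blast_result)),
    threading.Thread(target=process, args=(lambda s: s + "!", blast_result, align_result)),
]
for w in workers:
    w.start()

# Feed the first channel; downstream stages start as soon as data arrives.
for name in ["a.fasta", "b.fasta"]:
    sequences.put(name)
sequences.put(DONE)

for w in workers:
    w.join()

# Drain the final channel.
results = []
while True:
    item = align_result.get()
    if item is DONE:
        break
    results.append(item)
print(results)
```

The sketch uses one thread per stage to keep it short, so items within a stage are handled one at a time; the slides describe a model in which each ready input set can trigger its own task.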
  11. • The executor abstraction layer allows you to run the same script on different platforms
        • Local (default)
        • Cluster (SGE, LSF, SLURM, Torque/PBS)
        • HPC (beta)
        • Cloud (beta)
  12. CLUSTER EXECUTOR
    [Diagram: nextflow runs on the login node and submits tasks to the batch scheduler; the cluster nodes share NFS/GPFS storage]
  13. CONFIGURATION FILE
    process {
      executor = 'sge'
      queue = 'cn-el6'
      memory = '10GB'
      cpus = 8
      time = '2h'
    }
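To illustrate why these settings can stay scheduler-independent, here is a small Python sketch in which one task description is rendered differently per executor. Everything in it is an assumption made for illustration: the `TaskConfig` class, the two functions, the `smp` parallel environment name (site-specific on real SGE clusters) and the `h_rt`/`virtual_free` resource mapping are not taken from Nextflow, which may translate its directives differently.

```python
from dataclasses import dataclass

@dataclass
class TaskConfig:
    """Hypothetical, scheduler-neutral resource request for one task."""
    queue: str
    memory: str
    cpus: int
    time: str

def local_command(script, cfg):
    """Local execution ignores scheduler settings and runs the script directly."""
    return ["bash", script]

def sge_command(script, cfg):
    """Render the same settings as SGE `qsub` options (assumed mapping)."""
    return [
        "qsub",
        "-q", cfg.queue,
        "-pe", "smp", str(cfg.cpus),  # parallel environment name varies by site
        "-l", f"h_rt={cfg.time},virtual_free={cfg.memory}",
        script,
    ]

EXECUTORS = {"local": local_command, "sge": sge_command}

# Values from the slide, written in the forms this sketch passes to qsub.
cfg = TaskConfig(queue="cn-el6", memory="10G", cpus=8, time="02:00:00")
for name in ("local", "sge"):
    print(name, EXECUTORS[name]("task.sh", cfg))
```

Switching platform then means changing a single key, which mirrors changing the `executor` setting in the configuration file.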
  14. HPC EXECUTOR
    [Diagram: login node, NFS/GPFS storage, job request, cluster nodes; inside the HPC cluster, a nextflow cluster made of one nextflow driver and several nextflow workers]
    Job wrapper:
    #!/bin/bash
    #$ -q <queue>
    #$ -pe ompi <nodes>
    #$ -l virtual_free=<mem>
    mpirun nextflow run <your-pipeline> -with-mpi
  15. BENEFITS
    • Smaller images (~100MB)
    • Fast instantiation time (<1sec)
    • Almost native performance
    • Easy to build, publish, share and deploy
    • Enable tools versioning and archiving
  16. PROS
    • Dead easy deployment procedure
    • Self-contained and precisely controlled runtime
    • Rapidly reproduce any former configuration
    • Consistent results over time and across different platforms
  17. SHIFTER
    • Alternative implementation developed by NERSC (Berkeley Lab)
    • HPC friendly, does not require special permissions
    • Compatible with Docker images
    • Integrated with SLURM scheduler
  18. • Stop on failure / fix / resume executions
    • Automatically re-execute failing tasks increasing requested resources (memory, disk, etc.)
    • Ignore task errors
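The second strategy, re-executing a failing task with more resources, can be sketched in Python as follows. The names `run_with_retry`, `OutOfMemory` and `needs_6gb` are hypothetical, and scaling memory linearly with the attempt number is just one possible policy chosen for the example.

```python
class OutOfMemory(Exception):
    """Raised by the toy task when it is given too little memory."""

def run_with_retry(task, base_memory_gb, max_attempts=3):
    """Re-run `task`, requesting base_memory_gb * attempt on each attempt."""
    for attempt in range(1, max_attempts + 1):
        memory_gb = base_memory_gb * attempt
        try:
            return task(memory_gb), attempt
        except OutOfMemory:
            if attempt == max_attempts:
                raise  # give up: the error reaches the caller

def needs_6gb(memory_gb):
    """Toy task that only succeeds with at least 6 GB."""
    if memory_gb < 6:
        raise OutOfMemory(f"{memory_gb} GB is not enough")
    return f"done with {memory_gb} GB"

result, attempts = run_with_retry(needs_6gb, base_memory_gb=2)
print(result, attempts)
```

With a 2 GB base request the task fails at 2 GB and 4 GB and succeeds on the third attempt at 6 GB; a lower `max_attempts` would surface the error instead, which corresponds to the stop-on-failure behaviour.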
  19. WHO IS USING NEXTFLOW?
    • Campagne Lab, Weill Medical College of Cornell University
    • Center for Biotechnology, Bielefeld University
    • Genetic Cancer group, International Agency for Research on Cancer
    • Guigo Lab, Center for Genomic Regulation
    • Medical genetics diagnostic, Oslo University Hospital
    • National Marrow Donor Program
    • Joint Genome Institute
    • Parasite Genomics, Sanger Institute
  20. FUTURE WORK
    Short term
    • Built-in support for Shifter
    • Enhance scheduling capability of HPC execution mode
    • Version 1.0 (second half of 2016)
    Long term
    • Web user interface
    • Enhance support for cloud (Google Compute Engine)
  21. CONCLUSION
    • Nextflow is a streaming-oriented framework for computational workflows
    • It is not supposed to replace your favourite tools
    • It provides a parallel and scalable environment for your scripts
    • It enables reproducible pipeline deployment