Large scale genomics with Nextflow and AWS Batch

Paolo Di Tommaso
January 08, 2018

This presentation gives a short introduction to our experience deploying large-scale genomic pipelines with Nextflow and the AWS Batch cloud service.

Transcript

  1. LARGE SCALE GENOMICS WITH NEXTFLOW AND AWS BATCH
    Paolo Di Tommaso, CRG (Barcelona)
    RCUK Cloud Workshop, 8 Jan 2018
  2. WHO IS THIS CHAP?
    @PaoloDiTommaso
    Research software engineer
    Comparative Bioinformatics, Notredame Lab
    Center for Genomic Regulation (CRG)
    Author of Nextflow project
  3. GENOMIC WORKFLOWS
    • Data analysis applications to extract information from (large) genomic datasets
    • Mash-up of many different tools and scripts
    • Embarrassingly parallel: can spawn 100s-100k jobs over a distributed cluster
    • Complex dependency trees and configuration → very fragile ecosystem
  4. AWS BATCH
    • Compute jobs in the cloud in a batch fashion (i.e. asynchronously)
    • It manages the provisioning and scaling of the cluster
    • It provides the concept of a queue
    • A job is a container (cool!)
  5. WHAT DOES A TASK LOOK LIKE?
    aws s3 cp s3://{bucket}/{sample}_r1.fq .
    aws s3 cp s3://{bucket}/{sample}_r2.fq .
    aws s3 sync s3://{bucket}/assets/{reference} .
    bwa mem -t {cpus} {reference}.fa {sample}_r1.fq {sample}_r2.fq \
      | samtools sort -o {sample}.bam
    samtools index {sample}.bam
    aws s3 cp {sample}.bam s3://{bucket}/
    aws s3 cp {sample}.bam.bai s3://{bucket}/
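The placeholders in the task script above ({bucket}, {sample}, {reference}, {cpus}) have to be filled in once per sample before the job can run. A minimal sketch of that substitution step, assuming plain string templating; the function name `render_task` and the example values are hypothetical and not part of the talk:

```python
# Task script from the slide, kept as a template. The placeholders match
# the ones shown in the talk; everything else here is illustrative.
TASK_TEMPLATE = """\
aws s3 cp s3://{bucket}/{sample}_r1.fq .
aws s3 cp s3://{bucket}/{sample}_r2.fq .
aws s3 sync s3://{bucket}/assets/{reference} .
bwa mem -t {cpus} {reference}.fa {sample}_r1.fq {sample}_r2.fq \\
  | samtools sort -o {sample}.bam
samtools index {sample}.bam
aws s3 cp {sample}.bam s3://{bucket}/
aws s3 cp {sample}.bam.bai s3://{bucket}/
"""


def render_task(bucket: str, sample: str, reference: str, cpus: int) -> str:
    """Return the shell script for one sample with every placeholder filled in."""
    return TASK_TEMPLATE.format(
        bucket=bucket, sample=sample, reference=reference, cpus=cpus
    )


if __name__ == "__main__":
    # Hypothetical bucket, sample and reference names.
    print(render_task("my-bucket", "sampleA", "hg38", 4))
```

The sketch only renders text; staging inputs and outputs through S3 remains the job script's responsibility.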
  6. WHAT DOES A TASK LOOK LIKE?
    • Create a Docker image including the job script
    • Upload the Docker image to a public registry
    • Create a job template referencing the uploaded image
    • Submit the job execution with the AWS command line tool
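The four steps above map onto a short sequence of `docker` and `aws batch` commands. The sketch below only builds the command lines and does not run them. The image name, job name, queue name and the script path inside the image are hypothetical, and the `vcpus`/`memory` container properties are the older form of the job definition, so newer setups may prefer `resourceRequirements`:

```python
import json
import shlex


def deployment_commands(image, job_name, queue, vcpus=4, memory_mib=8192):
    """Build (but do not run) one command per step listed on the slide."""
    container = {
        "image": image,
        "vcpus": vcpus,
        "memory": memory_mib,          # MiB
        "command": ["/opt/task.sh"],   # hypothetical script path in the image
    }
    return [
        # 1. Create a Docker image including the job script
        ["docker", "build", "-t", image, "."],
        # 2. Upload the image to a registry
        ["docker", "push", image],
        # 3. Create a job definition (the "job template") referencing the image
        ["aws", "batch", "register-job-definition",
         "--job-definition-name", job_name,
         "--type", "container",
         "--container-properties", json.dumps(container)],
        # 4. Submit the job to a queue
        ["aws", "batch", "submit-job",
         "--job-name", job_name + "-run",
         "--job-queue", queue,
         "--job-definition", job_name],
    ]


if __name__ == "__main__":
    for cmd in deployment_commands("user/image", "align-sample", "my-queue"):
        print(shlex.join(cmd))
```

Printing the commands instead of executing them keeps the sketch safe to run without Docker or AWS credentials.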
  7. BOTTLENECKS
    • The need to handle the input downloads and output uploads reduces the workflow portability
    • Custom container images (ideally we would like to use community container images, e.g. BioContainers)
    • Orchestrating big real-world workflows can be challenging
  8. TASK EXAMPLE
    process align_sample {
      input:
      file 'reference.fa' from genome_ch
      file 'sample.fq' from reads_ch
      output:
      file 'sample.bam' into bam_ch
      script:
      """
      bwa mem reference.fa sample.fq \
        | samtools sort -o sample.bam
      """
    }

    bwa mem reference.fa sample.fq \
      | samtools sort -o sample.bam
  9. TASKS COMPOSITION
    process index_sample {
      input:
      file 'sample.bam' from bam_ch
      output:
      file 'sample.bam.bai' into bai_ch
      script:
      """
      samtools index sample.bam
      """
    }

    process align_sample {
      input:
      file 'reference.fa' from genome_ch
      file 'sample.fq' from reads_ch
      output:
      file 'sample.bam' into bam_ch
      script:
      """
      bwa mem reference.fa sample.fq \
        | samtools sort -o sample.bam
      """
    }
  10. REACTIVE NETWORK
    • Declarative computational model for parallel process executions
    • Processes wait for data; when an input set is ready, the process is executed
    • They communicate by using dataflow variables, i.e. async FIFO queues called channels
    • Parallelisation and task dependencies are implicitly defined by process in/out declarations
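The dataflow idea above can be imitated in a few lines of Python. This is only a toy model under stated assumptions, not how Nextflow is implemented: each "process" is a single thread that blocks on an input queue, the "channels" are `queue.Queue` objects, and the align and index tasks are replaced by string manipulation. The names `process`, `run_pipeline` and `DONE` are made up for the sketch:

```python
import queue
import threading

DONE = object()  # sentinel marking the end of a channel


def process(task, inputs, output):
    """Apply `task` to each item arriving on `inputs`, emitting results to `output`."""
    def worker():
        while True:
            item = inputs.get()      # blocks until an input is ready
            if item is DONE:
                output.put(DONE)     # propagate end-of-stream downstream
                return
            output.put(task(item))

    thread = threading.Thread(target=worker)
    thread.start()
    return thread


def run_pipeline(samples):
    reads_ch, bam_ch, bai_ch = queue.Queue(), queue.Queue(), queue.Queue()
    # No explicit ordering is declared: index depends on align
    # only because it reads the channel that align writes to.
    threads = [
        process(lambda fq: fq.replace(".fq", ".bam"), reads_ch, bam_ch),  # stand-in for align
        process(lambda bam: bam + ".bai", bam_ch, bai_ch),                # stand-in for index
    ]
    for sample in samples:
        reads_ch.put(sample)
    reads_ch.put(DONE)

    results = []
    while (item := bai_ch.get()) is not DONE:
        results.append(item)
    for thread in threads:
        thread.join()
    return results


if __name__ == "__main__":
    print(run_pipeline(["a.fq", "b.fq"]))
```

With one worker per process the two stages overlap in time but each stage handles one item at a time; a real workflow engine would also run many tasks of the same process in parallel.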
  11. PORTABILITY
    process {
      executor = 'slurm'
      queue = 'my-queue'
      memory = '8 GB'
      cpus = 4
      container = 'user/image'
    }
  12. PORTABILITY
    process {
      executor = 'awsbatch'
      queue = 'my-queue'
      memory = '8 GB'
      cpus = 4
      container = 'user/image'
    }
  13. AWS BATCH BENCHMARK
    • RNA-Seq quantification pipeline *
    • 375 samples taken from the ENCODE project
    • 753 jobs
    • ~65 i3.xlarge spot instances
    • ~23 h wall-time, ~4'850 CPU-hours
    * https://github.com/nextflow-io/rnaseq-encode-nf
  14. THE BILL
    • $310 (~$1.2 per sample)
    • $90 EC2 spot instances (1'400 Hrs x i3.xlarge)
    • $220 EBS storage (1 TB x ~1'400 Hrs)
    • Choosing a better sized EBS volume (~200GB) ⤇ $135 ⤇ ~$0.36 per sample
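The effect of the smaller EBS volume can be checked with a few lines of arithmetic. The sketch assumes that EBS cost scales linearly with volume size and that 1 TB is treated as 1000 GB; both are assumptions, not statements from the talk. The dollar figures and the sample count come from the slides:

```python
SAMPLES = 375          # samples in the benchmark
EC2_COST = 90.0        # USD, spot instances
EBS_COST_1TB = 220.0   # USD, 1 TB volumes over ~1'400 instance-hours


def resized_bill(volume_gb, baseline_gb=1000):
    """Total bill if EBS cost scales linearly with volume size (assumption)."""
    ebs_cost = EBS_COST_1TB * volume_gb / baseline_gb
    return EC2_COST + ebs_cost


total = resized_bill(200)
per_sample = total / SAMPLES
print(f"total ~${total:.0f}, per sample ~${per_sample:.2f}")
```

Under these assumptions the EBS share drops from $220 to $44, giving a total of about $134 and roughly $0.36 per sample, in line with the rounded figures on the slide.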
  15. TAKE HOME MESSAGE
    • Batch provides a truly scalable elastic computing environment for containerised workloads
    • Delegating the cluster provisioning is a big plus
    • Choose the size of EBS storage carefully
    • Nextflow enables the seamless deployment of scalable and portable computational workflows