
Nextflow CRG tutorial

Paolo Di Tommaso
February 26, 2015


Transcript

  1. WHAT NEXTFLOW IS • A computing runtime which executes Nextflow pipeline scripts • A programming DSL that simplifies the writing of highly parallel computational pipelines, reusing your existing scripts and tools
  2. NEXTFLOW DSL • It is NOT a new programming language • It extends the Groovy scripting language • It provides a multi-paradigm programming environment
  3. GET STARTED Login on your course laptop:

     $ cd ~/crg-course
     $ vagrant up
     $ vagrant ssh

     Once in the virtual machine:

     $ cd ~/nextflow-tutorial
     $ git pull
     $ nextflow info
  4. THE BASICS Variables and assignments:

     x = 1
     y = 10.5
     str = 'hello world!'
     p = x; q = y
  5. THE BASICS Printing values:

     x = 1
     y = 10.5
     str = 'hello world!'
     print x
     print str
     print str + '\n'
     println str
  6. THE BASICS Printing values (parenthesised form):

     x = 1
     y = 10.5
     str = 'hello world!'
     print(x)
     print(str)
     print(str + '\n')
     println(str)
  7. MORE ON STRINGS

     str = 'bioinformatics'
     print str[0]
     print "$str is cool!"
     print "Current path: $PWD"

     str = '''
         multi
         line
         string
     '''

     str = """
         User: $USER
         Home: $HOME
         """
  8. COMMON STRUCTURES & PROGRAMMING IDIOMS • Data structures: Lists & Maps • Control statements: if, for, while, etc. • Functions and classes • File I/O operations
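     A quick sketch of these idioms in the Groovy syntax Nextflow scripts use (variable and file names here are made up for illustration; `file()` is Nextflow's file helper):

     ```groovy
     // Lists and maps
     nums = [1, 2, 3, 4]
     ages = [alice: 30, bob: 25]

     // Control statements
     if( nums.size() > 2 )
         println 'more than two items'

     for( n in nums )
         println n

     // A simple function
     def square(x) { x * x }
     println square(5)

     // File I/O with Nextflow's file() helper
     file('hello.txt').text = 'Hello world\n'
     println file('hello.txt').text
     ```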
  9. MAIN ABSTRACTIONS • Processes: run any piece of script • Channels: unidirectional async queues that allow the processes to communicate • Operators: transform channel content
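     A minimal sketch showing the three abstractions working together (the process and channel names are illustrative, not from the exercises):

     ```groovy
     // Channel: feeds values into the pipeline
     greetings = Channel.from('hello', 'hola', 'ciao')

     // Process: runs a script for each item received
     process shout {
         input:
         val str from greetings

         output:
         stdout into results

         script:
         """
         echo '$str' | tr '[:lower:]' '[:upper:]'
         """
     }

     // Operator: consumes the channel content
     results.subscribe { println it.trim() }
     ```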
  10. CHANNELS • A channel connects two processes/operators • Write operations are NOT blocking • Read operations are blocking • Once an item is read, it is removed from the queue
  11. CHANNELS

     some_items = Channel.from(10, 20, 30, ..)
     my_channel = Channel.create()
     single_file = Channel.fromPath('some/file/name')
     more_files = Channel.fromPath('some/data/path/*')

     (diagram: the glob pattern emits one item per matching file: file x, file y, file z)
  12. OPERATORS • Functions applied to channels • Transform channel content • Can also be used to filter, fork and combine channels • Operators can be chained to implement custom behaviours
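     For example, filtering and transforming can be chained (a sketch; the channel contents are illustrative):

     ```groovy
     Channel.from(1, 2, 3, 4, 5, 6)
         .filter { it % 2 == 0 }        // keep even numbers only
         .map    { it * 10 }            // transform each remaining item
         .subscribe { println it }      // 20, 40, 60
     ```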
  13. OPERATORS

     nums = Channel.from(1,2,3,4)
     square = nums.map { it -> it * it }

     (diagram: nums emits 1, 2, 3, 4; map produces 1, 4, 9, 16 on the square channel)
  14. OPERATORS CHAINING

     Channel.from(1,2,3,4)
         .map { it -> [it, it*it] }
         .subscribe { num, sqr -> println "Square of: $num is $sqr" }

     // it prints
     Square of: 1 is 1
     Square of: 2 is 4
     Square of: 3 is 9
     Square of: 4 is 16
  15. SPLIT FASTA FILE(S)

     Channel.fromPath('/some/path/fasta.fa')
         .splitFasta()
         .view()

     Channel.fromPath('/some/path/fasta.fa')
         .splitFasta(by: 3)
         .view()

     Channel.fromPath('/some/path/*.fa')
         .splitFasta(by: 3)
         .view()
  16. SPLITTING OPERATORS You can split text objects or files using the splitting methods: • splitText - line by line • splitCsv - comma-separated values format • splitFasta - by FASTA sequences • splitFastq - by FASTQ sequences
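     A short sketch of two of these splitters (the file paths are placeholders):

     ```groovy
     // Emit each line of a text file as a separate item
     Channel.fromPath('data/records.txt')
         .splitText()
         .subscribe { print it }

     // Parse a CSV file row by row; each item is a list of fields
     Channel.fromPath('data/samples.csv')
         .splitCsv()
         .subscribe { row -> println "first field: ${row[0]}" }
     ```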
  17. EXAMPLE 1 • Split a FASTA file into its sequences • Parse a FASTA file and count the number of sequences matching a specified ID
  18. EXAMPLE 1

     $ nextflow run channel_split.nf

     $ nextflow run channel_filter.nf
  19. PROCESS

     process sayHello {
         input:
         val str

         output:
         stdout into result

         script:
         """
         echo $str world!
         """
     }

     str = Channel.from('hello', 'hola', 'bonjour', 'ciao')
     result.subscribe { print it }
  20. PROCESS INPUTS General syntax:

     process procName {
         input:
         <input type> <name> [from <source channel>] [attributes]

         """
         <your script>
         """
     }
  21. PROCESS INPUTS

     process procName {
         input:
         val x from ch_1
         file y from ch_2
         file 'data.fa' from ch_3
         stdin from ch_4
         set (x, 'file.txt') from ch_5

         """
         <your script>
         """
     }
  22. PROCESS INPUTS

     proteins = Channel.fromPath( '/some/path/data.fa' )

     process blastThemAll {
         input:
         file 'query.fa' from proteins

         "blastp -query query.fa -db nr"
     }
  23. PROCESS OUTPUTS

     process randomNum {
         output:
         file 'result.txt' into numbers

         '''
         echo $RANDOM > result.txt
         '''
     }

     numbers.subscribe { println "Received: " + it.text }
  24. USE YOUR FAVOURITE PROGRAMMING LANG

     process pyStuff {
         script:
         """
         #!/usr/bin/env python

         x = 'Hello'
         y = 'world!'
         print "%s - %s" % (x,y)
         """
     }
  25. EXAMPLE 2 • Execute a process running a BLAST job given an input file • Execute a BLAST job emitting the produced output
  26. EXAMPLE 2

     $ nextflow run process_input.nf

     $ nextflow run process_output.nf
  27. PIPELINE PARAMETERS Simply declare some variables prefixed by params:

     params.p1 = 'alpha'
     params.p2 = 'beta'

     When launching your script you can override the default values:

     $ nextflow run <script file> --p1 'delta' --p2 'gamma'
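     As a sketch, such a parameter can then drive a process (the process and parameter names here are made up):

     ```groovy
     // default value; override with: nextflow run script.nf --greeting 'hola'
     params.greeting = 'hello'

     process sayIt {
         output:
         stdout into out

         "echo ${params.greeting}"
     }

     out.subscribe { print it }
     ```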
  28. COLLECT FILE The collectFile operator gathers items produced by upstream processes. Collect all items into a single file:

     my_results.collectFile(name: 'result.txt')
  29. COLLECT FILE Collect the items and group them into files whose names are defined by a grouping criterion; the closure returns a [name, content] pair:

     my_items.collectFile(storeDir: 'path/name') {
         def key = get_a_key_from_the_item(it)
         def content = get_the_item_value(it)
         [ key, content ]
     }
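     A concrete sketch of that grouping closure, assuming each item is a [sampleId, sequence] pair (the data and directory name are made up):

     ```groovy
     Channel.from( ['alpha', 'AAA'], ['beta', 'BBB'], ['alpha', 'CCC'] )
         .collectFile(storeDir: 'results') { id, seq ->
             // items sharing the same id end up appended to the same file
             [ "${id}.txt", seq + '\n' ]
         }
     ```

     Here 'alpha' items are collected into results/alpha.txt and 'beta' items into results/beta.txt.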
  30. EXAMPLE 3 • Split a FASTA file, execute a BLAST query for each chunk and gather the results • Split multiple FASTA files and execute a BLAST query for each chunk
  31. EXAMPLE 3

     $ nextflow run split_fasta.nf

     $ nextflow run split_fasta.nf --chunkSize 2

     $ nextflow run split_fasta.nf --chunkSize 2 --query data/p\*.fa

     $ nextflow run split_and_collect.nf
  32. UNDERSTANDING MULTIPLE INPUTS (diagram: a process reading two queue channels takes one item from each channel per task; when a channel is exhausted the process ends)
  33. UNDERSTANDING MULTIPLE INPUTS (diagram: a singleton value β is reused across tasks, pairing with each item of the other input channel for tasks 1..n)
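     The pairwise consumption in these diagrams can be sketched as follows (channel and process names are illustrative):

     ```groovy
     nums    = Channel.from(1, 2, 3)
     letters = Channel.from('a', 'b', 'c')

     process pairUp {
         input:
         val x from nums
         val y from letters

         // each task consumes one item from each queue channel,
         // pairing them in arrival order (1-a, 2-b, 3-c in this case)
         "echo $x $y"
     }
     ```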
  34. CONFIG FILE • Pipeline configuration can be externalised to a file named nextflow.config • parameters • environment variables • required resources (mem, cpus, queue, etc.) • modules/containers
  35. CONFIG FILE

     params.p1 = 'alpha'
     params.p2 = 'beta'

     env.VAR_1 = 'some_value'
     env.CACHE_4_TCOFFEE = '/some/path/cache'
     env.LOCKDIR_4_TCOFFEE = '/some/path/lock'

     process.executor = 'sge'
  36. CONFIG FILE Alternate, (almost) equivalent syntax:

     params {
         p1 = 'alpha'
         p2 = 'beta'
     }

     env {
         VAR_1 = 'some_value'
         CACHE_4_TCOFFEE = '/some/path/cache'
         LOCKDIR_4_TCOFFEE = '/some/path/lock'
     }

     process {
         executor = 'sge'
     }
  37. HOW TO USE DOCKER Specify in the config file the Docker image to use:

     process {
         container = <docker image ID>
     }

     Add the with-docker flag when launching it:

     $ nextflow run <script name> -with-docker
  38. HOW TO USE THE CLUSTER Define the CRG executor in nextflow.config:

     // default properties for any process
     process {
         executor = 'crg'
         queue = 'short'
         cpus = 2
         memory = '4GB'
         scratch = true
     }
  39. PROCESS RESOURCES

     // default properties for any process
     process {
         executor = 'crg'
         queue = 'short'
         scratch = true
     }

     // cpus for process 'foo'
     process.$foo.cpus = 2

     // resources for 'bar'
     process.$bar.queue = 'long'
     process.$bar.cpus = 4
     process.$bar.memory = '4GB'
  40. ENVIRONMENT MODULES Specify in the config file the modules required:

     process.$foo.module = 'Bowtie2/2.2.3'

     process.$bar.module = 'TopHat/2.0.12:Boost/1.55.0'
  41. EXAMPLE 5 Log in to ANT-LOGIN:

     $ ssh username@ant-login.linux.crg.es

     If you have module configured:

     $ module avail
     $ module purge
     $ module load nextflow/0.12.3-goolf-1.4.10-no-OFED-Java-1.7.0_21

     Otherwise install it by downloading from the internet:

     $ curl -fsSL get.nextflow.io | bash
  42. EXAMPLE 5 Create the following nextflow.config file:

     process {
         executor = 'crg'
         queue = 'course'
         scratch = true
     }

     Launch the pipeline execution:

     $ nextflow run rnatoy -with-docker -with-trace