William Rowell
January 16, 2019

Detecting and Phasing Small Variants with Highly Accurate Long Reads

We summarize the challenges of small variant detection with highly accurate (>=99%) long reads and present workflow solutions using existing tools (GATK) and new tools (DeepVariant with a trained CCS model). Presented at the PacBio SMRT Informatics Workshop in San Diego.

Transcript

  1. Detecting and Phasing Small Variants with Highly Accurate Long Reads

    William Rowell, Senior Scientist, Bioinformatics Applications
    PacBio SMRT Informatics Developers Conference, January 16, 2019
    @nothingclever

    For Research Use Only. Not for use in diagnostic procedures. © Copyright 2019 by Pacific Biosciences of California, Inc. All rights reserved.
  2. AGENDA

    - Differences between highly accurate long reads and short reads
    - Calling variants with existing tools
    - Training new tools on long reads
    - Making use of phase information to improve variant calls
  4. SMALL VARIANT DETECTION WITH HIGH FIDELITY READS STARTS WITH A FAMILIAR WORKFLOW

    CCS Reads → Align with minimap2 → Detect variants with GATK HaplotypeCaller → Hard filter variants with GATK VariantFiltration → Diploid variant calls
  5. SMALL VARIANT DETECTION WITH HIGH FIDELITY READS FOLLOWS A FAMILIAR WORKFLOW

    CCS Reads → Align with minimap2 (pbmm2) → Detect variants with GATK HaplotypeCaller → Hard filter variants with GATK VariantFiltration → Diploid variant calls

    Alignment: pbmm2 --preset CCS
  6. SMALL VARIANT DETECTION WITH HIGH FIDELITY READS FOLLOWS A FAMILIAR WORKFLOW

    CCS Reads → Align with minimap2 (pbmm2) → Detect variants with GATK HaplotypeCaller → Hard filter variants with GATK VariantFiltration → Diploid variant calls

    Alignment: pbmm2 --preset CCS
    GATK HaplotypeCaller option: --pcr-indel-model AGGRESSIVE
  7. SMALL VARIANT DETECTION WITH HIGH FIDELITY READS FOLLOWS A FAMILIAR WORKFLOW

    CCS Reads → Align with minimap2 (pbmm2) → Detect variants with GATK HaplotypeCaller → Hard filter variants with GATK VariantFiltration → Diploid variant calls

    Hard filter test categories:
    - Strand Bias tests ❌
    - Mapping Quality tests ❌
    - Read position tests ❌
    - Variant Quality tests ✅
  8. SMALL VARIANT DETECTION WITH HIGH FIDELITY READS FOLLOWS A FAMILIAR WORKFLOW

    CCS Reads → Align with minimap2 (pbmm2) → Detect variants with GATK HaplotypeCaller → Hard filter variants with GATK VariantFiltration → Diploid variant calls
  10. SMALL VARIANT DETECTION WITH HIGH FIDELITY READS FOLLOWS A FAMILIAR WORKFLOW

    CCS Reads → Align with minimap2 (pbmm2) → Detect variants with GATK HaplotypeCaller → Hard filter variants with GATK VariantFiltration → Diploid variant calls

    Alignment: pbmm2 --preset CCS
    GATK HaplotypeCaller option: --pcr-indel-model AGGRESSIVE

    Hard filter thresholds:
    - SNV → QD >= 2.0
    - 1 bp Indels → QD >= 5.0
    - >1 bp Indels → QD >= 2.0

    |        | Precision | Recall  | F1      |
    |--------|-----------|---------|---------|
    | SNVs   | 99.468%   | 99.559% | 99.513% |
    | Indels | 78.977%   | 81.248% | 80.097% |
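The per-type QD thresholds on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the filtering code used for the talk: the names `QD_THRESHOLDS`, `variant_class`, and `passes_qd_filter` are hypothetical, and measuring indel size as the length difference between the REF and ALT alleles of a biallelic variant is an assumption. Only the three thresholds come from the slide.

```python
# Hypothetical helper names; only the three QD thresholds come from the slide.
QD_THRESHOLDS = {
    "snv": 2.0,         # SNV -> QD >= 2.0
    "indel_1bp": 5.0,   # 1 bp indels -> QD >= 5.0
    "indel_long": 2.0,  # >1 bp indels -> QD >= 2.0
}


def variant_class(ref: str, alt: str) -> str:
    """Classify a biallelic variant by the length difference of its alleles.

    Assumption: equal-length alleles are treated as SNVs, which also
    lumps in multi-nucleotide substitutions.
    """
    length_diff = abs(len(ref) - len(alt))
    if length_diff == 0:
        return "snv"
    if length_diff == 1:
        return "indel_1bp"
    return "indel_long"


def passes_qd_filter(ref: str, alt: str, qd: float) -> bool:
    """Return True if the variant's QD meets the threshold for its class."""
    return qd >= QD_THRESHOLDS[variant_class(ref, alt)]


print(passes_qd_filter("A", "G", 2.5))     # SNV, 2.5 >= 2.0 -> True
print(passes_qd_filter("AT", "A", 3.0))    # 1 bp deletion, 3.0 < 5.0 -> False
print(passes_qd_filter("ATTT", "A", 3.0))  # 3 bp deletion, 3.0 >= 2.0 -> True
```

The stricter cutoff for 1 bp indels is the only place where the three classes differ, so a single lookup table keyed by variant class keeps the rule easy to audit.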
  11. GOOGLE’S DEEPVARIANT CAN BE TRAINED ON NEW DATA TYPES

    Training: CCS Reads (chr. 1-19) + GIAB Truth Set → DeepVariant CNN training → CCS model

    Nature Biotechnology volume 36, pages 983–987 (2018)
  12. GOOGLE’S DEEPVARIANT CAN BE TRAINED ON NEW DATA TYPES

    Training: CCS Reads (chr. 1-19) + GIAB Truth Set → DeepVariant CNN training → CCS model

    Calling: CCS Reads → DeepVariant + CCS model → Diploid variant calls

    Nature Biotechnology volume 36, pages 983–987 (2018)
  13. GOOGLE’S DEEPVARIANT CAN BE TRAINED ON NEW DATA TYPES

    Training: CCS Reads (chr. 1-19) + GIAB Truth Set → DeepVariant CNN training → CCS model

    Calling: CCS Reads → DeepVariant + CCS model → Diploid variant calls

    Autosomes:

    |        | Precision | Recall  | F1      |
    |--------|-----------|---------|---------|
    | SNVs   | 99.914%   | 99.959% | 99.936% |
    | Indels | 96.901%   | 95.980% | 96.438% |

    Nature Biotechnology volume 36, pages 983–987 (2018)
  14. GOOGLE’S DEEPVARIANT CAN BE TRAINED ON NEW DATA TYPES

    Training: CCS Reads (chr. 1-19) + GIAB Truth Set → DeepVariant CNN training → CCS model

    Calling: CCS Reads → DeepVariant + CCS model → Diploid variant calls

    Autosomes:

    |        | Precision | Recall  | F1      |
    |--------|-----------|---------|---------|
    | SNVs   | 99.914%   | 99.959% | 99.936% |
    | Indels | 96.901%   | 95.980% | 96.438% |

    Chromosome 20:

    |        | Precision | Recall  | F1      |
    |--------|-----------|---------|---------|
    | SNVs   | 99.807%   | 99.904% | 99.855% |
    | Indels | 95.387%   | 94.501% | 94.942% |

    Poplin, R. et al. A universal SNP and small-indel variant caller using deep neural networks. Nature Biotechnology 36, 983–987 (2018)
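The F1 column in these tables is the harmonic mean of precision and recall. As a quick check, the sketch below recomputes it in Python from the chromosome 20 values on the slide (chromosome 20 is not part of the chr. 1-19 training set). The function name `f1_score` is illustrative, and the last digit can differ slightly from the slide because the published precision and recall are already rounded.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (result is in the same units as the inputs)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Chromosome 20 precision and recall from the slide, in percent.
chr20 = {
    "SNVs": (99.807, 99.904),
    "Indels": (95.387, 94.501),
}

for variant_type, (precision, recall) in chr20.items():
    print(f"{variant_type}: F1 = {f1_score(precision, recall):.3f}%")
```

Because the harmonic mean is pulled toward the lower of its two inputs, the indel F1 sits closer to the weaker indel recall than a simple average would.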
  15. INCORPORATING PHASE INFORMATION LEADS TO IMPROVEMENTS IN VARIANT RECALL AND PRECISION

    Incorporate phase information and re-genotype variant positions with WhatsHap

    | Workflow           | Variant type | Precision | Recall  | F1      |
    |--------------------|--------------|-----------|---------|---------|
    | GATK HC            | SNVs         | 99.468%   | 99.559% | 99.513% |
    | GATK HC            | Indels       | 78.977%   | 81.248% | 80.097% |
    | GATK HC + WhatsHap | SNVs         | 99.693%   | 99.792% | 99.742% |
    | GATK HC + WhatsHap | Indels       | 81.102%   | 83.818% | 82.438% |

    Ebler, J. et al. Haplotype-aware genotyping from noisy long reads. bioRxiv doi: 10.1101/293944
  16. INCORPORATING PHASE INFORMATION LEADS TO MODEST INCREASES IN VARIANT RECALL AND PRECISION

    - GATK HC + WhatsHap: Incorporate phase information and re-genotype variant positions with WhatsHap
    - DeepVariant CCS + haplotype sorting: Tag reads with phase information from trio data and sort alignments by haplotype https://goo.gl/4cnMeC

    | Workflow                            | Variant type | Precision | Recall  | F1      |
    |-------------------------------------|--------------|-----------|---------|---------|
    | GATK HC                             | SNVs         | 99.468%   | 99.559% | 99.513% |
    | GATK HC                             | Indels       | 78.977%   | 81.248% | 80.097% |
    | GATK HC + WhatsHap                  | SNVs         | 99.693%   | 99.792% | 99.742% |
    | GATK HC + WhatsHap                  | Indels       | 81.102%   | 83.818% | 82.438% |
    | DeepVariant CCS                     | SNVs         | 99.914%   | 99.959% | 99.936% |
    | DeepVariant CCS                     | Indels       | 96.901%   | 95.980% | 96.438% |
    | DeepVariant CCS + haplotype sorting | SNVs         | 99.904%   | 99.963% | 99.934% |
    | DeepVariant CCS + haplotype sorting | Indels       | 97.835%   | 97.141% | 97.486% |

    Ebler, J. et al. Haplotype-aware genotyping from noisy long reads. bioRxiv doi: 10.1101/293944
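To make "tag reads with phase information and sort alignments by haplotype" concrete, here is a toy Python sketch. It is not the pipeline from the talk: the `Read` record, its `haplotype` field (1, 2, or `None` for unphased), and the ordering rule (haplotype 1, then haplotype 2, then unphased, by start position within each group) are all assumptions for illustration. Real alignments would typically carry the phase assignment in a BAM tag rather than a Python field.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Read:
    """Hypothetical stand-in for an aligned read with a phase tag."""
    name: str
    start: int                # leftmost aligned position
    haplotype: Optional[int]  # 1 or 2 if phased, None if unphased (assumed encoding)


def sort_by_haplotype(reads: List[Read]) -> List[Read]:
    """Group reads by haplotype (1, then 2, then unphased), ordered by start within each group."""
    def key(read: Read):
        unphased = read.haplotype is None
        return (unphased, read.haplotype or 0, read.start)
    return sorted(reads, key=key)


reads = [
    Read("r1", 150, 2),
    Read("r2", 100, None),
    Read("r3", 120, 1),
    Read("r4", 90, 2),
]
print([r.name for r in sort_by_haplotype(reads)])  # ['r3', 'r4', 'r1', 'r2']
```

Grouping reads this way places reads from the same haplotype next to each other, which is the property the slide's "haplotype sorting" step relies on.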
  17. CONCLUSIONS

    - GATK HaplotypeCaller (optimized for short reads) can be used to detect SNVs with high recall and precision, but has trouble discriminating between biological indels and sequencing errors.
    - DeepVariant, when trained on long reads, can be used to detect both SNVs and indels with high recall and precision.
    - Both workflows can be improved by providing long-distance phasing information, but there’s still work to be done in this area.
  18. ACKNOWLEDGEMENTS

    - Google AI Genomics: Alexey Kolesnikov, Pi-Chuan Chang, Andrew Carroll, Mark DePristo
    - Saarland University/Max Planck Institute for Informatics: Jana Ebler, Tobias Marschall
    - PacBio: Yufeng Qian, Richard Hall, Aaron Wenger, Paul Peluso, David Rank, Mike Hunkapiller
    - DNANexus: Jason Chin
    - NIST: Nathan Olson, Justin Zook