Upgrade to Pro — share decks privately, control downloads, hide ads and more …

Pan-genomic methods for fighting reference bias

Pan-genomic methods for fighting reference bias

Given at Penn State University CSE Colloquium on March 19, 2024

Ben Langmead

March 19, 2024
Tweet

More Decks by Ben Langmead

Other Decks in Research

Transcript

  1. Ben Langmead Associate Professor, JHU Computer Science [email protected], langmead-lab.org, @BenLangmead

    Penn State CSE Colloquium, 3/19/2024 Pan-genomic methods for fighting reference bias
  2. More variation 1000 Genomes Project Consortium, Auton, A., Brooks, L.

    D., Durbin, R. M., Garrison, E. P., Kang, H. M., … Abecasis, G. R. (2015). A global reference for human genetic variation. Nature, 526(7571), 68–74. AFR EAS AMR EUR SAS
  3. Reference bias REF ALT Gene 1 (bias PAT) → Gene

    2 (strong bias MAT) → MAT PAT Confounder in allele-specific analyses
  4. Reference bias Degner, J. F., Marioni, J. C., Pai, A.

    A., Pickrell, J. K., Nkadori, E., Gilad, Y., & Pritchard, J. K. (2009). Effect of read-mapping biases on detecting allele-specific expression from RNA-sequencing data. Bioinformatics, 25(24), 3207–3212 Confounder in allele-specific analyses
  5. Reference bias Pritt, J., Chen, N. C., Langmead, B. (2018).

    FORGe: prioritizing variants for graph genomes. Genome biology, 19(1), 220. (Poor coverage in MHC region) Confounder in hypervariable regions
  6. Reference bias Wulfridge, P., Langmead, B., Feinberg, A. P., &

    Hansen, K. D. (2019). Analyzing whole genome bisulfite sequencing data from highly divergent genotypes. Nucleic acids research, 47(19), e117. Confounder when comparing inbred strains Using BL6 reference Using CAST reference
  7. Why attack reference bias? To avoid a world where diagnostics

    & therapeutics are differentially effective by population "...without a more representative reference genome, genetic medicine will never reach some ethnic groups, warns genome scientist Alicia Martin of Mass. General."
  8. Outline Why the Burrows-Wheeler Transform is key to indexing pangenomes

    How r-index emerged What r-index can already do How to push r-index (& BWT) further in the future Graphs? Why "linear" pangenomes?
  9. Outline Why the Burrows-Wheeler Transform is key to indexing pangenomes

    How r-index emerged What r-index can already do How to push r-index (& BWT) further in the future Graphs? Why "linear" pangenomes?
  10. A pangenome graph "Variation graphs... which encode the genetic variation

    within a population as a graph, have been proposed as a solution to the reference bias [problem]." Quote: Sirén, Jouni. "Indexing variation graphs." In 2017 Proceedings of the nineteenth workshop on algorithm engineering and experiments (ALENEX), pp. 13-27. Society for Industrial and Applied Mathematics, 2017. Image: Garrison, E., Sirén, J., Novak, A. M., Hickey, G., Eizenga, J. M., Dawson, E.T., … Durbin, R. (2018). Variation graph toolkit improves read mapping by representing genetic variation in the reference. Nature biotechnology, 36(9), 875–879.
  11. Graph & linear pangenomes Pritt J, Chen NC, Langmead B.

    FORGe: prioritizing variants for graph genomes. Genome Biol. 2018 Dec 17;19(1):220 Chen NC, Solomon B, Mun T, Iyer S, Langmead B. Reference flow: reducing reference bias using multiple population genomes. Genome Biol. 2021 Jan 4;22(1):8.
  12. Graph & linear pangenomes Pritt J, Chen NC, Langmead B.

    FORGe: prioritizing variants for graph genomes. Genome Biol. 2018 Dec 17;19(1):220
  13. Graph & linear pangenomes • Peak accuracy is at ~10%

    of variants, or ≥5% allele freq • Pangenome accuracy approaches but does not reach that of personalized Pritt J, Chen NC, Langmead B. FORGe: prioritizing variants for graph genomes. Genome Biol. 2018 Dec 17;19(1):220
  14. Graph & linear pangenomes Chen NC, Solomon B, Mun T,

    Iyer S, Langmead B. Reference flow: reducing reference bias using multiple population genomes. Genome Biol. 2021 Jan 4;22(1):8.
  15. Graph & linear pangenomes • Handful of population- specific linear

    references competes favorably with graph-based pangenome • But misses rare alleles • How big can we go? 1 linear genome (no bias) 6 linear genomes Pangenome graph Personalized Chen NC, Solomon B, Mun T, Iyer S, Langmead B. Reference flow: reducing reference bias using multiple population genomes. Genome Biol. 2021 Jan 4;22(1):8.
  16. Outline Why the Burrows-Wheeler Transform is key to indexing pangenomes

    How r-index emerged What r-index can already do How to push r-index (& BWT) further in the future Graphs? Why "linear" pangenomes?
  17. Burrows Wheeler Transform T BWT(T) a b a a b

    a $ a b b a $ a a Burrows Wheeler Transform (BWT) reorders T's letters according to alphabetical order of their right contexts in T (e.g. genome) Ferragina P, and Manzini M. Opportunistic data structures with applications. Proceedings of 41st FOCS. IEEE, 2000.
  18. Burrows Wheeler Transform T BWT(T) a b a a b

    a $ a b b a $ a a (e.g. genome) For playing a game For finding a missing card
  19. FM Index $ a b a a b a a

    $ a b a a b a a b a $ a b a b a $ a b a a b a a b a $ b a $ a b a a b a a b a $ a T All rotations Sort BWT(T) Last column Burrows-Wheeler Matrix a b a a b a $ a b b a $ a a FM index behind Bowtie & BWA consists of Burrows- Wheeler Transform (BWT), plus auxiliary structures (e.g. genome) Ferragina P, and Manzini M. Opportunistic data structures with applications. Proceedings of 41st FOCS. IEEE, 2000.
  20. FM Index matching $ a b a a b a0

    a0 $ a b a a b0 a1 a b a $ a b1 a2 b a $ a b a1 a3 b a a b a $ b0 a $ a b a a2 b1 a a b a $ a3 F L P = aba aba L $ a b a a b a0 a0 $ a b a a b0 a1 a b a $ a b1 a2 b a $ a b a1 a3 b a a b a $ b0 a $ a b a a2 b1 a a b a $ a3 F P = aba $ a b a a b a0 a0 $ a b a a b0 a1 a b a $ a b1 a2 b a $ a b a1 a3 b a a b a $ b0 a $ a b a a2 b1 a a b a $ a3 F L P = aba T = abaaba M atch M atch In successive steps, Match queries find rows having suffixes of P as a prefix
  21. P = aba aba P = aba P = aba

    FM Index matching $ a b a a b a0 a0 $ a b a a b0 a1 a b a $ a b1 a2 b a $ a b a1 a3 b a a b a $ b0 a $ a b a a2 b1 a a b a $ a3 F L L $ a b a a b a0 a0 $ a b a a b0 a1 a b a $ a b1 a2 b a $ a b a1 a3 b a a b a $ b0 a $ a b a a2 b1 a a b a $ a3 F In successive steps, Match queries find rows having suffixes of P as a prefix $ a b a a b a0 a0 $ a b a a b0 a1 a b a $ a b1 a2 b a $ a b a1 a3 b a a b a $ b0 a $ a b a a2 b1 a a b a $ a3 F L T = abaaba M atch M atch Locate Locate Locate queries find where those matches occurred in T
  22. P = aba aba P = aba P = aba

    FM Index matching $ a b a a b a0 a0 $ a b a a b0 a1 a b a $ a b1 a2 b a $ a b a1 a3 b a a b a $ b0 a $ a b a a2 b1 a a b a $ a3 F L L $ a b a a b a0 a0 $ a b a a b0 a1 a b a $ a b1 a2 b a $ a b a1 a3 b a a b a $ b0 a $ a b a a2 b1 a a b a $ a3 F $ a b a a b a0 a0 $ a b a a b0 a1 a b a $ a b1 a2 b a $ a b a1 a3 b a a b a $ b0 a $ a b a a2 b1 a a b a $ a3 F L T = abaaba M atch M atch Locate Locate In successive steps, Match queries find rows having suffixes of P as a prefix Locate queries find where those matches occurred in T
  23. BWT: compression BWT gathers “like” characters (sharing right context) into

    runs T rectangular_rectangle_divided_into_rectangles$ BWT(T) sedrotttleeeei_lrrrdlnnnv_duggaaaita__$ecccngi E.g. ectangle tends to be preceded by r
  24. BWT gathers “like” characters (sharing right context) into runs E.g.

    ectangle tends to be preceded by r These rs come together in a BWT run T rectangular_rectangle_divided_into_rectangles$ BWT(T) sedrotttleeeei_lrrrdlnnnv_duggaaaita__$ecccngi BWT: compression
  25. BWT: compression row_row_row_your_boat row_row_row_your_boat row_row_row_your_boat$ trrrwwwwwwwwwooo___bbbyyyrrrrrrrrruuutt$______aaaoooooooooooo___ (t, 1), (r, 3),

    (w, 9), (o, 3), (_, 3), (b, 3), (y, 3), (r, 9), (u, 3), (t, 2), ($, 1), (_, 6), (a, 3), (o, 12), (_, 3) 15 runs, avg length = 4.27 BWT RLE
  26. BWT: compression row_rxw_row_your_boat row_row_row_zour_boat row_row_row_your_boaq$ qrrrwwwwwwwwwooo___bbbyzyrrrrrrrrauuutt__$____aaooooooxooooor___ (q, 1), (r, 3),

    (w, 9), (o, 3), (_, 3), (b, 3), (y, 1), (z, 1), (y, 1), (r, 8), (a, 1), (u, 3), (t, 2), (_, 2), ($, 1), (_, 4), (a, 2), (o, 6), (x, 1), (o, 5), (_, 3) BWT RLE 21 runs, avg length = 3.05
  27. # genomes 1 6,072 M 3,264 M 2 12,144 M

    3,282 M 3 18,217 M 3,386 M 4 24,408 M 3,423 M 5 30,480 M 3,436 M 6 36,671 M 3,449 M n r As we index more diploid genomes, (length) grows linearly while (# BWT runs) grows more slowly n r From 1000 Genomes project phase-3 callset Kuhnle A, Mun T, Boucher C, Gagie T, Langmead B, Manzini G. Efficient Construction of a Complete Index for Pan-Genomics Read Alignment. J Comput Biol. 2020 Apr;27(4):500-513. BWT: compression
  28. FM: Ferragina P, and Manzini M. Opportunistic data structures with

    applications. Proceedings of 41st FOCS. IEEE, 2000. Compressed indexing Match Locate FM Index (2000) Where = # characters, is # BWT runs n r O(n) O(n) Needed: indexes that grow with i.e. use space and handle our favorite queries r O(r)
  29. Compressed indexing Match Locate FM Index (2000) RLFM Index (2005)

    O(n) O(n) RLFM: Mäkinen V, and Navarro G. Succinct suffix arrays based on run-length encoding. Annual Symposium on CPM. Springer, Berlin, Heidelberg. 2005. pp45–56. O(r) O(n) Needed: indexes that grow with i.e. use space and handle our favorite queries r O(r) FM: Ferragina P, and Manzini M. Opportunistic data structures with applications. Proceedings of 41st FOCS. IEEE, 2000. Where = # characters, is # BWT runs n r
  30. r-index: Gagie T, Navarro G, and Prezza P. Optimal-time text

    indexing in BWT-runs bounded space. Proceedings of 29th SODA, ACM-SIAM. 2018. pp1459—1477. RLFM: Mäkinen V, and Navarro G. Succinct suffix arrays based on run-length encoding. Annual Symposium on CPM. Springer, Berlin, Heidelberg. 2005. pp45–56. FM: Ferragina P, and Manzini M. Opportunistic data structures with applications. Proceedings of 41st FOCS. IEEE, 2000. Compressed indexing Match Locate FM Index (2000) RLFM Index (2005) r-index (2018) O(n) O(r) O(r) O(n) O(n) O(r) ✅ Needed: indexes that grow with i.e. use space and handle our favorite queries r O(r) Where = # characters, is # BWT runs n r
  31. r-index in practice With additional advances in construction (prefix-free parsing),

    we can index a human pangenome and compete with Bowtie Christina Boucher Travis Gagie Alan Kuhnle Giovanni Manzini Kuhnle A, Mun T, Boucher C, Gagie T, Langmead B, Manzini G. Efficient Construction of a Complete Index for Pan-Genomics Read Alignment. J Comput Biol. 2020 Apr;27(4):500-513. Taher Mun Boucher C, Gagie T, Kuhnle A, Langmead B, Manzini G, Mun T. Prefix-free parsing for building big BWTs. Algorithms Mol Biol. 2019 May 24;14:13.
  32. r-index in practice Kuhnle A, Mun T, Boucher C, Gagie

    T, Langmead B, Manzini G. Efficient Construction of a Complete Index for Pan-Genomics Read Alignment. J Comput Biol. 2020 Apr;27(4):500-513. (chr19s from 1000 Genomes Project)
  33. r-index in practice (chr19s from 1000 Genomes Project) Kuhnle A,

    Mun T, Boucher C, Gagie T, Langmead B, Manzini G. Efficient Construction of a Complete Index for Pan-Genomics Read Alignment. J Comput Biol. 2020 Apr;27(4):500-513.
  34. Outline Why the Burrows-Wheeler Transform is key to indexing pangenomes

    How r-index emerged What r-index can already do How to push r-index (& BWT) further in the future Graphs? Why "linear" pangenomes?
  35. MONI & SPUMONI Full-text pangenome indexes hold particular promise for

    classification, including in real time (adaptive sequencing) How best to do this with r-index? 1 2 3 4 5 2 3 4 5 3 1 2 3 0 1 0 0 1 0 0
  36. T : a b r a c a d a

    b r a d a d P : c a r a d a d a b r d M S : 2 1 5 4 3 5 4 3 2 1 1 MONI Max Rossi MONI extends r-index to compute matching statistics (& MEMs) = length of the longest prefix of the th suffix that occurs in T MS[i] i 6 5 4 3 2 1 Rossi M, Oliva M, Langmead B, Gagie T, Boucher C. MONI: A Pangenomic Index for Finding Maximal Exact Matches. J Comput Biol. 2022 Feb;29(2):169-187.
  37. SPUMONI SPUMONI achieves a much smaller index and faster query

    speed compared to MONI by 1 2 3 4 5 2 3 4 5 3 1 2 3 0 1 0 0 1 0 0 Giving up on full-fidelity matching statistics... ...relying using Pseudo Matching Lengths or PMLs, which tend to smooth out lower peaks and preserve larger ones Ahmed O, Rossi M, Kovaka S, Schatz MC, Gagie T, Boucher C, Langmead B. Pan-genomic matching statistics for targeted nanopore sequencing. iScience. 2021 Jun 8;24(6):102696.
  38. SPUMONI Omar Ahmed Max Rossi Ahmed O, Rossi M, Kovaka

    S, Schatz MC, Gagie T, Boucher C, Langmead B. Pan-genomic matching statistics for targeted nanopore sequencing. iScience. 2021 Jun 8;24(6):102696.
  39. SPUMONI To classify, ask: is the distribution of matching statistics

    significantly different than for a "null" read? Human read + human index = can reject null Human read + bacterial index = cannot reject null Ahmed O, Rossi M, Kovaka S, Schatz MC, Gagie T, Boucher C, Langmead B. Pan-genomic matching statistics for targeted nanopore sequencing. iScience. 2021 Jun 8;24(6):102696.
  40. SPUMONI for nanopore adaptive sampling 4.2X lower memory footprint 16.3X

    smaller index 11.9X faster Comparable accuracy Compared to minimap2 classifier: (Total of 3,537 bacterial genomes)
  41. Outline Why the Burrows-Wheeler Transform is key to indexing pangenomes

    How r-index emerged What r-index can already do How to push r-index (& BWT) further in the future Graphs? Why "linear" pangenomes?
  42. Spectrum 1: Personalization Where practical, personalized references will always be

    the champions 5x coverage is mostly sufficient to make accurate personalized reference 99.00% 99.25% 99.50% 99.75% 98.50% 98.75% 99.00% 99.25% 99.50% 99.75% Precision Recall 0.9902 0.9921 0.9928 0.9943 0.9944 0.9945 0.9945 0.985 0.990 0.995 1.000 F1 score BWA−MEM Giraffe (linear) Giraffe (pangenome) Giraffe* (rgc1) Giraffe* (rgc5) Giraffe* (bbgc5) Giraffe* (bbbc5) Vaddadi NSK, Mun T, Langmead B, Minimizing Reference Bias with an Impute-First Approach. 2023. bioRxiv. doi:10.1101/2023.11.30.568362 Taher Mun Naga Sai Kavya Vaddadi Personalized refs outperform graph pangenomes up- and downstream
  43. Spectrum 2: Speeds r-index / SPUMONI Larger index, same accuracy

    Smaller index, less accuracy Lossy Lossless SPUMONI 2 MOVI
  44. Minimizer digestion + O(r) compression, -> speed, smaller indexes, (controllably)

    less accuracy Spectrum 2: SPUMONI 2 0.0 0.2 0.4 0.6 0.8 1.0 8 10 12 14 16 18 20 22 24 26 28 30 Window size (w) Relative Minimizer Index Size a 1.0 1.5 2.0 2.5 3.0 3.5 4.0 4.5 5.0 8 10 12 14 16 18 20 22 24 26 28 30 Window size (w) Speed−up Alphabet DNA Minimizer b Omar Ahmed Ahmed OY, Rossi M, Gagie T, Boucher C, Langmead B. SPUMONI 2: improved classification using a pangenome index of minimizer digests. Genome Biology. 2023 May 18;24(1):122. 4.2x 10x lower memory footprint → 16x 68x smaller index → 12X 15x faster → Compared to minimap2 classifier
  45. 0 2000 4000 6000 Movi SPUMONI PML per base (ns)

    Spectrum 2: MOVI & the Move structure Travis Gagie Omar Ahmed Mohsen Zakeri Nate Brown Move structure: like r-index but radically rearranged to achieve locality of reference, giving: fewer cache misses, and... ...much greater speed Zakeri M, Brown N, Amhed O, Gagie T, Langmead B. Movi: a fast and cache-efficient full-text pangenome index. 2023. bioRxiv. doi:10.1101/2023.11.04.565615 1.61 23.3 0 5 10 15 20 Movi SPUMONI index size (GB) cache misses per base Movi SPUMONI Movi SPUMONI ns per base
  46. Start Do the characters on the read and the run

    match? reposition reposition up or down based on the thresholds medium cost LF update run index and offset fast- forward need to 
 fast-foward? high cost yes low cost increment the run index no no yes Spectrum 2: MOVI & the Move structure Zakeri M, Brown N, Amhed O, Gagie T, Langmead B. Movi: a fast and cache-efficient full-text pangenome index. 2023. bioRxiv. doi:10.1101/2023.11.04.565615 A memory prefetch at this high-cost memory access, together with concurrent processing of many reads, improves speed by 2--3-fold
  47. Spectrum 2: MOVI & the Move structure Zakeri M, Brown

    N, Amhed O, Gagie T, Langmead B. Movi: a fast and cache-efficient full-text pangenome index. 2023. bioRxiv. doi:10.1101/2023.11.04.565615 • More in this talk if you are interested: https://youtu.be/t7luD8Wnk7w?si=Gb9EVfCjznjiUeSc
  48. Spectrum 3: Graphs Strings ↔ Wheeler Graphs are (roughly) "those

    graphs that are amenable to indexing" http://dx.doi.org/10.1016/j.tcs.2017.06.016 Strings Tries De Bruin Graphs etc
  49. Strings Biological graphs Virtue of graph pangenomes is their biologically

    driven & complete way of "collapsing" the sequences Creates a coordinate system & saves effort later Garrison, E., Sirén, J., Novak, A. M., Hickey, G., Eizenga, J. M., Dawson, E.T., … Durbin, R. (2018). Variation graph toolkit improves read mapping by representing genetic variation in the reference. Nature biotechnology, 36(9), 875–879. Spectrum 3: Graphs Strings ↔
  50. Strings Biological graphs Wheeler ✅ Strings with tunneling Collapsed ✅

    Baier, Uwe, and Kadir Dede. "BWT Tunnel Planning is hard but manageable." 2019 Data Compression Conference (DCC). IEEE, 2019. In future, the virtues of collapsing & guaranteed Wheelerness could be enjoyed without multiple alignment or De Bruijn graphs Spectrum 3: Graphs Strings ↔
  51. Conclusions Used with multiple references, linear aligners can avoid bias

    like graph aligners Assemblies are coming; but human genomes are structurally varied & resist "common coordinates" Graphs are part of the answer, but we also need methods that let genomes be linear
  52. https://bit.ly/yt_index • Pan-genomic theory & practice are growing together •

    Formalisms like Wheeler Graph & r-index are maturing at the right time, but could use more attention • Technical videos available in my "Indexing" playlist (link below) and all over my YouTube channel Conclusions
  53. Thank you! Thanks to the team: DBI-2029552 Christina Boucher Travis

    Gagie Alan Kuhnle Giovanni Manzini Jacob Pritt Nae-Chyun Chen Taher Mun Brad Solomon Sheila Iyer Omar Ahmed Max Rossi Sam Kovaka Mike Schatz Marco Oliva Mohsen Zakeri Nate Brown R01HG011392 R35GM139602 Naga Sai Kavya Vaddadi NIH: NSF: Vikram Shivakuma