Pan-genomic methods for fighting reference bias

Ben Langmead Associate Professor, JHU Computer Science [email protected], langmead-lab.org, @BenLangmead
Penn State CSE Colloquium, 3/19/2024 Pan-genomic methods for fighting reference bias

Human genome Image: Russ London, https://commons.wikimedia.org/wiki/File:Wellcome_genome_bookcase.png

More variation 1000 Genomes Project Consortium, Auton, A., Brooks, L.
D., Durbin, R. M., Garrison, E. P., Kang, H. M., … Abecasis, G. R. (2015). A global reference for human genetic variation. Nature, 526(7571), 68–74. AFR EAS AMR EUR SAS

More variation

Reference bias Tendency to miss or misalign reads containing non-reference
alleles REF ALT No bias Bias against Ref:

Reference bias REF ALT Gene 1 (bias PAT) → Gene
2 (strong bias MAT) → MAT PAT Confounder in allele-specific analyses

Reference bias Degner, J. F., Marioni, J. C., Pai, A.
A., Pickrell, J. K., Nkadori, E., Gilad, Y., & Pritchard, J. K. (2009). Effect of read-mapping biases on detecting allele-specific expression from RNA-sequencing data. Bioinformatics, 25(24), 3207–3212 Confounder in allele-specific analyses

Reference bias Pritt, J., Chen, N. C., Langmead, B. (2018).
FORGe: prioritizing variants for graph genomes. Genome biology, 19(1), 220. (Poor coverage in MHC region) Confounder in hypervariable regions

Reference bias Wulfridge, P., Langmead, B., Feinberg, A. P., &
Hansen, K. D. (2019). Analyzing whole genome bisulfite sequencing data from highly divergent genotypes. Nucleic acids research, 47(19), e117. Confounder when comparing inbred strains Using BL6 reference Using CAST reference

Why attack reference bias? To avoid a world where diagnostics
& therapeutics are differentially effective by population "...without a more representative reference genome, genetic medicine will never reach some ethnic groups, warns genome scientist Alicia Martin of Mass. General."

Outline Why the Burrows-Wheeler Transform is key to indexing pangenomes
How r-index emerged What r-index can already do How to push r-index (& BWT) further in the future Graphs? Why "linear" pangenomes?

A pangenome graph "Variation graphs... which encode the genetic variation
within a population as a graph, have been proposed as a solution to the reference bias [problem]." Quote: Sirén, Jouni. "Indexing variation graphs." In 2017 Proceedings of the nineteenth workshop on algorithm engineering and experiments (ALENEX), pp. 13-27. Society for Industrial and Applied Mathematics, 2017. Image: Garrison, E., Sirén, J., Novak, A. M., Hickey, G., Eizenga, J. M., Dawson, E.T., … Durbin, R. (2018). Variation graph toolkit improves read mapping by representing genetic variation in the reference. Nature biotechnology, 36(9), 875–879.

Graph & linear pangenomes Pritt J, Chen NC, Langmead B.
FORGe: prioritizing variants for graph genomes. Genome Biol. 2018 Dec 17;19(1):220 Chen NC, Solomon B, Mun T, Iyer S, Langmead B. Reference flow: reducing reference bias using multiple population genomes. Genome Biol. 2021 Jan 4;22(1):8.

Graph & linear pangenomes Pritt J, Chen NC, Langmead B.
FORGe: prioritizing variants for graph genomes. Genome Biol. 2018 Dec 17;19(1):220

Graph & linear pangenomes • Peak accuracy is at ~10%
of variants, or ≥5% allele freq • Pangenome accuracy approaches but does not reach that of personalized Pritt J, Chen NC, Langmead B. FORGe: prioritizing variants for graph genomes. Genome Biol. 2018 Dec 17;19(1):220

Graph & linear pangenomes Chen NC, Solomon B, Mun T,
Iyer S, Langmead B. Reference flow: reducing reference bias using multiple population genomes. Genome Biol. 2021 Jan 4;22(1):8.

Graph & linear pangenomes • Handful of population- specific linear
references competes favorably with graph-based pangenome • But misses rare alleles • How big can we go? 1 linear genome (no bias) 6 linear genomes Pangenome graph Personalized Chen NC, Solomon B, Mun T, Iyer S, Langmead B. Reference flow: reducing reference bias using multiple population genomes. Genome Biol. 2021 Jan 4;22(1):8.

Burrows Wheeler Transform T BWT(T) a b a a b
a $ a b b a $ a a Burrows Wheeler Transform (BWT) reorders T's letters according to alphabetical order of their right contexts in T (e.g. genome) Ferragina P, and Manzini M. Opportunistic data structures with applications. Proceedings of 41st FOCS. IEEE, 2000.

Burrows Wheeler Transform T BWT(T) a b a a b
a $ a b b a $ a a (e.g. genome) For playing a game For finding a missing card

FM Index $ a b a a b a a
$ a b a a b a a b a $ a b a b a $ a b a a b a a b a $ b a $ a b a a b a a b a $ a T All rotations Sort BWT(T) Last column Burrows-Wheeler Matrix a b a a b a $ a b b a $ a a FM index behind Bowtie & BWA consists of Burrows- Wheeler Transform (BWT), plus auxiliary structures (e.g. genome) Ferragina P, and Manzini M. Opportunistic data structures with applications. Proceedings of 41st FOCS. IEEE, 2000.

FM Index matching $ a b a a b a0
a0 $ a b a a b0 a1 a b a $ a b1 a2 b a $ a b a1 a3 b a a b a $ b0 a $ a b a a2 b1 a a b a $ a3 F L P = aba aba L $ a b a a b a0 a0 $ a b a a b0 a1 a b a $ a b1 a2 b a $ a b a1 a3 b a a b a $ b0 a $ a b a a2 b1 a a b a $ a3 F P = aba $ a b a a b a0 a0 $ a b a a b0 a1 a b a $ a b1 a2 b a $ a b a1 a3 b a a b a $ b0 a $ a b a a2 b1 a a b a $ a3 F L P = aba T = abaaba M atch M atch In successive steps, Match queries find rows having suffixes of P as a prefix

P = aba aba P = aba P = aba
FM Index matching $ a b a a b a0 a0 $ a b a a b0 a1 a b a $ a b1 a2 b a $ a b a1 a3 b a a b a $ b0 a $ a b a a2 b1 a a b a $ a3 F L L $ a b a a b a0 a0 $ a b a a b0 a1 a b a $ a b1 a2 b a $ a b a1 a3 b a a b a $ b0 a $ a b a a2 b1 a a b a $ a3 F In successive steps, Match queries find rows having suffixes of P as a prefix $ a b a a b a0 a0 $ a b a a b0 a1 a b a $ a b1 a2 b a $ a b a1 a3 b a a b a $ b0 a $ a b a a2 b1 a a b a $ a3 F L T = abaaba M atch M atch Locate Locate Locate queries find where those matches occurred in T

P = aba aba P = aba P = aba
FM Index matching $ a b a a b a0 a0 $ a b a a b0 a1 a b a $ a b1 a2 b a $ a b a1 a3 b a a b a $ b0 a $ a b a a2 b1 a a b a $ a3 F L L $ a b a a b a0 a0 $ a b a a b0 a1 a b a $ a b1 a2 b a $ a b a1 a3 b a a b a $ b0 a $ a b a a2 b1 a a b a $ a3 F $ a b a a b a0 a0 $ a b a a b0 a1 a b a $ a b1 a2 b a $ a b a1 a3 b a a b a $ b0 a $ a b a a2 b1 a a b a $ a3 F L T = abaaba M atch M atch Locate Locate In successive steps, Match queries find rows having suffixes of P as a prefix Locate queries find where those matches occurred in T

BWT: compression BWT gathers “like” characters (sharing right context) into
runs T rectangular_rectangle_divided_into_rectangles$ BWT(T) sedrotttleeeei_lrrrdlnnnv_duggaaaita__$ecccngi E.g. ectangle tends to be preceded by r

BWT gathers “like” characters (sharing right context) into runs E.g.
ectangle tends to be preceded by r These rs come together in a BWT run T rectangular_rectangle_divided_into_rectangles$ BWT(T) sedrotttleeeei_lrrrdlnnnv_duggaaaita__$ecccngi BWT: compression

BWT: compression row_row_row_your_boat row_row_row_your_boat row_row_row_your_boat$ BWT

BWT: compression row_row_row_your_boat row_row_row_your_boat row_row_row_your_boat$ trrrwwwwwwwwwooo___bbbyyyrrrrrrrrruuutt$______aaaoooooooooooo___ (t, 1), (r, 3),
(w, 9), (o, 3), (_, 3), (b, 3), (y, 3), (r, 9), (u, 3), (t, 2), ($, 1), (_, 6), (a, 3), (o, 12), (_, 3) 15 runs, avg length = 4.27 BWT RLE

BWT: compression row_rxw_row_your_boat row_row_row_zour_boat row_row_row_your_boaq$ + variation

BWT: compression row_rxw_row_your_boat row_row_row_zour_boat row_row_row_your_boaq$ qrrrwwwwwwwwwooo___bbbyzyrrrrrrrrauuutt__$____aaooooooxooooor___ (q, 1), (r, 3),
(w, 9), (o, 3), (_, 3), (b, 3), (y, 1), (z, 1), (y, 1), (r, 8), (a, 1), (u, 3), (t, 2), (_, 2), ($, 1), (_, 4), (a, 2), (o, 6), (x, 1), (o, 5), (_, 3) BWT RLE 21 runs, avg length = 3.05

BWT: compression row_rxw_row_your_boat row_row_row_zour_boat row_row_row_your_boaq$ $ t a c a
::

# genomes 1 6,072 M 3,264 M 2 12,144 M
3,282 M 3 18,217 M 3,386 M 4 24,408 M 3,423 M 5 30,480 M 3,436 M 6 36,671 M 3,449 M n r As we index more diploid genomes, (length) grows linearly while (# BWT runs) grows more slowly n r From 1000 Genomes project phase-3 callset Kuhnle A, Mun T, Boucher C, Gagie T, Langmead B, Manzini G. Efficient Construction of a Complete Index for Pan-Genomics Read Alignment. J Comput Biol. 2020 Apr;27(4):500-513. BWT: compression

FM: Ferragina P, and Manzini M. Opportunistic data structures with
applications. Proceedings of 41st FOCS. IEEE, 2000. Compressed indexing Match Locate FM Index (2000) Where = # characters, is # BWT runs n r O(n) O(n) Needed: indexes that grow with i.e. use space and handle our favorite queries r O(r)

Compressed indexing Match Locate FM Index (2000) RLFM Index (2005)
O(n) O(n) RLFM: Mäkinen V, and Navarro G. Succinct suffix arrays based on run-length encoding. Annual Symposium on CPM. Springer, Berlin, Heidelberg. 2005. pp45–56. O(r) O(n) Needed: indexes that grow with i.e. use space and handle our favorite queries r O(r) FM: Ferragina P, and Manzini M. Opportunistic data structures with applications. Proceedings of 41st FOCS. IEEE, 2000. Where = # characters, is # BWT runs n r

r-index: Gagie T, Navarro G, and Prezza P. Optimal-time text
indexing in BWT-runs bounded space. Proceedings of 29th SODA, ACM-SIAM. 2018. pp1459—1477. RLFM: Mäkinen V, and Navarro G. Succinct suffix arrays based on run-length encoding. Annual Symposium on CPM. Springer, Berlin, Heidelberg. 2005. pp45–56. FM: Ferragina P, and Manzini M. Opportunistic data structures with applications. Proceedings of 41st FOCS. IEEE, 2000. Compressed indexing Match Locate FM Index (2000) RLFM Index (2005) r-index (2018) O(n) O(r) O(r) O(n) O(n) O(r) ✅ Needed: indexes that grow with i.e. use space and handle our favorite queries r O(r) Where = # characters, is # BWT runs n r

r-index in practice With additional advances in construction (prefix-free parsing),
we can index a human pangenome and compete with Bowtie Christina Boucher Travis Gagie Alan Kuhnle Giovanni Manzini Kuhnle A, Mun T, Boucher C, Gagie T, Langmead B, Manzini G. Efficient Construction of a Complete Index for Pan-Genomics Read Alignment. J Comput Biol. 2020 Apr;27(4):500-513. Taher Mun Boucher C, Gagie T, Kuhnle A, Langmead B, Manzini G, Mun T. Prefix-free parsing for building big BWTs. Algorithms Mol Biol. 2019 May 24;14:13.

r-index in practice Kuhnle A, Mun T, Boucher C, Gagie
T, Langmead B, Manzini G. Efficient Construction of a Complete Index for Pan-Genomics Read Alignment. J Comput Biol. 2020 Apr;27(4):500-513. (chr19s from 1000 Genomes Project)

r-index in practice (chr19s from 1000 Genomes Project) Kuhnle A,
Mun T, Boucher C, Gagie T, Langmead B, Manzini G. Efficient Construction of a Complete Index for Pan-Genomics Read Alignment. J Comput Biol. 2020 Apr;27(4):500-513.

MONI & SPUMONI Full-text pangenome indexes hold particular promise for
classification, including in real time (adaptive sequencing) How best to do this with r-index? 1 2 3 4 5 2 3 4 5 3 1 2 3 0 1 0 0 1 0 0

T : a b r a c a d a
b r a d a d P : c a r a d a d a b r d M S : 2 1 5 4 3 5 4 3 2 1 1 MONI Max Rossi MONI extends r-index to compute matching statistics (& MEMs) = length of the longest prefix of the th suffix that occurs in T MS[i] i 6 5 4 3 2 1 Rossi M, Oliva M, Langmead B, Gagie T, Boucher C. MONI: A Pangenomic Index for Finding Maximal Exact Matches. J Comput Biol. 2022 Feb;29(2):169-187.

SPUMONI SPUMONI achieves a much smaller index and faster query
speed compared to MONI by 1 2 3 4 5 2 3 4 5 3 1 2 3 0 1 0 0 1 0 0 Giving up on full-fidelity matching statistics... ...relying using Pseudo Matching Lengths or PMLs, which tend to smooth out lower peaks and preserve larger ones Ahmed O, Rossi M, Kovaka S, Schatz MC, Gagie T, Boucher C, Langmead B. Pan-genomic matching statistics for targeted nanopore sequencing. iScience. 2021 Jun 8;24(6):102696.

SPUMONI Omar Ahmed Max Rossi Ahmed O, Rossi M, Kovaka
S, Schatz MC, Gagie T, Boucher C, Langmead B. Pan-genomic matching statistics for targeted nanopore sequencing. iScience. 2021 Jun 8;24(6):102696.

SPUMONI To classify, ask: is the distribution of matching statistics
significantly different than for a "null" read? Human read + human index = can reject null Human read + bacterial index = cannot reject null Ahmed O, Rossi M, Kovaka S, Schatz MC, Gagie T, Boucher C, Langmead B. Pan-genomic matching statistics for targeted nanopore sequencing. iScience. 2021 Jun 8;24(6):102696.

SPUMONI for nanopore adaptive sampling 4.2X lower memory footprint 16.3X
smaller index 11.9X faster Comparable accuracy Compared to minimap2 classifier: (Total of 3,537 bacterial genomes)

Spectrum 1: Personalization Linear Personalized Pangenome

Spectrum 1: Personalization Where practical, personalized references will always be
the champions 5x coverage is mostly sufficient to make accurate personalized reference 99.00% 99.25% 99.50% 99.75% 98.50% 98.75% 99.00% 99.25% 99.50% 99.75% Precision Recall 0.9902 0.9921 0.9928 0.9943 0.9944 0.9945 0.9945 0.985 0.990 0.995 1.000 F1 score BWA−MEM Giraffe (linear) Giraffe (pangenome) Giraffe* (rgc1) Giraffe* (rgc5) Giraffe* (bbgc5) Giraffe* (bbbc5) Vaddadi NSK, Mun T, Langmead B, Minimizing Reference Bias with an Impute-First Approach. 2023. bioRxiv. doi:10.1101/2023.11.30.568362 Taher Mun Naga Sai Kavya Vaddadi Personalized refs outperform graph pangenomes up- and downstream

Spectrum 2: Speeds r-index / SPUMONI Larger index, same accuracy
Smaller index, less accuracy Lossy Lossless SPUMONI 2 MOVI

Minimizer digestion + O(r) compression, -> speed, smaller indexes, (controllably)
less accuracy Spectrum 2: SPUMONI 2 0.0 0.2 0.4 0.6 0.8 1.0 8 10 12 14 16 18 20 22 24 26 28 30 Window size (w) Relative Minimizer Index Size a 1.0 1.5 2.0 2.5 3.0 3.5 4.0 4.5 5.0 8 10 12 14 16 18 20 22 24 26 28 30 Window size (w) Speed−up Alphabet DNA Minimizer b Omar Ahmed Ahmed OY, Rossi M, Gagie T, Boucher C, Langmead B. SPUMONI 2: improved classification using a pangenome index of minimizer digests. Genome Biology. 2023 May 18;24(1):122. 4.2x 10x lower memory footprint → 16x 68x smaller index → 12X 15x faster → Compared to minimap2 classifier

0 2000 4000 6000 Movi SPUMONI PML per base (ns)
Spectrum 2: MOVI & the Move structure Travis Gagie Omar Ahmed Mohsen Zakeri Nate Brown Move structure: like r-index but radically rearranged to achieve locality of reference, giving: fewer cache misses, and... ...much greater speed Zakeri M, Brown N, Amhed O, Gagie T, Langmead B. Movi: a fast and cache-efﬁcient full-text pangenome index. 2023. bioRxiv. doi:10.1101/2023.11.04.565615 1.61 23.3 0 5 10 15 20 Movi SPUMONI index size (GB) cache misses per base Movi SPUMONI Movi SPUMONI ns per base

Start Do the characters on the read and the run
match? reposition reposition up or down based on the thresholds medium cost LF update run index and oﬀset fast- forward need to   fast-foward? high cost yes low cost increment the run index no no yes Spectrum 2: MOVI & the Move structure Zakeri M, Brown N, Amhed O, Gagie T, Langmead B. Movi: a fast and cache-efﬁcient full-text pangenome index. 2023. bioRxiv. doi:10.1101/2023.11.04.565615 A memory prefetch at this high-cost memory access, together with concurrent processing of many reads, improves speed by 2--3-fold

Spectrum 2: MOVI & the Move structure Zakeri M, Brown
N, Amhed O, Gagie T, Langmead B. Movi: a fast and cache-efﬁcient full-text pangenome index. 2023. bioRxiv. doi:10.1101/2023.11.04.565615 • More in this talk if you are interested: https://youtu.be/t7luD8Wnk7w?si=Gb9EVfCjznjiUeSc

Spectrum 3: Graphs Strings ↔ Wheeler Graphs are (roughly) "those
graphs that are amenable to indexing" http://dx.doi.org/10.1016/j.tcs.2017.06.016 Strings Tries De Bruin Graphs etc

Strings Biological graphs Virtue of graph pangenomes is their biologically
driven & complete way of "collapsing" the sequences Creates a coordinate system & saves effort later Garrison, E., Sirén, J., Novak, A. M., Hickey, G., Eizenga, J. M., Dawson, E.T., … Durbin, R. (2018). Variation graph toolkit improves read mapping by representing genetic variation in the reference. Nature biotechnology, 36(9), 875–879. Spectrum 3: Graphs Strings ↔

Strings Biological graphs Wheeler ✅ Strings with tunneling Collapsed ✅
Baier, Uwe, and Kadir Dede. "BWT Tunnel Planning is hard but manageable." 2019 Data Compression Conference (DCC). IEEE, 2019. In future, the virtues of collapsing & guaranteed Wheelerness could be enjoyed without multiple alignment or De Bruijn graphs Spectrum 3: Graphs Strings ↔

Conclusions Used with multiple references, linear aligners can avoid bias
like graph aligners Assemblies are coming; but human genomes are structurally varied & resist "common coordinates" Graphs are part of the answer, but we also need methods that let genomes be linear

https://bit.ly/yt_index • Pan-genomic theory & practice are growing together •
Formalisms like Wheeler Graph & r-index are maturing at the right time, but could use more attention • Technical videos available in my "Indexing" playlist (link below) and all over my YouTube channel Conclusions

Thank you! Thanks to the team: DBI-2029552 Christina Boucher Travis
Gagie Alan Kuhnle Giovanni Manzini Jacob Pritt Nae-Chyun Chen Taher Mun Brad Solomon Sheila Iyer Omar Ahmed Max Rossi Sam Kovaka Mike Schatz Marco Oliva Mohsen Zakeri Nate Brown R01HG011392 R35GM139602 Naga Sai Kavya Vaddadi NIH: NSF: Vikram Shivakuma

Pan-genomic methods for fighting reference bias

Pan-genomic methods for fighting reference bias

More Decks by Ben Langmead

Other Decks in Research

Featured

Transcript