WCS-LA-2024

Benchmarking cell type deconvolution methods with human brain data
Leonardo Collado Torres, LIBD Investigator + Asst. Prof. Johns Hopkins Biostatistics
Single-cell genomics webinar LA speaker at WCS, Sept 26 2024
@lcolladotor • lcolladotor.github.io • lcolladotor.github.io/bioc_team_ds
Slides available at speakerdeck.com/lcolladotor

What defines me
• Bioinformatics
• R and Bioconductor
• Reproducibility and best practices
• Outreach and community building
• Back in 2005 at @LCGUNAM: I like math and coding; biology provides the challenging problems

History 2005-2009 Undergrad in Genomic Sciences 2009-2011 2011-2016 August 2016+
Data Science Division Leader PIs: • Jeff Leek: 2012+ • Andrew Jaffe: 2013+ Ph.D. Biostatistics Staff Scientist I → II → Research Scientist → Investigator Data Science Team I PIs: • Andrew Jaffe 2016-2020 • Myself 2020+ Division Leader: Keri Martinowich 2024+

2008+ • BioC 2008-2011, 2014, 2017, 2019-2023 • useR!2013, 2021
• rOpenSci unconf 2018 • RStudio::conf 2019-2021 @lcolladotor 2010+ @LIBDrstats 2018+ @CDSBMexico 2018+ Defunct: BmoreBiostats, Biostats Cultural Mixers Guest @RLadiesBmore #RLadiesMx Blog: http://lcolladotor.github.io 2011+ FB: 75k, Tw: 66k weekly Interests

doi.org/10.1016/j.biopsych.2020.06.005 Michael Gandal @mikejg84 Transcriptomic Insight Into the Polygenic Mechanisms
Underlying Psychiatric Disorders

Background: Cell Types in the Brain
• The brain is made of complex tissues consisting of different types of cells
• Some diagnoses are associated with changes in cell type specific expression
◦ Ex. Pitt-Hopkins syndrome and oligodendrocytes (Phan et al, Nature Neuroscience, 2020)
Louise Huuki-Myers @lahuuki
speakerdeck.com/lahuuki/benchmarking-deconvolution-methods-in-the-human-brain

How can we connect bulk RNA-seq to cell type information?
[Diagram labels: Tissue, Bulk RNA-seq, snRNA-seq, Deconvolution, Estimated proportions; cost labels: $$$, $, Free!]

What is Deconvolution?
A computational method that...
• Infers the composition of different cell types in bulk RNA-seq data
• Utilizes single cell data to obtain cell type gene expression profiles

Why is Deconvolution Important?
• Tissue is heterogeneous
◦ Different cell types express genes at different levels
• Samples can differ in cell type composition due to biology or dissection
◦ Check for differences in case vs. control
• Controlling for cell fractions between samples can make case vs. control analysis cleaner
◦ Quality control
◦ Cell composition is a confounding factor in differential expression analysis; accounting for it helps prevent false positives and false negatives

How do you run deconvolution?
deconvolution(Y, Z) = Proportion of Cell Types
• Y: gene expression of the bulk RNA-seq samples
• Z: gene expression of the scRNA-seq cell type populations
• deconvolution(): the computational algorithm
• Output: proportion of each cell type in each bulk sample
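The schematic deconvolution(Y, Z) can be read as finding non-negative cell type proportions p such that Y ≈ Z p. The following is a minimal Python sketch of that idea, using non-negative least squares solved by projected gradient descent. It is an illustration under stated assumptions, not the algorithm of any method discussed in this talk: the function name deconvolve_nnls and the toy numbers are hypothetical, and Python is used only to keep the example dependency-free (the tools named in these slides are R packages).

```python
def deconvolve_nnls(bulk, signatures, n_iter=2000):
    """Estimate cell type proportions p such that bulk ≈ signatures · p.

    bulk: list of G expression values for one bulk sample (Y).
    signatures: G x K list of rows, mean expression per cell type (Z).
    Assumption: plain non-negative least squares, for illustration only.
    """
    n_genes, n_types = len(signatures), len(signatures[0])
    # 1 / ||Z||_F^2 is a conservative step size that guarantees convergence.
    step = 1.0 / sum(z * z for row in signatures for z in row)
    weights = [1.0 / n_types] * n_types
    for _ in range(n_iter):
        residual = [
            sum(row[k] * weights[k] for k in range(n_types)) - y
            for row, y in zip(signatures, bulk)
        ]
        gradient = [
            sum(signatures[g][k] * residual[g] for g in range(n_genes))
            for k in range(n_types)
        ]
        # Gradient step, then project back onto the non-negative orthant.
        weights = [max(0.0, w - step * d) for w, d in zip(weights, gradient)]
    total = sum(weights)
    # Rescale so the estimated proportions sum to one.
    return [w / total for w in weights]


# Toy example (hypothetical): 3 genes, 2 cell types, true proportions 0.7 / 0.3.
Z = [[10.0, 1.0], [1.0, 10.0], [5.0, 5.0]]
Y = [7.3, 3.7, 5.0]
proportions = deconvolve_nnls(Y, Z)
```

On this noise-free toy input the estimate recovers the proportions used to build Y. Real bulk data are noisy and come from a different assay than the reference, which is part of what the benchmarked methods try to correct for.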

Which Method Should We Use?
• There are 20+ single cell reference based methods published
deconvolution(Y, Z) = Proportion of Cell Types

Which Method is the Most Accurate?
• Benchmarking shows that different methods perform best on different data sets (Cobos et al, Nature Communications, 2020)
• Benchmarking results from different papers on “real” data
◦ MuSiC paper: MuSiC > NNLS > BSEQ-sc > CIBERSORT
▪ Pancreatic Islet: Beta cells vs. HbA1c (Fig 2a)
◦ Bisque paper: Bisque > MuSiC > CIBERSORT
▪ DLPFC: Microglia vs. Braak stage, Neuron vs. Cognitive diagnostic category (Fig 4)
◦ Cobos et al. benchmark: DWLS > MuSiC > Bisque > deconvSeq
▪ Human PBMC flow sorted (Fig 7)
◦ Jin et al. benchmark: CIBERSORT, MuSiC > EPIC*, TIMER, DeconRNASeq
▪ Human Whole Blood, simulations
◦ Dai et al. benchmark: dtangle > Bisque > Other Methods
▪ Human brain IHC & scRNA-seq data

Goals of Deconvolution Benchmark
• Build multi-assay dataset with orthogonal cell type measurements
• Test top deconvolution methods that employ different strategies
• Assess impact of other factors in deconvolution
◦ Bulk RNA-seq data types
◦ snRNA-seq features
◦ Marker genes

How can we build on previous benchmarks?
Previous Strategies to Assess Accuracy
• Use pseudobulk samples
◦ Known or simulated composition
◦ May not reflect real bulk RNA-seq data
• Compare with Immunofluorescence Data
• Cell flow sorting
◦ Difficult to label nuclei by cell type
Our Strategy
• Use paired orthogonal imaging data to measure cell type proportions & evaluate method accuracy
• Focus on brain tissue

Orthogonal Data
• Alternative measurement of the same thing (cell type proportions)
◦ Multiple independent measurements build confidence
• “Gold standard”
◦ *All methods have biases

Multi-modal dataset From Human DLPFC

Spatial DLPFC Dataset
Kelsey Montgomery, Louise Huuki-Myers

Experimental Design

Single Nucleus RNA-seq References
Huuki-Myers et al, Science, 2024, 10.1126/science.adh1938
• 10 Donors (n=19)
• Seven broad cell types
• 56k nuclei

Bulk RNA-seq
• n = 110
• 6 library type + RNA Extraction combinations
[Figure axes: Library Type, RNA Extraction]

RNAScope/IF Experiment Design
• Measure the abundance of 6 broad cell types
• Filtered for high quality images
Kelsey Montgomery

RNAScope/IF Estimated Cell Type Proportions

RNAScope Cell Type Annotations Make Sense Spatially

RNAScope vs. snRNA-seq Proportions
Comparing Cell Type Proportions
• Pearson’s correlation (cor)
• Root Mean Squared Error (rmse)
• Relative rmse (rrmse)
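The three comparison metrics listed here can be sketched in a few lines of Python. This is a hedged illustration: the slide does not spell out how rrmse is normalized, so dividing rmse by the mean observed proportion is an assumption, and the function names and toy values are hypothetical.

```python
import math


def pearson_cor(estimated, observed):
    """Pearson correlation between estimated and observed proportions."""
    n = len(observed)
    mean_e = sum(estimated) / n
    mean_o = sum(observed) / n
    cov = sum((e - mean_e) * (o - mean_o) for e, o in zip(estimated, observed))
    var_e = sum((e - mean_e) ** 2 for e in estimated)
    var_o = sum((o - mean_o) ** 2 for o in observed)
    return cov / math.sqrt(var_e * var_o)


def rmse(estimated, observed):
    """Root mean squared error of the estimates."""
    n = len(observed)
    return math.sqrt(sum((e - o) ** 2 for e, o in zip(estimated, observed)) / n)


def rrmse(estimated, observed):
    """Relative rmse.

    Assumption: rmse divided by the mean observed proportion.
    """
    return rmse(estimated, observed) / (sum(observed) / len(observed))


# Toy values (hypothetical): estimates are compressed toward the middle.
observed = [0.1, 0.3, 0.5]
estimated = [0.2, 0.3, 0.4]
cor_value = pearson_cor(estimated, observed)
rmse_value = rmse(estimated, observed)
rrmse_value = rrmse(estimated, observed)
```

In this toy case the estimates track the observed values perfectly in rank and direction (cor close to 1) while still being off in magnitude (nonzero rmse), which illustrates why both a correlation and an error metric are useful.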

Experimental Design Connection to Benchmark
deconvolution(Y, Z) = Proportion of Cell Types
Six Methods
1. DWLS
2. Bisque
3. MuSiC
4. BayesPrism
5. hspe
6. CIBERSORTx

Deconvolution Benchmark
[Diagram labels: Method, Marker Genes, Dataset Features]

Evaluate Deconvolution Methods

Evaluate Deconvolution Methods
Method
1. What is the most accurate deconvolution method for brain tissue?
2. Is accuracy impacted by type of bulk RNA-seq?
a. Library type?
b. RNA extraction?

Run Deconvolution
deconvolution(Y, Z) = Proportion of Cell Types
• 110 bulk samples
• Paired snRNA-seq
• 7 cell types

Methods return a wide range of proportion estimates
[Figure: tissue block B2720_post]
Each Tissue Block has 6 Bulk RNA-seq samples

All 19 Tissue Blocks (110 bulk RNA-seq samples)

Bisque and hspe are Most Accurate Methods Compared to RNAScope/IF
Accurate Methods have:
• High Pearson’s correlation (cor)
• Low Root Mean Squared Error (rmse)

Bisque and hspe are Most Accurate Methods Compared to snRNA-seq

Library Type Impacts Method Performance Compared to RNAScope/IF

Evaluate Six Deconvolution Methods
Method
1. What is the most accurate deconvolution method for brain tissue? hspe & Bisque
2. Is accuracy impacted by type of bulk RNA-seq? Yes
a. Library type? Bisque more accurate in polyA, hspe in RiboZeroGold
b. RNA extraction? Some impact but inconsistent

Marker Genes

Select Effective Marker Genes
Marker Genes
1. Does selecting marker genes improve deconvolution?
2. How to best select good sets of marker genes?

Marker Gene Selection
• Filter for genes expressed in snRNA-seq and bulk data
• Looking for genes expressed in only one cell type
◦ Test for specificity of each gene for each cell type
• Observe expression of selected marker genes
◦ Heat maps of pseudobulked data
The Ideal Heatmap: snRNA-seq data, pseudobulked by cell type
Stephanie C Hicks
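Pseudobulking by cell type, as used for the heat maps on this slide, can be sketched as summing counts over all nuclei that share a cell type label. The Python sketch below assumes a plain sum of raw counts; the function name, gene names, and labels are hypothetical placeholders, and the actual analysis may aggregate or normalize differently.

```python
from collections import defaultdict


def pseudobulk(nuclei_counts, cell_types):
    """Sum per-nucleus gene counts within each cell type.

    nuclei_counts: list of dicts mapping gene -> count, one dict per nucleus.
    cell_types: list of cell type labels, parallel to nuclei_counts.
    Assumption: aggregation is a plain sum of raw counts.
    """
    totals = defaultdict(lambda: defaultdict(int))
    for counts, cell_type in zip(nuclei_counts, cell_types):
        for gene, count in counts.items():
            totals[cell_type][gene] += count
    # Convert nested defaultdicts to plain dicts for easier inspection.
    return {cell_type: dict(genes) for cell_type, genes in totals.items()}


# Toy input (hypothetical): three nuclei, two cell types, two genes.
nuclei = [
    {"geneA": 5, "geneB": 0},
    {"geneA": 3, "geneB": 1},
    {"geneA": 0, "geneB": 7},
]
labels = ["Oligo", "Oligo", "Astro"]
profiles = pseudobulk(nuclei, labels)
```

In an ideal heat map of such profiles, each marker gene would be high in exactly one cell type column, as geneA is for the first label and geneB for the second in this toy input.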

1 vs. All Marker Gene Selection
scran::findMarkers()

Mean Ratio Gene Selection
DeconvoBuddies::get_mean_ratio()

Mean Ratio selects a subset of genes with high 1vAll fold change
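The difference between the two selection strategies can be illustrated with a small Python sketch. It assumes that the mean ratio of a gene is the mean expression in the target cell type divided by the highest mean expression among the non-target cell types, and it stands in for the 1 vs. All comparison with a plain ratio against the average of all other cell types. Both functions and the toy values are hypothetical simplifications; they do not reproduce scran::findMarkers() or DeconvoBuddies::get_mean_ratio(), which operate on real expression data in R.

```python
def mean_ratio(mean_expr, target):
    """Target mean expression divided by the highest non-target mean.

    mean_expr: dict mapping cell type -> mean expression of one gene.
    Assumes at least one non-target cell type has nonzero expression.
    """
    highest_other = max(v for ct, v in mean_expr.items() if ct != target)
    return mean_expr[target] / highest_other


def one_vs_all_ratio(mean_expr, target):
    """Target mean expression divided by the average of non-target means.

    Simplified stand-in for a 1 vs. All fold change.
    """
    others = [v for ct, v in mean_expr.items() if ct != target]
    return mean_expr[target] / (sum(others) / len(others))


# Toy genes (hypothetical values and cell type labels).
specific_gene = {"Oligo": 10.0, "Astro": 1.0, "Micro": 1.0, "Excit": 1.0}
shared_gene = {"Oligo": 10.0, "Astro": 8.0, "Micro": 0.5, "Excit": 0.5}

specific_scores = (
    mean_ratio(specific_gene, "Oligo"),
    one_vs_all_ratio(specific_gene, "Oligo"),
)
shared_scores = (
    mean_ratio(shared_gene, "Oligo"),
    one_vs_all_ratio(shared_gene, "Oligo"),
)
```

The gene shared with a second cell type still looks enriched under the 1 vs. All style ratio, but its mean ratio is close to 1, so a mean ratio cutoff would drop it. This is consistent with the slide's point that Mean Ratio keeps a subset of the genes with high 1vAll fold change.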

Marker Gene Sets Tested
1. Full (17,804 genes)
a. Set of genes common between the bulk and snRNA-seq datasets
2. 1vALL top25 (145 genes)
a. Top 25 genes ranked by fold change for each cell type, then filtered to common genes
3. MeanRatio top25 (151 genes)
a. Top 25 genes ranked by MeanRatio for each cell type, then filtered to common genes
4. MeanRatio over2 (557 genes)
a. All genes for each cell type with MeanRatio > 2
5. MeanRatio MAD3 (520 genes)
a. All genes for each cell type with MeanRatio more than 3 median absolute deviations (MADs) above the median of all MeanRatios > 1
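The three MeanRatio-based gene sets differ only in the cutoff rule, which can be sketched in Python as follows. The input is a hypothetical dict of gene -> MeanRatio for one cell type. The MAD3 rule is read here as: keep genes above median + 3 × MAD, where the median and MAD are computed over the ratios greater than 1. Using an unscaled MAD is an assumption, since the slide does not say whether a scaling constant is applied.

```python
from statistics import median


def top_n(ratios, n=25):
    """Genes with the n highest ratios, best first."""
    return sorted(ratios, key=ratios.get, reverse=True)[:n]


def over_threshold(ratios, threshold=2.0):
    """Genes whose ratio is strictly above a fixed threshold."""
    return sorted(g for g, r in ratios.items() if r > threshold)


def mad3(ratios):
    """Genes above median + 3 * MAD, both computed over ratios > 1.

    Assumption: unscaled median absolute deviation.
    """
    above_one = [r for r in ratios.values() if r > 1]
    center = median(above_one)
    mad = median([abs(r - center) for r in above_one])
    cutoff = center + 3 * mad
    return sorted(g for g, r in ratios.items() if r > cutoff)


# Toy ratios for one cell type (hypothetical gene names and values).
toy_ratios = {"g1": 1.2, "g2": 1.5, "g3": 2.0, "g4": 9.0, "g5": 0.8}
top2_genes = top_n(toy_ratios, n=2)
over2_genes = over_threshold(toy_ratios, threshold=2.0)
mad3_genes = mad3(toy_ratios)
```

A fixed top-N rule returns the same number of genes for every cell type, whereas the threshold-based rules return as many or as few genes as pass the cutoff, which is why the over2 and MAD3 sets on the slide are larger than the top25 sets.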

Method Performance Varied Over Different Marker Gene Sets
[Figure annotations: method’s highest cor; lowest rmse; Mean Ratio top 25]

Method Performance Over Different Marker Gene Sets
Mean Ratio Top25 balances rmse and cor in top methods

Select Effective Marker Genes
Marker Genes
1. Does selecting marker genes improve deconvolution? Depends on the method
◦ hspe more sensitive than Bisque
2. How to best select good sets of marker genes? Mean Ratio top25
◦ Mean Ratio top25 balanced rmse and correlation in Bisque & hspe

Other Datasets & Challenges

Other Factors Can Impact Method Performance
Dataset Features
1. What features of snRNA-seq reference dataset can impact deconvolution accuracy?
a. Number of donors?
b. Donor diversity?
c. Existing proportion of cell types?

Features of Other DLPFC snRNA-seq Datasets
• Paired snRNA-seq
• Tran, Maynard et al., Neuron, 2021
• Mathys et al., Nature, 2019

Method Performance with different snRNA-seq Reference

Changing Cell Type Proportions
x 1000 Sub-samples
Nick Eagles

Changing Cell Type Proportions
Nick Eagles

Other Factors Can Impact Method Performance
Dataset Features
1. What features of snRNA-seq reference dataset can impact deconvolution accuracy?
a. Number of donors? Bisque performs poorly with <4 donors
b. Donor diversity? Bisque and hspe were unaffected by inclusion of AD cases
c. Existing proportion of cell types? Bisque is biased to snRNA-seq proportions

Conclusions

Benchmark Conclusions
Method: hspe & Bisque are top performing methods
• hspe better for RiboZeroGold
Marker Genes: Mean Ratio effectively selects cell type specific genes
• MR Top 25 improves performance of top methods
Dataset Features: Many factors impact deconvolution accuracy
• Bisque is sensitive to a low number of donors and to input cell proportions

How do our conclusions compare to other benchmarks?
Benchmark | Strategy | Tissue | Top Methods
Cobos et al. | Pseudobulk, Flow sorting | Blood, pancreas, kidney | DWLS
Jin et al. | Flow sorting | Blood | CIBERSORT, MuSiC
Dai et al. | Immunohistochemistry, scRNA-seq pseudobulk | Brain 🧠 | dtangle (hspe), Bisque
LIBD (new! ✅) | RNAScope/IF | Brain 🧠 | hspe, Bisque

Resources
• DeconvoBuddies R package
◦ R/Bioconductor package with tools for marker finding & plotting
◦ https://research.libd.org/DeconvoBuddies/
◦ Access paired dataset
▪ Bulk RNA-seq
▪ snRNA-seq data
▪ RNAScope Proportions
• Deconvolution code tutorial + video
◦ updated version at LIBD Rstats club on May 3rd

Benchmark Paper now in Pre-print! 🎉

Acknowledgements
Louise Huuki-Myers, Kristen Maynard, Stephanie C Hicks, Kelsey Montgomery, Sang Ho Kwon, Sean Maden, Nick Eagles, Sophia Cinquemani, Daianna Gonzalez-Padilla
NIMH Grant: R01 MH123183 & R01 MH111721
Thank you! Any Questions?
Download these slides: speakerdeck.com/lahuuki @lahuuki

Selected Six Deconvolution Methods
Method | Citation | Approach | Marker Gene Selection | Availability | Top Benchmark Performance
DWLS (Dampened weighted least-squares) | Tsoucas et al, Nature Comm, 2019 [5] | Weighted least squares | - | R package on CRAN | Cobos et al. [18]
Bisque | Jew et al, Nature Comm, 2020 [6] | Bias correction: Assay | - | R package on GitHub | Dai et al. [17]
MuSiC (Multi-subject Single-cell) | Wang et al, Nature Communications, 2019 [7] | Bias correction: Source | Weights Genes | R package on GitHub | Jin et al. [20]
BayesPrism | Chu et al., Nature Cancer, 2022 [8] | Bayesian | Pairwise t-test | Webtool, R package on GitHub | Hippen et al. [22]
hspe (dtangle) (hybrid-scale proportion estimation) | Hunt and Gagnon-Bartsch, Ann. Appl. Stat. 2021 [9, 45] | High collinearity adjustment | Multiple options, default “ratio”: 1vALL mean expression ratio | R package on GitHub | Dai et al. [17]
CIBERSORTx | Newman et al., Nat Biotech, 2019 [11] | Machine Learning | Differential gene expression | Webtool, Docker Image | Jin et al. [20]

Comparing Estimates
• Bisque and hspe predict similar proportions
◦ Cor = 0.938
• Bisque has highest cor with snRNA-seq
◦ Cor = 0.743

Evaluate by Library Type + RNA Extraction Combination

Method Predictions over 13 Brain Regions
GTEx v8 Brain dataset
Expected patterns
• Cerebellum contains more inhibitory neurons (Inhib)
• Caudate has an increased proportion of inhibitory neurons compared to frontal cortex

Considering Cell Size

Considering Cell Size
Nick Eagles

Dai et al. benchmark
• Top deconvolution methods: dtangle (hspe) and Bisque
• Cell type specific expression methods: bMIND
[Figures shown: Figure 2, Figure 3]
