Tensor decomposition based unsupervised feature extraction with optimized standard deviation applied to differentially expressed genes, DNA methylation and histone modification

Function COSI ISMB2022
Tensor decomposition based unsupervised feature extraction with optimized standard deviation applied to differentially expressed genes, DNA methylation and histone modification
Y-h. Taguchi, Ryo Ishibashi
Department of Physics, Chuo University, Tokyo, Japan

Basic claims
1. Our method, applied to the identification of differentially expressed genes (DEGs), can outperform various state-of-the-art methods when the standard deviations (SDs) used to generate the null hypothesis are optimized.
2. The method is also applicable to the identification of differentially methylated cytosines (DMCs) and of differential histone modification, without any modification specific to those data types.

The contents were published in the following three preprints on bioRxiv.
• https://doi.org/10.1101/2022.02.18.481115
• https://doi.org/10.1101/2022.04.02.486807
• https://doi.org/10.1101/2022.04.29.490081

Outline
⚫Background
  ◆P >> N problem (large P, small N problem)
  ◆Why do we choose PCA and TD?
⚫Methods
  ◆What is TD?
  ◆Analysis procedure
⚫Results
  ◆Benchmark data set for DEG
  ◆DNA methylation
  ◆Histone modification
⚫Conclusion
For detailed information, refer to the book available from https://link.springer.com/book/10.1007/978-3-030-22456-1

P >> N problem (large P, small N problem)
Large P: typically, the number of variables is large. Ex.) genes, epigenomes
Small N: the number of samples is small. Ex.) subjects, experimental animals, or cultured cells
→ Difficult to handle computationally.

Why do we choose PCA and TD?
Merits
⚫Highly versatile
⚫Easy interpretation of results
⚫Easy to implement programs
Effective for complicated life science data

What is tensor decomposition?
HOSVD (Higher Order Singular Value Decomposition)
\[ x_{ijk} \simeq \sum_{l_1=1}^{L_1} \sum_{l_2=1}^{L_2} \sum_{l_3=1}^{L_3} G(l_1 l_2 l_3) \, u_{l_1 i} \, u_{l_2 j} \, u_{l_3 k} \]
Example
x_{ijk}: gene expression
N: number of genes (i)
M: number of samples (j)
K: number of tissues (k)
[Figure: the N × M × K tensor x_{ijk} decomposed into the core tensor G (L_1 × L_2 × L_3) and the vectors u_{l_1 i}, u_{l_2 j}, u_{l_3 k}]
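The decomposition above can be sketched with NumPy as follows. This is a minimal illustration on random data, not the authors' implementation; the function names `unfold` and `hosvd` and the tensor sizes are assumptions made for the example. Each factor matrix is obtained from the SVD of one unfolding of the tensor, and the core tensor G is the projection of x onto those factors.

```python
import numpy as np

def unfold(x, mode):
    """Mode-n unfolding: bring axis `mode` to the front and flatten the rest."""
    return np.moveaxis(x, mode, 0).reshape(x.shape[mode], -1)

def hosvd(x):
    """Return the core tensor G and one factor matrix per mode.

    Column l of factors[0] holds u_{l i} over genes i, and likewise
    for samples (factors[1]) and tissues (factors[2]).
    """
    factors = []
    for mode in range(x.ndim):
        u, _, _ = np.linalg.svd(unfold(x, mode), full_matrices=False)
        factors.append(u)
    core = x
    for mode, u in enumerate(factors):
        # Project axis `mode` of the tensor onto the columns of u.
        core = np.moveaxis(np.tensordot(u.T, core, axes=(1, mode)), 0, mode)
    return core, factors

# Synthetic tensor: N = 50 genes, M = 6 samples, K = 3 tissues (hypothetical sizes).
rng = np.random.default_rng(0)
x = rng.normal(size=(50, 6, 3))
core, (u_gene, u_sample, u_tissue) = hosvd(x)

# Rebuild x_{ijk} from G and the three sets of vectors.
x_rec = np.einsum("abc,ia,jb,kc->ijk", core, u_gene, u_sample, u_tissue)
```

Because no component is truncated here, the reconstruction is exact up to floating-point error; keeping only some l_1, l_2, l_3 would give the approximation written with ≃ on the slide.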

Interpretation of TD (1/2)
j: samples. For some specific l_2, u_{l_2 j} distinguishes healthy controls from patients.
k: tissues. For some specific l_3, u_{l_3 k} shows tissue-specific expression.

Interpretation of TD (2/2)
i: genes. With l_2 and l_3 fixed, take u_{l_1 i} for the specific l_1 with max |G(l_1 l_2 l_3)|.
tDEG: tissue-specific differentially expressed genes
If G(l_1 l_2 l_3) > 0, the two tails of u_{l_1 i} correspond to
tDEG: Healthy controls < Patients
tDEG: Healthy controls > Patients
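The choice of l_1 described above amounts to an argmax over one column of the core tensor. The sketch below uses a small hand-made core tensor; the function name `pick_gene_vector` and the numbers are hypothetical and serve only to show the indexing.

```python
import numpy as np

def pick_gene_vector(core, l2, l3):
    """Return the l1 with the largest |G(l1, l2, l3)| and the signed value of G there."""
    column = core[:, l2, l3]
    l1 = int(np.argmax(np.abs(column)))
    return l1, float(column[l1])

# Toy core tensor (hypothetical values): 4 gene components, 3 sample, 2 tissue.
core = np.zeros((4, 3, 2))
core[0, 1, 0] = 1.0
core[2, 1, 0] = -5.0

l1, g_value = pick_gene_vector(core, l2=1, l3=0)
```

The sign of `g_value` is returned as well because, as the slide notes, the direction of the difference read from u_{l_1 i} depends on whether G(l_1 l_2 l_3) is positive.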

Analysis Procedure (1/2)
PCA: Matrix (Gene × Sample) → Gene vectors, Sample vectors
TD: Tensor (Gene × Sample × Tissue) → Gene vectors, Sample vectors, Tissue vectors

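Analysis Procedure (2/2)
Gene selection: a Gaussian distribution is used as the null hypothesis.
Novelty: the σ used to generate the null hypothesis is optimized by minimizing
\[ \sigma_h = \sqrt{ \frac{1}{N_n} \sum_{n < n_0} \left( h_n - \langle h_n \rangle \right)^2 } \]
where h_n is the count in the nth bin of the histogram of 1-P, \langle h_n \rangle is its mean over the bins n < n_0, and N_n is the number of bins with n < n_0.
[Figure: histogram of 1-P on the range 0 to 1, with the DEGs marked]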
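A minimal sketch of this optimization on synthetic data is given below. It is an illustration under stated assumptions, not the authors' code: P-values come from the upper tail of a χ2 distribution with one degree of freedom (written with `math.erfc`), multiple testing is handled with a Benjamini-Hochberg adjustment, n_0 is taken as the bin containing the least significant selected gene, and σ is chosen by a simple grid search. The function names, the bin count, and the grid are all hypothetical.

```python
import math
import numpy as np

def p_values(u, sigma):
    """Upper-tail P of chi-squared (1 d.o.f.) at (u/sigma)^2, i.e. erfc(|u| / (sigma*sqrt(2)))."""
    scale = sigma * math.sqrt(2.0)
    return np.array([math.erfc(abs(v) / scale) for v in u])

def bh_adjust(p):
    """Benjamini-Hochberg adjusted P-values (assumed correction method)."""
    n = len(p)
    order = np.argsort(p)
    ranked = p[order] * n / np.arange(1, n + 1)
    ranked = np.minimum.accumulate(ranked[::-1])[::-1]  # enforce monotonicity
    adjusted = np.empty(n)
    adjusted[order] = np.minimum(ranked, 1.0)
    return adjusted

def sigma_h(u, sigma, n_bins=100, threshold=0.1):
    """SD of the histogram counts h_n of 1-P over the bins n < n0."""
    p = p_values(u, sigma)
    h, edges = np.histogram(1.0 - p, bins=n_bins, range=(0.0, 1.0))
    selected = bh_adjust(p) < threshold
    n0 = n_bins
    if selected.any():
        # Bin holding the least significant selected gene; bins from n0 on are excluded.
        cut = 1.0 - p[selected].max()
        n0 = int(np.searchsorted(edges, cut, side="right")) - 1
        n0 = min(max(n0, 1), n_bins - 1)
    return float(np.std(h[:n0]))

def optimize_sigma(u, sigmas):
    """Grid search for the sigma that makes the histogram of 1-P flattest."""
    scores = [sigma_h(u, s) for s in sigmas]
    return float(sigmas[int(np.argmin(scores))])

# Synthetic gene vector: 20000 null genes with SD 1 plus 200 strong outliers.
rng = np.random.default_rng(0)
u = np.concatenate([rng.normal(0.0, 1.0, size=20000),
                    rng.choice([-8.0, 8.0], size=200)])

best_sigma = optimize_sigma(u, np.linspace(0.5, 2.0, 16))
naive_sigma = float(np.std(u))  # inflated by the outliers
selected = bh_adjust(p_values(u, best_sigma)) < 0.1
```

In this toy setting the plain SD of u is inflated by the outliers, whereas the σ that flattens the histogram stays close to the SD of the null genes, so the outliers receive smaller P-values.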

MAQC (benchmark data set for DEG *)
RNA-seq: x_{ij} represents the expression of the ith gene in the jth sample
Samples: seven Universal Human Reference RNA (UHRR) vs seven Human Brain Reference RNA (HBRR)
Measured for 40933 genes (done by the presenter)
(*) https://www.fda.gov/science-research/bioinformatics-tools/microarraysequencing-quality-control-maqcseqc

MA plot
[Figure: MA plot of the gene-wise log FC ratio between two classes against the gene-wise mean of log x_{ij}, with density distributions]
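The two quantities in the MA plot can be computed as in the sketch below. The function name `ma_values`, the base-2 logarithm, and the pseudocount of 1 are assumptions made for the example; the slide does not state which log transform was used.

```python
import numpy as np

def ma_values(x, group_a, group_b, pseudo=1.0):
    """Gene-wise mean log expression (A) and log fold change between two classes (M)."""
    logx = np.log2(x + pseudo)  # pseudocount avoids log(0); an assumption
    mean_a = logx[:, group_a].mean(axis=1)
    mean_b = logx[:, group_b].mean(axis=1)
    a = 0.5 * (mean_a + mean_b)
    m = mean_b - mean_a
    return a, m

# Toy matrix: 2 genes x 4 samples, first two samples in class A, last two in class B.
x = np.array([[1.0, 1.0, 3.0, 3.0],
              [7.0, 7.0, 7.0, 7.0]])
a, m = ma_values(x, group_a=[0, 1], group_b=[2, 3])
```

For the first gene the log2 values are 1 and 2 in the two classes, giving a mean of 1.5 and a log fold change of 1; the second gene is unchanged between classes.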

PCA: sample-wise principal components v_{1j} and v_{2j}
v_{1j}: corresponds to the mean of log x_{ij}
v_{2j}: corresponds to the log FC ratio
Together they correspond to the MA plot.
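The correspondence between the principal components and the MA plot can be checked on synthetic data, as sketched below. The data are simulated log expression values for 7 + 7 samples, and the SVD is applied without centering; both choices are assumptions made for illustration and may differ from the preprocessing used in the talk.

```python
import numpy as np

rng = np.random.default_rng(1)
n_genes, n_per_class = 500, 7

# Simulated log expression: gene baseline + class effect for 50 genes + noise.
baseline = rng.uniform(2.0, 10.0, size=n_genes)
true_fc = np.zeros(n_genes)
true_fc[:50] = rng.choice([-2.0, 2.0], size=50)
labels = np.array([-1.0] * n_per_class + [1.0] * n_per_class)
logx = (baseline[:, None]
        + 0.5 * true_fc[:, None] * labels[None, :]
        + rng.normal(0.0, 0.1, size=(n_genes, 2 * n_per_class)))

# SVD of the gene x sample matrix: columns of u are u_{l i}, rows of vt are v_{l j}.
u, s, vt = np.linalg.svd(logx, full_matrices=False)

gene_mean = logx.mean(axis=1)
log_fc = logx[:, labels > 0].mean(axis=1) - logx[:, labels < 0].mean(axis=1)

corr_u1_mean = abs(np.corrcoef(u[:, 0], gene_mean)[0, 1])
corr_u2_fc = abs(np.corrcoef(u[:, 1], log_fc)[0, 1])
```

Absolute correlations are used because the sign of a singular vector is arbitrary. In this simulation v_{1j} is nearly constant across samples and v_{2j} separates the two classes, so u_{1i} tracks the gene-wise mean and u_{2i} tracks the log fold change.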

Gene-wise embedding by PCA
[Figure: u_{1i} vs u_{2i}, with density distributions]

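Null hypothesis: u_{2i} obeys a Gaussian distribution
P_i is computed with the cumulative χ2 distribution.
Histogram of 1-P_i: h_n is the count in the nth bin.
The optimal σ minimizes σ_h, the SD of h_n over the bins n < n_0, where adjusted P(n_0) = 0.1.
Select genes with adjusted P_i < 0.1.
[Figure: left and right panels showing the histogram of 1-P_i]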

“Highly expressed genes should be more likely selected”
DESeq2: empirical dispersion relation
PCA: naturally satisfied
[Figure: σ2 plotted against μ]

Biological validation: PCA vs DESeq2
Tissue specificity

PCA based unsupervised FE with optimized SD outperforms various state-of-the-art methods while assuming neither an empirical dispersion relation nor a negative binomial distribution.

Application to DMC identification (EH1072, sequencing)
[Figure: per chromosome, t test of the P-values attributed by PCA between DHS and non-DHS]

PCA and TD based unsupervised FE with optimized SD can be applied to the identification of DMCs without any specific modification.
• https://doi.org/10.1101/2022.04.02.486807

Application to differential histone modification
Histograms of 1-P_i do not obey Gaussian (double peak), but...
[Figure: histograms for H3K4me3, H3K27me3 and H3K27ac]

Comparisons with other methods (H3K9me3, GSE24850)
The number of histone modification experiments overlapped with selected genes

PCA and TD based unsupervised FE with optimized SD can be applied to the identification of differential histone modification without any specific modification.
• https://doi.org/10.1101/2022.04.29.490081

Conclusions
1. Principal component analysis (PCA) based and tensor decomposition (TD) based unsupervised feature extraction (FE), applied to the identification of differentially expressed genes (DEGs), can outperform various state-of-the-art methods including DESeq2 when the standard deviations (SDs) used to generate the null hypothesis (Gaussian distribution of principal components) are optimized.
2. They are also applicable to the identification of differentially methylated cytosines (DMCs) and of differential histone modification, without any modification specific to those data types.
