Data source: NCES Digest of Education Statistics
Graduated from high school
Completed two REUs in Biostatistics
Graduated from LSU (BS, Mathematics); started a PhD in Statistics
help
Focus on DNA methylation data (but these challenges are common to other areas of genomics)
Morgan et al. (1999). Nature Genetics 23: 314-8
http://epigenome.eu/en/2,48,873
Bradbury (2003). PLoS Biology 1: e82
705-719
Problem: Which CpGs are differentially methylated between two groups?
Some proposed statistical solutions:
• At each CpG, test if there is a difference using e.g. t-test, F-test or linear regression
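The per-CpG testing idea can be sketched in a few lines. This is a minimal illustration, assuming each group is a list of samples and each sample is a list of methylation values in the same CpG order. The function names and the toy values are hypothetical and not from the slides. It computes only Welch's t statistic; a p-value would come from the t distribution (for example via `scipy.stats.ttest_ind`, not used here).

```python
from math import sqrt
from statistics import mean, variance


def welch_t(a, b):
    """Welch's t statistic for two lists of values (unequal variances allowed).

    Assumes at least two values per group and non-zero pooled standard error.
    """
    se = sqrt(variance(a) / len(a) + variance(b) / len(b))
    return (mean(a) - mean(b)) / se


def per_cpg_t(group1, group2):
    """Return one t statistic per CpG.

    group1, group2: lists of samples; each sample is a list of methylation
    values, with CpGs in the same order in every sample.
    """
    n_cpgs = len(group1[0])
    stats = []
    for j in range(n_cpgs):
        a = [sample[j] for sample in group1]
        b = [sample[j] for sample in group2]
        stats.append(welch_t(a, b))
    return stats


# Toy values (made up for illustration): CpG 0 differs between groups, CpG 1 does not.
group1 = [[0.1, 0.5], [0.2, 0.6], [0.3, 0.4]]
group2 = [[0.7, 0.5], [0.8, 0.4], [0.9, 0.6]]
t_stats = per_cpg_t(group1, group2)
```

Testing every CpG separately this way ignores any dependence between neighboring CpGs and produces one test per site, so the number of tests grows with the number of CpGs.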
methylated, would a nearby CpG also be methylated?
Some proposed solutions:
(1) Can we find two or more runs of differentially methylated CpGs?
• If p-value < 0.05 for CpG #1, #2, #3, etc.
• Caution: multiple testing
(2) Can we smooth across CpGs and find genomic regions that are differentially methylated?
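Approach (1) can be sketched as a scan for consecutive CpGs whose p-values fall below a threshold. This is only an illustration: the function name `find_runs`, the `min_length` parameter, and the p-values are assumptions, and the Bonferroni threshold at the end is one simple way to respond to the multiple-testing caution, not necessarily the one the speaker had in mind.

```python
def find_runs(p_values, alpha=0.05, min_length=2):
    """Return (start, end) index pairs of consecutive CpGs with p < alpha.

    Only runs containing at least `min_length` CpGs are reported.
    Indices are 0-based and inclusive.
    """
    runs = []
    start = None
    for i, p in enumerate(p_values):
        if p < alpha:
            if start is None:
                start = i  # a new run begins here
        else:
            if start is not None and i - start >= min_length:
                runs.append((start, i - 1))
            start = None
    # Close a run that extends to the last CpG.
    if start is not None and len(p_values) - start >= min_length:
        runs.append((start, len(p_values) - 1))
    return runs


# Made-up p-values for eight neighboring CpGs.
p_values = [0.01, 0.02, 0.5, 0.03, 0.04, 0.01, 0.9, 0.01]
runs_unadjusted = find_runs(p_values)

# Bonferroni: divide alpha by the number of tests before scanning.
runs_bonferroni = find_runs(p_values, alpha=0.05 / len(p_values))
```

With the unadjusted threshold the scan reports two runs; after the Bonferroni correction none of these toy p-values survive, which is the point of the caution.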
biases and unwanted technical variation
– e.g. sequencing technology, batch effects
– Can cause perceived differences between samples, irrespective of the biological variation
• Changes in experimental conditions can be confused with biological variability
– Can lead to false discoveries (e.g. finding DMRs)
• Originally developed for gene expression microarrays
• Now applied to
– Genotyping arrays, RNA-Sequencing, DNA methylation, ChIP-Sequencing & brain imaging
Can be very helpful in eliminating unwanted variation, e.g. “batch effects” (good), but has the potential to wash out true biological variation (bad)
transforma(on that replaces each intensity score with the mean of the features with the same rank from each array Raw data Order values within each sample (or column) Re-order averaged values in original order 2 4 4 5 5 14 4 7 4 8 6 9 3 8 5 8 3 9 3 5 2 4 3 5 3 8 4 5 3 8 4 7 4 9 5 8 5 14 6 9 3.5 3.5 5.0 5.0 8.5 8.5 5.5 5.5 6.5 5.0 8.5 8.5 5.0 5.5 6.5 6.5 5.5 6.5 3.5 3.5 3.5 3.5 3.5 3.5 5.0 5.0 5.0 5.0 5.5 5.5 5.5 5.5 6.5 6.5 6.5 6.5 8.5 8.5 8.5 8.5 Average across rows and substitute value with average
• R/Bioconductor package to test for the assumptions of quantile normalization
[Figure: density of log2 PM values for Brain (GSE17612, n=23), Brain (GSE21935, n=19), Liver (GSE29721, n=10), Liver (GSE14668, n=20) and Liver (GSE39841, n=10) samples]
Main idea:
• Compare variability within groups to variability between groups
• If variability between groups > variability within groups, then there may be global changes across groups
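The main idea can be illustrated with an ANOVA-style ratio of between-group to within-group variability, computed here on per-sample medians. This is a sketch of the comparison only, not the package's actual test statistic or API. The function name, the choice of the median as the per-sample summary, and the toy values are all assumptions made for illustration.

```python
from statistics import mean, median


def between_within_ratio(groups):
    """Ratio of between-group to within-group mean squares of sample medians.

    groups: dict mapping a group label to a list of samples,
            where each sample is a list of numeric values.
    Requires at least two groups and more samples than groups.
    """
    medians = {g: [median(s) for s in samples] for g, samples in groups.items()}
    all_medians = [m for ms in medians.values() for m in ms]
    grand_mean = mean(all_medians)
    k = len(medians)        # number of groups
    n = len(all_medians)    # total number of samples

    ss_between = sum(
        len(ms) * (mean(ms) - grand_mean) ** 2 for ms in medians.values()
    )
    ss_within = sum((m - mean(ms)) ** 2 for ms in medians.values() for m in ms)

    ms_between = ss_between / (k - 1)
    ms_within = ss_within / (n - k)
    if ms_within == 0:
        return float("inf")
    return ms_between / ms_within


# Toy data (made up): two groups whose samples sit at very different levels.
toy = {
    "A": [[1, 2, 3], [2, 3, 4]],
    "B": [[10, 11, 12], [11, 12, 13]],
}
ratio = between_within_ratio(toy)
```

A ratio far above 1, as in this toy case, is the situation the slide flags: variability between groups exceeds variability within groups, so global changes across groups may be present.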
[Figure, panels A, B, C: density plots of (i) rlogTransformation counts for GG (n=18), AG (n=32), AA (n=15); (ii) log2 PM values for Nonsmoker (n=15), Smoker (n=15), Asthmatic (n=15); (iii) log2 PM values for Brain (GSE17612, n=23), Brain (GSE21935, n=19), Liver (GSE29721, n=10), Liver (GSE14668, n=20), Liver (GSE39841, n=10)]
Panel labels: Global changes; Targeted changes; Targeted changes

• Observed variation: small variability within groups, small variability across groups
– Reason? Small technical variability; no global changes
– What to do? Use quantile normalization (but not necessary)
• Observed variation: large variability within groups, small variability across groups
– Reason? Large technical variability or batch effects within groups; no global changes
– What to do? Use quantile normalization
• Observed variation: small variability within groups, large variability across groups
– Reason? Global technical variability or batch effects across groups
– What to do? Use quantile normalization
– Reason? Global biological variability across groups
– What to do? Do not use quantile normalization
– Raw data alone cannot detect the difference between these two reasons

quantro will detect global differences due to both technical and biological variation
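The decision table can be written down as a small lookup, which makes explicit that the last case needs outside knowledge. This is a hypothetical helper, not part of quantro. The function name, the boolean inputs, and the `across_cause` argument are assumptions; how "large" is judged is left to the analyst.

```python
from typing import Optional


def recommend(within_large: bool, across_large: bool,
              across_cause: Optional[str] = None) -> str:
    """Encode the slide's decision table.

    across_cause is only consulted when variability across groups is large;
    it must be supplied by the analyst ("technical" or "biological"),
    because raw data alone cannot tell the two apart.
    """
    if not across_large:
        if within_large:
            return "use quantile normalization"
        return "use quantile normalization (but not necessary)"
    if across_cause == "technical":
        return "use quantile normalization"
    if across_cause == "biological":
        return "do not use quantile normalization"
    return "undetermined: cause of global differences is unknown"


advice = recommend(within_large=False, across_large=True,
                   across_cause="biological")
```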
any data!
• Statistics can help identify relevant biological variation in genomics data
– Differences in CpGs
– Smoothing across genomic regions
• Statistics can help eliminate unwanted technical variation in genomics data
– “Batch effects”