Introduction to PCA, the underlying math, and some applications. An understanding of basic linear algebra is assumed. For the original HTML5 deck please go to http://benmabey.com/presentations/pca-tutorial/
This deck can be found in its original (and better) HTML5 form at benmabey.com/presentations/pca-tutorial/. N.B.: the deck isn't completely standalone, since I don't explain every step the way I did when actually presenting it. That said, I think the deck should be useful for anyone who wants to get a quick idea of what PCA is and the math behind it (I only cover conventional PCA, not probabilistic interpretations). I am inconsistent with some of my equations to make some of the algebra easier (all legal, though!), which I explained during the actual presentation. For people who want to go deeper and follow the math more closely, I highly recommend the tutorial by Jonathan Shlens, which is where I got most of my derivations. See the last slide of the deck for additional resources.
Most PCA tutorials go something like this:

1. Organize the dataset as a matrix.
2. Subtract off the mean for each measurement.
3. Calculate the covariance (or correlation) matrix and perform eigendecomposition.
4. Alternatively, perform SVD.
5. Profit! (A quick R sketch of this recipe follows.)
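Before diving into the math, here is a minimal sketch of that recipe in R (my own code, not from the deck, using the built-in iris data that the rest of the deck uses; variable names are mine):

X <- as.matrix(iris[-5])                     # 1. organize the dataset as a matrix
X <- scale(X, center = TRUE, scale = FALSE)  # 2. subtract off the mean for each measurement
e <- eigen(cov(X))                           # 3. covariance matrix + eigendecomposition
scores <- X %*% e$vectors                    # project the data onto the principal components
head(scores, 3)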
library(PerformanceAnalytics)
chart.Correlation(iris[-5], bg = iris$Species, pch = 21)
Variance, covariance, and correlation: the math, the useful identities, and the intuition.

- Variance (dispersion of a single variable): $\sigma_A^2 = \mathrm{var}(A) = E[(A - \mu_A)^2] = \frac{1}{n}\sum_{i=1}^{n}(a_i - \mu_A)^2$
- Standard deviation: $\sigma_A = \mathrm{stddev}(A) = \sqrt{\mathrm{var}(A)}$
- Covariance (relationship between two variables): $\sigma_{AB}^2 = \mathrm{cov}(A, B) = E[(A - \mu_A)(B - \mu_B)] = \frac{1}{n}\sum_{i=1}^{n}(a_i - \mu_A)(b_i - \mu_B)$
- Correlation (a unitless measure in $[-1.0, 1.0]$): $\rho_{AB} = \frac{\mathrm{cov}(A, B)}{\mathrm{stddev}(A)\,\mathrm{stddev}(B)}$
- Note that $\mathrm{cov}(A, A) = \mathrm{var}(A)$, and $\sigma_{AB} = 0$ if and only if $A$ and $B$ are uncorrelated.
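As a quick sanity check of these definitions, here is a toy example in R (my own vectors, not from the deck; note R uses the $n-1$ denominator rather than the $n$ used in the slides' formulas):

a <- c(2, 4, 6, 8)
b <- c(1, 3, 2, 5)
var(a)                       # dispersion of a
sd(a)^2                      # the standard deviation squared is the variance
cov(a, b)                    # relationship between a and b
cov(a, b) / (sd(a) * sd(b))  # correlation: unitless, in [-1, 1]
cor(a, b)                    # ...which matches the built-in
cov(a, a)                    # cov(A, A) == var(A)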
Assume each measurement in $X$ has been centered so that it has zero mean. Now

$$\Sigma_X = \begin{bmatrix} \sigma^2_{1,1} & \sigma^2_{1,2} & \cdots & \sigma^2_{1,n} \\ \sigma^2_{2,1} & \sigma^2_{2,2} & \cdots & \sigma^2_{2,n} \\ \vdots & \vdots & \ddots & \vdots \\ \sigma^2_{n,1} & \sigma^2_{n,2} & \cdots & \sigma^2_{n,n} \end{bmatrix}$$

where $\sigma^2_{AB} = \frac{1}{n}\sum_{i=1}^{n} a_i b_i$ (the means are zero), so the whole covariance matrix can be written as

$$\Sigma_X = \frac{1}{n} X X^T$$
center <- function(x) x - mean(x)
iris.centered <- apply(as.matrix(iris[-5]), 2, center)
(t(iris.centered) %*% iris.centered) / (nrow(iris) - 1)

##              Sepal.Length Sepal.Width Petal.Length Petal.Width
## Sepal.Length      0.68569    -0.04243       1.2743      0.5163
## Sepal.Width      -0.04243     0.18998      -0.3297     -0.1216
## Petal.Length      1.27432    -0.32966       3.1163      1.2956
## Petal.Width       0.51627    -0.12164       1.2956      0.5810
This matches R's built-in covariance function:

cov(iris[-5])

##              Sepal.Length Sepal.Width Petal.Length Petal.Width
## Sepal.Length      0.68569    -0.04243       1.2743      0.5163
## Sepal.Width      -0.04243     0.18998      -0.3297     -0.1216
## Petal.Length      1.27432    -0.32966       3.1163      1.2956
## Petal.Width       0.51627    -0.12164       1.2956      0.5810
The goal: find some orthonormal matrix $P$ in $Y = PX$ such that $\Sigma_Y = \frac{1}{n} Y Y^T$ is a diagonal matrix. The rows of $P$ are the principal components of $X$. Note that I transposed the design matrix (the data), so the covariance calculation is also reversed. This will make our life easier...
Now, choosing what $P$ is... Let every row of $P$, $p_i$, be an eigenvector of $\Sigma_X$. What this means is that $P = Q^T$, where $Q$ comes from the eigendecomposition of $\Sigma_X$:

$$\Sigma_X = Q \Lambda Q^T$$
Now we turn the crank...

$$\begin{aligned}
\Sigma_Y &= P \Sigma_X P^T \\
&= P (Q \Lambda Q^T) P^T \\
&= P (P^T \Lambda P) P^T \\
&= (P P^T) \Lambda (P P^T) \\
&= I \Lambda I \qquad \text{(since $P$ is orthonormal, $P P^T = I$)} \\
&= \Lambda
\end{aligned}$$

- The principal components are linear combinations of the original features of $X$.
- The principal components of $X$ are the eigenvectors of $\Sigma_X$.
- The corresponding eigenvalues lie in $\Sigma_Y = \Lambda$ and represent the variance along each principal component.
In R:

iris.eigen <- eigen(cov(iris.centered))
rownames(iris.eigen$vectors) <- colnames(iris.centered)
colnames(iris.eigen$vectors) <- c("PC1", "PC2", "PC3", "PC4")
iris.eigen

## $values
## [1] 4.22824 0.24267 0.07821 0.02384
##
## $vectors
##                   PC1      PC2      PC3     PC4
## Sepal.Length  0.36139 -0.65659 -0.58203  0.3155
## Sepal.Width  -0.08452 -0.73016  0.59791 -0.3197
## Petal.Length  0.85667  0.17337  0.07624 -0.4798
## Petal.Width   0.35829  0.07548  0.54583  0.7537
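As a sanity check (my own verification, not a slide from the deck), rotating the covariance matrix by $P = Q^T$ should diagonalize it, leaving the eigenvalues on the diagonal:

P <- t(iris.eigen$vectors)                   # rows of P are the principal components
round(P %*% cov(iris.centered) %*% t(P), 4)  # Sigma_Y: diagonal, off-diagonal terms vanish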
The squared loadings are a bit more intuitive to read: each column shows the variables' contributions to that PC and sums to 1.

iris.eigen$vectors^2

##                   PC1      PC2      PC3     PC4
## Sepal.Length 0.130600 0.431109 0.338759 0.09953
## Sepal.Width  0.007144 0.533136 0.357497 0.10222
## Petal.Length 0.733885 0.030058 0.005812 0.23025
## Petal.Width  0.128371 0.005697 0.297932 0.56800
library(lattice)  # for dotplot
squared <- iris.eigen$vectors^2
sorted.squares <- squared[order(squared[, 1]), 1]
dotplot(sorted.squares, main = "Variable Contributions to PC1", cex = 1.5, col = "red")
library(FactoMineR)
iris.pca <- PCA(iris, quali.sup = 5)
plot(iris.pca, choix = "var", title = "Correlation Circle")
data(decathlon)  # example dataset that ships with FactoMineR
res.pca <- PCA(decathlon, quanti.sup = 11:12, quali.sup = 13)
plot(res.pca, choix = "var", title = "Correlation Circle")
What does the variance (the eigenvalues) tell us?

iris.eigen$values  # The variance for each corresponding PC

## [1] 4.22824 0.24267 0.07821 0.02384
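One useful consequence (my own check, not in the original deck): the eigenvalues partition the total variance, so their sum equals the trace of the covariance matrix.

sum(iris.eigen$values)    # ~4.573 (= 4.22824 + 0.24267 + 0.07821 + 0.02384)
sum(diag(cov(iris[-5])))  # the same total variance, just split up differently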
library(FactoMineR)
iris.pca <- PCA(iris, quali.sup = 5)
plot(iris.pca, habillage = 5, col.hab = c("green", "blue", "red"),
     title = "Dataset projected onto PC1-2 Subspace")
How many PCs should you keep? One approach is the ratio of variance retained (e.g. 99% is common):

$$\frac{\sum_{i=1}^{k} \sigma_i^2}{\sum_{i=1}^{n} \sigma_i^2}$$

cumsum(iris.eigen$values / sum(iris.eigen$values))

## [1] 0.9246 0.9777 0.9948 1.0000
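A small follow-up of my own: picking the smallest $k$ that retains a given fraction of the variance is a one-liner.

variance.retained <- cumsum(iris.eigen$values) / sum(iris.eigen$values)
which(variance.retained >= 0.99)[1]  # k = 3 for iris, per the cumulative ratios above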
iris.prcomp <- prcomp(iris[-5], center = TRUE, scale = FALSE)
screeplot(iris.prcomp, type = "line", main = "Scree Plot")
Another rule of thumb: keep only the components whose eigenvalue is larger than the average eigenvalue. For a correlation PCA, this rule boils down to the standard advice to "keep only the eigenvalues larger than 1".

eigen(cor(iris.centered))$values

## [1] 2.91850 0.91403 0.14676 0.02071
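Applying that rule in R (my own check): compare each eigenvalue against the average eigenvalue, which is exactly 1 for a correlation matrix.

ev <- eigen(cor(iris.centered))$values
ev > mean(ev)  # TRUE FALSE FALSE FALSE: only PC1 survives the rule for iris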
Biplots are another way to aid interpretation...

iris.prcomp <- prcomp(iris[-5], center = TRUE, scale = FALSE)
biplot(iris.prcomp)
What if the measurements are on very different scales? Which PCA should we perform?

scaled.iris <- iris
scaled.iris$Petal.Length <- iris$Petal.Length / 1000
scaled.iris$Petal.Width <- iris$Petal.Width / 1000
scaled.iris$Sepal.Width <- iris$Sepal.Width * 10
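To see what this rescaling does (my own comparison, not from the deck): in a covariance PCA the inflated Sepal.Width swamps PC1, while asking prcomp to scale the variables (a correlation PCA) is unaffected by the change of units.

summary(prcomp(scaled.iris[-5], center = TRUE, scale. = FALSE))  # dominated by Sepal.Width
summary(prcomp(scaled.iris[-5], center = TRUE, scale. = TRUE))   # same as a correlation PCA of the original iris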
Standardize the data:

# (In practice just use the built-in cor function)
standardize <- function(x) {
  centered <- x - mean(x)
  centered / sd(centered)
}
scaled.iris.standardized <- apply(as.matrix(scaled.iris[-5]), 2, standardize)
(t(scaled.iris.standardized) %*% scaled.iris.standardized) / (nrow(iris) - 1)

##              Sepal.Length Sepal.Width Petal.Length Petal.Width
## Sepal.Length       1.0000     -0.1176       0.8718      0.8179
## Sepal.Width       -0.1176      1.0000      -0.4284     -0.3661
## Petal.Length       0.8718     -0.4284       1.0000      0.9629
## Petal.Width        0.8179     -0.3661       0.9629      1.0000
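Equivalently (my own note): standardizing and then doing a covariance PCA is the same thing as a correlation PCA, which prcomp exposes directly.

all.equal(cov(scaled.iris.standardized), cor(scaled.iris[-5]), check.attributes = FALSE)
prcomp(scaled.iris[-5], scale. = TRUE)$rotation  # principal components of the correlation matrix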
Why use SVD? And how is it equivalent? Short answers on why:

- SVD is more numerically stable.
- It is more efficient, especially when operating on a wide matrix: you skip the step of calculating the covariance matrix (see the toy illustration below).
- There are a lot of SVD algorithms and implementations to choose from.
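A rough illustration of the wide-matrix point (my own toy example, not from the deck): with p = 5000 features and only n = 50 samples, the covariance matrix would be 5000 x 5000, but SVD works directly on the 50 x 5000 centered data matrix.

set.seed(1)
wide <- matrix(rnorm(50 * 5000), nrow = 50)
wide.centered <- scale(wide, center = TRUE, scale = FALSE)
s <- svd(wide.centered, nu = 0)   # right singular vectors are the principal components
dim(s$v)                          # 5000 x 50: at most n - 1 meaningful PCs
s$d[1:5]^2 / (nrow(wide) - 1)     # leading variances, no 5000 x 5000 matrix needed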
Looks familiar... Recall that the eigendecomposition of a symmetric matrix $A$ is $A = Q \Lambda Q^T$ with orthonormal $Q$. Now consider the SVD $A = U D V^T$:

$$\begin{aligned}
A A^T &= U D V^T (U D V^T)^T \\
&= U D V^T V D^T U^T \\
&= U D D^T U^T \qquad \text{($V^T V = I$ since $V$, like $U$, is orthonormal)} \\
&= U D^2 U^T \qquad \text{(since $D$ is a diagonal matrix)}
\end{aligned}$$

Therefore $U$ contains the eigenvectors of $A A^T$ and $D^2$ the eigenvalues. Likewise, $V$ contains the eigenvectors of $A^T A$ and $D^2$ the eigenvalues.
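A quick numerical check of these relationships (my own, on a random matrix):

set.seed(42)
A <- matrix(rnorm(12), nrow = 3)
s <- svd(A)
e <- eigen(A %*% t(A))
all.equal(s$d^2, e$values)           # squared singular values = eigenvalues of A A^T
all.equal(abs(s$u), abs(e$vectors))  # U matches the eigenvectors of A A^T, up to sign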
Once more... let $Y$ be a new matrix

$$Y = \frac{1}{\sqrt{n}} X^T$$

where each column of $Y$ is mean centered. Then

$$Y^T Y = \left(\frac{1}{\sqrt{n}} X^T\right)^T \left(\frac{1}{\sqrt{n}} X^T\right) = \frac{1}{n} X X^T = \Sigma_X$$

So, if we run SVD on our $Y$, then $V$ will contain the eigenvectors of $Y^T Y = \Sigma_X$... $X$'s principal components! Our eigenvalues, the variances, will be $D^2$.
y <- iris.centered / sqrt(nrow(iris) - 1)
y.svd <- svd(y)
pcs <- y.svd$v
rownames(pcs) <- colnames(iris.centered)
colnames(pcs) <- c("PC1", "PC2", "PC3", "PC4")
pcs

##                   PC1      PC2      PC3     PC4
## Sepal.Length  0.36139 -0.65659  0.58203  0.3155
## Sepal.Width  -0.08452 -0.73016 -0.59791 -0.3197
## Petal.Length  0.85667  0.17337 -0.07624 -0.4798
## Petal.Width   0.35829  0.07548 -0.54583  0.7537

y.svd$d^2  # variances

## [1] 4.22824 0.24267 0.07821 0.02384
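In practice you rarely do this by hand: prcomp is built on SVD and gives the same results (my own comparison; individual PCs may come back with flipped signs, which changes nothing).

iris.prcomp <- prcomp(iris[-5], center = TRUE, scale. = FALSE)
iris.prcomp$rotation  # the same loadings as pcs above, up to sign
iris.prcomp$sdev^2    # the same variances as y.svd$d^2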
1. Jon Shlens (versions 2.0 and 3.1), A Tutorial on Principal Component Analysis
2. H. Abdi and L. J. Williams (2010), Principal Component Analysis
3. Andrew Ng (2009), CS229 Lecture Notes 10
4. Andrew Ng (2009), CS229 Lectures 14 & 15
5. Christopher Bishop (2006), Pattern Recognition and Machine Learning, section 12.1
6. Steve Pittard (2012), Principal Components Analysis Using R
7. Quick-R, Principal Components and Factor Analysis (good pointers to additional R packages)
8. C. Ding and X. He (2004), K-means Clustering via Principal Component Analysis