
Matrix and Tensor Factorization for Machine Learning

You can download the slides here.

Matrices and tensors, or multidimensional arrays, are highly versatile data structures. They can store data in a variety of formats: images, video, tables, and sensor data. By decomposing such matrices and tensors, we can extract insights from the data: patterns, features, etc. In this lecture, I introduce singular value decomposition (SVD) for matrices, non-negative matrix factorization (NMF) for non-negative matrices, and CP and Tucker decomposition for tensors. The objective is not only to understand the various decomposition methods individually but also to acquire the know-how to select the appropriate decomposition method according to the situation and the constraints of real-world scenarios, focusing on the properties of each decomposition. As downstream applications of the methods, the lecture covers subspace methods (CLAFIC) for classification tasks and EM-based methods for imputing missing values in the input data. In the last segment of the lecture, I cover some of the difficulties of tensor decomposition (TD). TD suffers from difficulties that do not occur in SVD, such as ill-posedness and NP-hardness in optimization, and I will discuss recent research trends for avoiding these difficulties.

Kazu Ghalamkari

July 26, 2024

Transcript

  1. Matrix and Tensor Factorization for Machine Learning Special Topics in

    Mechano-Informatics Ⅱ@ Tokyo University, 2024 Kazu Ghalamkari RIKEN AIP @KazuGhalamkari
  2. 3 Announcements Please click the raise hand button if

    you can see the slides and hear me clearly. Please refrain from redistributing the slides. Feel free to ask any questions using the chat window during the lecture. 🗎 The assignment is available in the final section. Some parts of the slides will be skipped due to time limitations.
  3. 4 Overview ▪ Introduction: Why do we decompose data? ▪

    A quick review of matrix rank ▪ Singular value decomposition (SVD) and low-rank approximation ▪ Kernel subspace method and its applications (denoising, anomaly detection) ▪ Non-negative matrix factorization ▪ Tensor low-rank decomposition and many-body approximation
  4. Diverse real-world data ▪ Purchase history ▪ Tabular data Sepal

    Length [cm] Sepal Width [cm] Petal Length [cm] Petal Width [cm] Species 5.1 3.5 1.4 0.2 setosa 7 3.2 4.7 1.4 versicolor 6.4 3.2 4.5 1.5 versicolor 4.7 3.2 1.3 0.2 setosa 4.6 3.1 1.5 0.2 setosa 6.5 2.8 4.6 1.5 versicolor 6.6 2.9 4.6 1.3 versicolor 4.9 3 1.4 0.2 setosa 5.2 2.7 3.9 1.4 versicolor ▪ Grayscale image Decomposing the data as a matrix is beneficial ▪ Spectrum information Image from https://www.mathworks.com/help/images/image-types-in-the-toolbox_ja_JP.html Image from https://sigview.com/help/Time-FFTSpectrogram.html Image from Mithy, S. A., et al. "Classification of Iris Flower Dataset using Different Algorithms." Int. J. Sci. Res. In (2022).
  5. What is a good decomposition? ・Is the decomposed representation interpretable?

    ・Is the decomposition scalable to large data? ・Is the decomposed representation unique? ・Can the decomposition be done even with missing values? There are many decomposition methods; choose the appropriate one by considering such real-world constraints.
  6. 11 Overview ▪ Introduction: Why do we decompose data? ▪

    A quick review of matrix rank ▪ Singular value decomposition (SVD) and low-rank approximation ▪ Kernel subspace method and its applications (denoising, anomaly detection) ▪ Non-negative matrix factorization ▪ Tensor low-rank decomposition and many-body approximation
  7. Rank in linear algebra Each column is a constant multiple

    of vector . We call such a matrix a rank-1 matrix. 6-dimensional vector space 12
  8. Rank in linear algebra Each row is a constant multiple

    of vector . We call such a matrix a rank-1 matrix. 6-dimensional vector space 13
  9. Rank in linear algebra Each column of this matrix is

    written as a linear combination of the basis . A rank-2 matrix is a matrix whose columns are linear combinations of two linearly independent vectors. Each column vector lies in a 2-dimensional plane in the 6-dimensional vector space. Plane spanned by 6-dimensional vector space 15
  10. Rank in linear algebra When the column vector of A

    can be written as a linear combination of 𝑟 linearly independent vectors (a basis), the smallest such integer 𝑟 is the rank of the matrix A. A matrix whose rank is 𝑟 is called a rank-𝑟 matrix. Def. 16
  11. Properties of the matrix rank Basis A is full rank

    when . A is rank deficient if . Rank Property 18
  12. 21 Summary: matrix rank Basis Rank Properties When the column

    vector of A can be written as a linear combination of 𝑟 linearly independent vectors (a basis), the rank of matrix A is the smallest such integer 𝑟. Def.
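A quick numerical check of this definition, assuming only NumPy (the vectors and coefficients below are arbitrary): a matrix whose columns are all linear combinations of two linearly independent vectors has rank 2.

```python
import numpy as np

# Two linearly independent 6-dimensional basis vectors
u1 = np.array([1., 0., 2., 0., 1., 3.])
u2 = np.array([0., 1., 1., 2., 0., 1.])

# Each column of A is a linear combination of u1 and u2,
# so A is (at most) a rank-2 matrix.
A = np.outer(u1, [1., 2., 0., 1.]) + np.outer(u2, [3., 0., 1., 2.])

print(A.shape)                   # (6, 4)
print(np.linalg.matrix_rank(A))  # 2
```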
  13. 22 Overview ▪ Introduction: Why do we decompose data? ▪

    A quick review of matrix rank ▪ Singular value decomposition (SVD) and low-rank approximation ▪ Kernel subspace method and its applications (denoising, anomaly detection) ▪ Non-negative matrix factorization ▪ Tensor low-rank decomposition and many-body approximation
  14. Singular value decomposition, SVD So far, we assumed that

    the basis could be any set of linearly independent vectors. SVD imposes orthonormality on the basis. 23
  15. Singular value decomposition, SVD Most significant rank-1 factor Least significant

    rank-1 factor Important term Unimportant term Imposes orthonormality on : weight (importance) of the following rank-1 matrix Singular value
  16. Singular value decomposition, SVD ▪ Any matrix A can be

    decomposed into the product of U, V, and a diagonal matrix Σ It can be a complex, singular, or rectangular matrix 26
  17. Singular value decomposition, SVD ▪ Any matrix A can be

    decomposed into the product of U, V, and a diagonal matrix Σ 27 It can be a complex, singular, or rectangular matrix
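In symbols, the (compact) SVD of a rank-r matrix A can be written as follows; this is the standard statement, using the same U, Σ, and V as on the slides:

```latex
A = U \Sigma V^{\top} = \sum_{i=1}^{r} \sigma_i \,\boldsymbol{u}_i \boldsymbol{v}_i^{\top},
\qquad U^{\top}U = I, \quad V^{\top}V = I, \quad
\Sigma = \operatorname{diag}(\sigma_1,\dots,\sigma_r), \quad \sigma_1 \ge \dots \ge \sigma_r > 0 .
```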
  18. Singular value decomposition, SVD Rank and Singular Value 0 Unnecessary

    The rank of matrix A is the number of non-zero singular values of A ▪ A is a rank- matrix 29
  19. Low-rank approximation by SVD They are not as important, so

    let's set them to 0 and ignore them. Important term Unimportant term ≒ 0 ▪ Decomposing rank-𝑟 matrix 𝐀 into orthogonal 𝐔 and 𝐕 and diagonal matrix ∑. Singular value : weight of the following rank-1 matrix 30
  20. Low-rank approximation by SVD ≒ 0 ▪ Decomposing rank-𝑟 matrix

    𝐀 into orthogonal 𝐔 and 𝐕 and diagonal matrix ∑. Approximating a matrix by a smaller rank matrix Low-rank approximation They are not as important, so let's set them to 0 and ignore them. Singular value : weight of the following rank-1 matrix 31
  21. The BEST Low-rank approximation by SVD ▪ SVD provides the

    best low-rank approximation minimizing Frobenius norm. is the best rank-k approximation for 𝐀, which minimizes Frobenius norm. Evaluate how close they are via the Frobenius norm. Frobenius norm Eckart-Young Theorem (1936) 32
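Written out (a standard formulation of the theorem, with A_k denoting the SVD truncated to the k largest singular values):

```latex
A_k = \sum_{i=1}^{k} \sigma_i \,\boldsymbol{u}_i \boldsymbol{v}_i^{\top}
\quad\text{achieves}\quad
\|A - A_k\|_F = \min_{\operatorname{rank}(B)\le k} \|A - B\|_F
= \sqrt{\sigma_{k+1}^{2} + \dots + \sigma_{r}^{2}} .
```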
  22. The BEST Low-rank approximation by SVD ▪ SVD is the

    best low-rank approximation minimizing any unitarily invariant norm. is the best rank-r matrix for 𝐀, which minimizes any unitarily invariant norm. Evaluate how similar they are via a unitarily invariant norm. Eckart-Young-Mirsky Theorem (1960) * A unitarily invariant norm ‖⋅‖∗ satisfies ‖𝐏‖∗ = ‖𝐗𝐏𝐘‖∗ for any matrix P and any unitary matrices X and Y. 33
  23. Low-rank approximation of matrices saves memory. Required memory before the

    approximation Required memory sizes after approximation Low-rank approximation reduces required memory storage Example: Assuming is sufficiently smaller than . 34
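As a concrete instance of the storage argument (the numbers match the 2000×1500 image example used later; the snippet is illustrative only): storing A directly needs nm values, while a rank-k factorization needs about k(n+m).

```python
n, m, k = 2000, 1500, 20

full_storage = n * m              # store every entry of the n×m matrix A
low_rank_storage = k * (n + m)    # store k left vectors (length n) and k right vectors (length m)

print(full_storage)                           # 3000000
print(low_rank_storage)                       # 70000
print(100 * low_rank_storage / full_storage)  # ≈ 2.33 (% of the original storage)
```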
  24. Singular values and eigenvalues Put into The 𝑖-th singular value

    𝜎𝑖 of matrix 𝐀 is the square root of the 𝑖-th eigenvalue λ𝑖 of matrix 𝐀𝐀⊤ 35
  25. Singular values and eigenvalues Each column vector is orthonormal Put

    into The 𝑖-th singular value 𝜎𝑖 of matrix 𝐀 is the square root of the 𝑖-th eigenvalue λ𝑖 of matrix 𝐀𝐀⊤ 36
  26. Singular values and eigenvalues Each column vector is orthonormal Put

    into The 𝑖-th singular value 𝜎𝑖 of matrix 𝐀 is the square root of the 𝑖-th eigenvalue λ𝑖 of matrix 𝐀𝐀⊤ 37
  27. Singular values and eigenvalues Each column vector is orthonormal Put

    into The 𝑖-th singular value 𝜎𝑖 of matrix 𝐀 is the square root of the 𝑖-th eigenvalue λ𝑖 of matrix 𝐀𝐀⊤ 38
  28. The 𝑖-th singular value 𝜎𝑖 of matrix 𝐀 is the

    square root of the 𝑖-th eigenvalue λ𝑖 of matrix 𝐀𝐀⊤ Eigenvalue of Eigenvector of Singular values and eigenvalues Each column vector is orthonormal Put into 𝑙 -th column of U 𝑙 -th column of V (𝑙, 𝑙)-entry of ∑
  29. Singular values and eigenvalues Eigenvector of Put into Eigenvalue of

    Eigenvector of The 𝑖-th singular value 𝜎𝑖 of matrix 𝐀 is the square root of the 𝑖-th eigenvalue λ𝑖 of matrix 𝐀𝐀⊤ Put into 𝑙 -th column of U 𝑙 -th column of V (𝑙, 𝑙)-entry of ∑
  30. Singular values and eigenvalues (Left singular vector) The 𝑖-th singular

    value 𝜎𝑖 of matrix 𝐀 is the square root of the 𝑖-th eigenvalue λ𝑖 of matrix 𝐀𝐀⊤ SVD of A is an eigenvalue decomposition of 𝐀𝐀⊤ and 𝐀⊤𝐀 The rank of matrix 𝐀 equals the number of non-zero singular values of 𝐀. Each column vector of 𝐕 is an eigenvector of 𝐀⊤𝐀. Each column vector of 𝐔 is an eigenvector of 𝐀𝐀⊤. (Right singular vector)
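A quick numerical check of this relationship on a random matrix (NumPy assumed; a 5×3 matrix has at most 3 non-zero singular values):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((5, 3))

# Singular values of A (returned in descending order)
sigma = np.linalg.svd(A, compute_uv=False)

# Eigenvalues of A A^T; eigvalsh returns them in ascending order, so reverse
lam = np.linalg.eigvalsh(A @ A.T)[::-1]

print(sigma)             # singular values of A
print(np.sqrt(lam[:3]))  # square roots of the leading eigenvalues: the same values
```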
  31. Example of low-rank approximation by SVD in Python Smaller rank

    𝑘 worsens the approximation performance. 46
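The Python code shown on this slide is not reproduced in the transcript; the following is a minimal sketch of what such an example typically looks like (NumPy only, random data, my own function name):

```python
import numpy as np

def low_rank_approx(A, k):
    """Best rank-k approximation of A in the Frobenius norm (Eckart-Young)."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return (U[:, :k] * s[:k]) @ Vt[:k, :]

rng = np.random.default_rng(0)
A = rng.standard_normal((100, 80))

for k in (1, 5, 20, 50):
    A_k = low_rank_approx(A, k)
    err = np.linalg.norm(A - A_k, "fro") / np.linalg.norm(A, "fro")
    print(f"rank {k:2d}: relative Frobenius error = {err:.3f}")
```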
  32. Image reconstruction by SVD How many ranks are needed for

    reconstruction? 2000×1500 (2000+1500)×5 (2000+1500)×20 (2000+1500)×100 𝑘=100, 11.67% storage 𝑘=20, 2.33% storage 𝑘=5, 0.57% storage Image from Steven L. Brunton, J. Nathan Kutz, “Data-Driven Science and Engineering: Machine Learning, Dynamical Systems, and Control” 47
  33. Hyperparameter tuning in SVD Accurate Rough Increased memory Slow Memory

    saving High speed (2000+1500)×5 (2000+1500)×20 (2000+1500)×100 Selecting the appropriate rank is a trade-off problem. It requires repeated trial and error. A typical hyperparameter tuning problem 𝑘=100, 11.67% storage 𝑘=20, 2.33% storage 𝑘=5, 0.57% storage Image from Steven L. Brunton, J. Nathan Kutz, “Data-Driven Science and Engineering: Machine Learning, Dynamical Systems, and Control” 48
  34. Data are represented by a low-rank structure plus noise. Most

    of the data in this world has a low-rank structure. Real data Low-rank matrix Noise Low-rank approximation is good enough for many datasets. ▪ Low-rank approximation can approximate real-world data well. 49
  35. 50 Singular values Summary: Low-rank approximation and SVD ▪ SVD

    can be applied to any matrix. ▪ SVD finds the best low-rank matrix minimizing the Frobenius norm. are orthonormal. SVD provides the best rank-k approximation that minimizes the Frobenius norm. Eckart − Young Theorem ▪ Low-rank approximation reduces data memory requirements.
  36. 52 Overview ▪ Introduction: Why do we decompose data? ▪

    A quick review of matrix rank ▪ Singular value decomposition (SVD) and low-rank approximation ▪ Kernel subspace method and its applications (denoising, anomaly detection) ▪ Non-negative matrix factorization ▪ Tensor low-rank decomposition and many-body approximation
  37. Classification of Iris Data Sepal Petal Sepal Petal ▪ Iris

    Dataset[1] [1] Fisher, Ronald A. "The use of multiple measurements in taxonomic problems." Annals of eugenics 7.2 (1936): 179-188. We estimate the species of iris data-points based on their sepal and petal lengths. Sepal Length [cm] Sepal Width [cm] Petal Length [cm] Petal Width [cm] Species 5.1 3.5 1.4 0.2 setosa 7 3.2 4.7 1.4 versicolor 6.4 3.2 4.5 1.5 versicolor 4.7 3.2 1.3 0.2 setosa 4.6 3.1 1.5 0.2 setosa 6.5 2.8 4.6 1.5 versicolor 6.3 3.3 4.7 1.6 versicolor 6.6 2.9 4.6 1.3 versicolor 4.9 3 1.4 0.2 setosa 5.2 2.7 3.9 1.4 versicolor 5.9 3 4.2 1.5 ??? 5.6 3 4.5 1.5 ??? 4.7 3.2 1.6 0.2 ??? … … Image from Mithy, S. A., et al. "Classification of Iris Flower Dataset using Different Algorithms." Int. J. Sci. Res. In (2022). 53
  38. Classification by the CLAFIC method Samples of class A Samples

    of class B Estimate class of each sample Classification Training data setosa versicolor Sepal Length [cm] Sepal Width [cm] Petal Length [cm] Petal Width [cm] Species 5.1 3.5 1.4 0.2 setosa 7 3.2 4.7 1.4 versicolor 6.4 3.2 4.5 1.5 versicolor 4.7 3.2 1.3 0.2 setosa 4.6 3.1 1.5 0.2 setosa 6.5 2.8 4.6 1.5 versicolor 6.3 3.3 4.7 1.6 versicolor 6.6 2.9 4.6 1.3 versicolor 4.9 3 1.4 0.2 setosa 5.2 2.7 3.9 1.4 versicolor 5.9 3 4.2 1.5 ??? 5.6 3 4.5 1.5 ??? 4.7 3.2 1.6 0.2 ??? … … 54
  39. Classification by the CLAFIC method Sepal Length [cm] Petal Width

    [cm] Sepal Width [cm] Samples of class A Samples of class B Estimate class of each sample Classification Training data setosa versicolor 55
  40. Classification by the CLAFIC method Samples of class A Samples

    of class B Estimate class of each sample Classification Training data setosa versicolor Sepal Length [cm] Petal Width [cm] Sepal Width [cm] 56
  41. Classification by the CLAFIC method Samples of class A Samples

    of class B Estimate class of each sample Classification Training data setosa versicolor 57
  42. Classification by the CLAFIC method Samples of class A Samples

    of class B Estimate class of each sample Classification Training data setosa versicolor 58
  43. Classification by the CLAFIC method Samples of class A Samples

    of class B Estimate class of each sample Classification Training data setosa versicolor 59
  44. Classification by the CLAFIC method Samples of class A Samples

    of class B Estimate class of each sample Classification Training data setosa versicolor 60
  45. Classification by the CLAFIC method (We need to normalize the data

    so that both subspaces pass through the origin) Samples of class A Samples of class B Estimate class of each sample Classification Training data setosa versicolor
  46. Classification by the CLAFIC method Orthogonal basis Orthogonal basis :hyper-parameter

    Samples of class A Samples of class B Estimate class of each sample Classification Training data setosa versicolor (We need to normalize the data so that both subspaces pass through the origin)
  47. Classification by the CLAFIC method ▪ Inference Find the closest

    subspace. ▪ Learning Perform SVD on the samples of each class. Samples of class A Samples of class B Estimate class of each sample Classification Training data setosa versicolor (We need to normalize the data so that both subspaces pass through the origin) If 𝑑𝑎 < 𝑑𝑏 , 𝑐 belongs to class A. If 𝑑𝑎 > 𝑑𝑏 , 𝑐 belongs to class B.
  48. 1-Nearest Neighbor and CLAFIC 1-NN Classification by the nearest training data

    Classification by the nearest subspace CLAFIC If 𝑑𝑎 < 𝑑𝑏 , 𝑐 belongs to class A. If 𝑑𝑎 > 𝑑𝑏 , 𝑐 belongs to class B. 64
  49. 65 Summary: Classification by subspace method ▪ Learning SVD for

    each class. ▪ Inference Find the closest subspace. (We need to normalize the data so that both subspaces pass through the origin) If 𝑑𝑎 < 𝑑𝑏 , 𝑐 belongs to class A. If 𝑑𝑎 > 𝑑𝑏 , 𝑐 belongs to class B.
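A minimal sketch of CLAFIC as summarized above (NumPy only; the subspace dimension r is the hyperparameter mentioned on the slides; all function names are mine, and the samples are assumed to be already normalized so that the subspaces pass through the origin):

```python
import numpy as np

def fit_subspace(X, r):
    """Orthonormal basis (d, r) of the class subspace via SVD of the class samples (rows)."""
    _, _, Vt = np.linalg.svd(X, full_matrices=False)
    return Vt[:r].T                 # columns span the class subspace

def dist_to_subspace(c, U):
    """Distance from a test sample c to the subspace spanned by the columns of U."""
    proj = U @ (U.T @ c)
    return np.linalg.norm(c - proj)

def classify(c, U_a, U_b):
    return "A" if dist_to_subspace(c, U_a) < dist_to_subspace(c, U_b) else "B"

# X_a, X_b: training samples of classes A and B (rows = samples), c: a test sample
# U_a, U_b = fit_subspace(X_a, r=2), fit_subspace(X_b, r=2)
# print(classify(c, U_a, U_b))
```

Learning is one SVD per class; inference only requires projecting the test sample onto each class subspace and comparing distances.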
  50. Datasets that cannot be linearly separated ▪ There is no

    plane (straight line) that reduces the dimension properly. ▪ CLAFIC is ineffective for such a dataset. Two-dimensional data space Project the data to a higher dimensional space, making the data linearly separable. 66
  51. Kernel CLAFIC, Learning Projection onto the higher dimensional space makes

    the data linearly separable. 𝐿-dimensional data space 𝑀(≫ 𝐿)-dimensional space 69
  52. Kernel CLAFIC, Learning Perform SVD for each class in high-dimensional

    space . 𝐿-dimensional data space 𝑀(≫ 𝐿)-dimensional space 70
  53. Kernel CLAFIC, Inference Project the test data 𝒄 to the

    high-dimensional space and predict the class by the closest subspace. 77 𝐿-dimensional data space 𝑀(≫ 𝐿)-dimensional space If 𝑑𝑎 < 𝑑𝑏 , 𝑐 belongs to class A. If 𝑑𝑎 > 𝑑𝑏 , 𝑐 belongs to class B.
  54. No need to seek explicitly. Kernel trick 78 Large The

    distance can be written in terms of the eigenvalues and eigenvectors of the kernel matrix: 𝑀(≫ 𝐿)-dimensional space
  55. No need to seek explicitly. Kernel trick 79 Large The

    distance can be written in terms of the eigenvalues and eigenvectors of the kernel matrix: A typical example of an inner product (RBF kernel) We need the similarity between the samples. (Inner product) 𝑀(≫ 𝐿)-dimensional space
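A small sketch of the quantities involved in the kernel trick (RBF kernel, NumPy only; the bandwidth gamma is a hyperparameter I introduce for illustration). The high-dimensional feature map is never computed explicitly; only pairwise similarities are needed.

```python
import numpy as np

def rbf_kernel(X, Y, gamma=1.0):
    """K[i, j] = exp(-gamma * ||x_i - y_j||^2): inner products in the feature space."""
    sq = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * sq)

rng = np.random.default_rng(0)
X = rng.standard_normal((30, 4))   # 30 training samples in L = 4 dimensions

K = rbf_kernel(X, X)               # 30x30 kernel (Gram) matrix
lam, V = np.linalg.eigh(K)         # eigenvalues/eigenvectors of K

# Distances to the kernel subspace are expressed with lam and V only.
print(K.shape, lam[-3:])           # three largest eigenvalues
```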
  56. Denoising by Kernel PCA[1] [1] Mika, Sebastian, et al. "Kernel

    PCA and de-noising in feature spaces." Advances in neural information processing systems 11 (1998). ▪ Denoising : Remove noise from noisy data … 28 28 … 28×28=784
  57. Denoising by Kernel PCA[1] [1] Mika, Sebastian, et al. "Kernel

    PCA and de-noising in feature spaces." Advances in neural information processing systems 11 (1998). (784) 𝐿-dimensional data space
  58. Denoising by Kernel PCA[1] Project the noisy data into a

    low-dimensional space in . 𝐿-dimensional data space 𝑀(≫ 𝐿)-dimensional space
  59. The denoised image is obtained by the inverse mapping .

    Denoising by Kernel PCA[1] 𝐿-dimensional data space 𝑀(≫ 𝐿)-dimensional space
  60. Anomaly detection by Kernel PCA [2] ▪ Detect anomalous samples

    automatically. Normal Normal Anomaly Normal Normal Normal Normal [2] Hoffmann, Heiko. "Kernel PCA for novelty detection." Pattern recognition 40.3 (2007): 863-874. Anomaly 89
  61. Anomaly detection by Kernel PCA [2] [2] Hoffmann, Heiko. "Kernel

    PCA for novelty detection." Pattern recognition 40.3 (2007): 863-874. ▪ Normal samples (Training data) ▪ Normal and anomalous samples (Test data) Anomaly Normal Normal Normal Normal Normal Normal Anomaly Determine whether each sample is normal or anomalous. 90
  62. Anomaly detection by Kernel PCA [2] (784) Perform the Kernel

    PCA with normal samples only. 93 𝐿-dimensional data space 𝑀(≫ 𝐿)-dimensional space
  63. Anomaly detection by Kernel PCA [2] Next, we project the

    test data to . 94 (784) 𝐿-dimensional data space 𝑀(≫ 𝐿)-dimensional space
  64. Anomaly detection by Kernel PCA [2] 95 (784) 𝐿-dimensional data

    space Next, we project the test data to . 𝑀(≫ 𝐿)-dimensional space
  65. Anomaly detection by Kernel PCA [2] :Hyperparameter (784) 𝐿-dimensional

    data space 𝑀(≫ 𝐿)-dimensional space If , is predicted to be an anomalous sample.
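A simplified sketch of this pipeline with scikit-learn's KernelPCA. It scores anomalies by reconstruction error in the input space (via the learned inverse mapping), which is a common simplification rather than exactly Hoffmann's feature-space potential [2]; the threshold rule and all parameter values are illustrative.

```python
import numpy as np
from sklearn.decomposition import KernelPCA

rng = np.random.default_rng(0)
X_normal = rng.random((200, 784))   # training data: normal samples only
X_test = rng.random((10, 784))      # test data: may contain anomalies

# Kernel PCA fitted on normal samples only; fit_inverse_transform=True
# learns an approximate inverse mapping back to the data space.
kpca = KernelPCA(n_components=20, kernel="rbf", gamma=1e-3, fit_inverse_transform=True)
kpca.fit(X_normal)

def score(X):
    """Anomaly score: reconstruction error after projecting onto the 'normal' subspace."""
    return np.linalg.norm(X - kpca.inverse_transform(kpca.transform(X)), axis=1)

threshold = np.quantile(score(X_normal), 0.95)   # hyperparameter: 95th percentile of normal scores
print(score(X_test) > threshold)                 # True = predicted anomalous
```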
  66. 97 Summary: Kernel CLAFIC and its various applications Kernel CLAFIC

    The data become linearly separable after projection to a higher-dimensional space. Application 2 : Anomaly detection Back to data space with the inverse mapping . Anomaly detection by the distance to the subspace. Application 1 : Denoising Data Space Feature space
  67. Grassmannian Learning span 𝒖1 𝐴, … , 𝒖𝑟 𝐴 span

    𝒖1 𝐵, … , 𝒖𝑟 𝐵 , span 𝒖1 𝐶, … , 𝒖𝑟 𝐶 . Classify subspaces by the distance between subspaces. Face images from Yale Face Database http://cvc.cs.yale.edu/cvc/projects/yalefaces/yalefaces.html 𝒀𝐴 , 𝒀𝐵 , 𝒀𝐶 : Training data. Each point is a linear subspace of ℝ𝑁, where N is the size of each image. The principal angles 𝜃𝑖 are obtained by SVD of 𝒀𝐴 ⊤𝒀𝐵 . Projection metric: 𝑑(𝒀𝐴 , 𝒀𝐵 ) = (1/√2)||𝒀𝐴 𝒀𝐴 ⊤ − 𝒀𝐵 𝒀𝐵 ⊤||𝐹 = √(Σ𝑖 sin²𝜃𝑖 ). Binet-Cauchy metric: 𝑑(𝒀𝐴 , 𝒀𝐵 ) = √(1 − ∏𝑖 cos²𝜃𝑖 ). Grassmannian kernel method: Projection kernel 𝑘(𝒀𝐴 , 𝒀𝐵 ) = ||𝒀𝐴 ⊤𝒀𝐵 ||𝐹 ², Binet-Cauchy kernel 𝑘(𝒀𝐴 , 𝒀𝐵 ) = det(𝒀𝐴 ⊤𝒀𝐵 )². Useful when the number of data points used for inference is not fixed, e.g., classification that estimates a label from multiple facial images. [17] Hamm, Jihun, and Daniel D. Lee. "Grassmann discriminant analysis: a unifying view on subspace-based learning." ICML. 2008.
  68. 99 Overview ▪ Introduction: Why do we decompose data? ▪

    A quick review of matrix rank ▪ Singular value decomposition (SVD) and low-rank approximation ▪ Kernel subspace method and its applications (denoising, anomaly detection) ▪ Non-negative matrix factorization ▪ Tensor low-rank decomposition and many-body approximation
  69. Matrix factorization with non-negative constraints ▪ SVD includes negative values

    in the representation ▪ Non-negativity improves interpretability SVD imposes that the factors are orthogonal. NMF imposes that the factors are non-negative. 100
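A small scikit-learn sketch contrasting the two constraints on random non-negative data (the rank and solver settings are illustrative):

```python
import numpy as np
from sklearn.decomposition import NMF

rng = np.random.default_rng(0)
A = rng.random((100, 50))                 # non-negative data matrix

# SVD: factors are orthogonal but may contain negative entries.
U, s, Vt = np.linalg.svd(A, full_matrices=False)
print((U < 0).any(), (Vt < 0).any())      # True True

# NMF: W and H are non-negative, which eases interpretation.
model = NMF(n_components=10, init="random", random_state=0, max_iter=500)
W = model.fit_transform(A)
H = model.components_
print((W >= 0).all(), (H >= 0).all())     # True True
```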
  70. Example of non-negative matrix factorization [3] Lee, Daniel D., and

    H. Sebastian Seung. "Learning the parts of objects by non-negative matrix factorization." Nature 401.6755 (1999): 788-791. … : hyperparameter 101 1 2 3 2429 19 19 in this experiment, 𝑘 = 49.
  71. Example of non-negative matrix factorization [3] Lee, Daniel D., and

    H. Sebastian Seung. "Learning the parts of objects by non-negative matrix factorization." Nature 401.6755 (1999): 788-791. … : hyperparameter 102 1 2 3 2429 19 19 in this experiment, 𝑘 = 49. 𝑛 = 19 ∗ 19, 𝑚 = 2429
  72. The basis consists of facial parts such as eyes, mouth,

    and nose. The face is reconstructed by the addition of the parts. PCA and NMF [3] Lee, Daniel D., and H. Sebastian Seung. "Learning the parts of objects by non-negative matrix factorization." Nature 401.6755 (1999): 788-791. (Positive values are drawn in black, negative values in red) Each basis is face-like. The face is reconstructed by adding up the faces. 103 Representation by NMF Representation by SVD(PCA)
  73. The basis consists of facial parts such as eyes, mouth,

    and nose. The face is reconstructed by the addition of the parts. PCA and NMF [3] Lee, Daniel D., and H. Sebastian Seung. "Learning the parts of objects by non-negative matrix factorization." Nature 401.6755 (1999): 788-791. (Positive values are drawn in black, negative values in red) Each basis is face-like. The face is reconstructed by adding up the faces. 104 Representation by NMF Representation by SVD(PCA)
  74. Challenges of NMF Finding the best decomposition is an NP-hard problem. ※

    in terms of minimizing the Frobenius error. Gradient-based optimization of non-convex functions. (The approximate solution depends on the initial values.) The hyperparameter needs to be tuned. Convex Function Non-Convex Function NMF for a matrix is non-convex. The error :Hyperparameter 105
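To make the non-convexity concrete, here is a sketch of the classical multiplicative updates for the Frobenius objective (in the spirit of Lee & Seung; the small constant eps is mine, added for numerical safety). Running it from different random initializations typically ends at different local optima:

```python
import numpy as np

def nmf_mu(A, k, n_iter=300, seed=0, eps=1e-9):
    """NMF by multiplicative updates for the Frobenius objective."""
    rng = np.random.default_rng(seed)
    n, m = A.shape
    W, H = rng.random((n, k)), rng.random((k, m))
    for _ in range(n_iter):
        H *= (W.T @ A) / (W.T @ W @ H + eps)
        W *= (A @ H.T) / (W @ H @ H.T + eps)
    return W, H

rng = np.random.default_rng(42)
A = rng.random((60, 40))

# The objective is non-convex: different initializations reach different local optima.
for seed in range(3):
    W, H = nmf_mu(A, k=5, seed=seed)
    print(seed, np.linalg.norm(A - W @ H, "fro"))
```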
  75. Nonnegative multiple matrix factorization[6] 111 ▪ Extract patterns from playback

    history for a music recommendation system User Singer User Tag : Who listened to which singer? : Tags for artists (Pop, rock, Jazz...) : Friendship between users. [6] Takeuchi, Koh, et al. "Non-negative multiple matrix factorization." Twenty-third international joint conference on artificial intelligence. 2013. If user i and user j are friends, then . KL divergence ・Prevents over-training, ・Robust to outliers, Weights of auxiliary information
  76. Nonnegative multiple matrix factorization[6] 112 ▪ Extract patterns from playback

    history for a music recommendation system [6] Takeuchi, Koh, et al. "Non-negative multiple matrix factorization." Twenty-third international joint conference on artificial intelligence. 2013. Patterns - between users and singers - between users and genres are simultaneously extracted.
  77. 113 Summary: Non-negative matrix factorization ▪ Non-negativity improves interpretability ▪

    Many task-specific NMFs have been developed. ▪ The best solution cannot be obtained. Finding the best decomposition is an NP-hard problem ※ in terms of minimizing the Frobenius error. Gradient-based optimization of non-convex functions (the approximate solution depends on the initial values). PCA NMF Group NMF[5] NMMF[6] [5] Lee, Hyekyoung, and Seungjin Choi. "Group nonnegative matrix factorization for EEG classification." Artificial Intelligence and Statistics. PMLR, 2009. [6] Takeuchi, Koh, et al. "Non-negative multiple matrix factorization." Twenty-third international joint conference on artificial intelligence. 2013.
  78. 114 Overview ▪ Introduction: Why do we decompose data? ▪

    A quick review of matrix rank ▪ Singular value decomposition (SVD) and low-rank approximation ▪ Kernel subspace method and its applications (denoising, anomaly detection) ▪ Non-negative matrix factorization ▪ Tensor low-rank decomposition and many-body approximation
  79. NMF with missing values ▪ Data often includes missing values

    The approach of simply replacing missing values with averages or zeros doesn't work well. Estimates missing values using the assumption that the data have a low-rank structure. :Missing value Only one value is missing for simplicity. In practice, many missing values are possible. 118
  80. :Missing value NMF with missing values The space where each

    point is a matrix ・The dimension of the subspace is the number of missing values. ・Assuming A has a low-rank structure plus noise Low-rank matrix completion, EM-WNMF[4] Only one value is missing for simplicity. In practice, many missing values are possible. [4] Zhang, Sheng, et al. "Learning from incomplete ratings using non-negative matrix factorization." Proceedings of the 2006 SIAM international conference on data mining. Society for Industrial and Applied Mathematics, 2006.
  81. NMF with missing values Low-rank matrix completion, EM-WNMF[4] [4] Zhang,

    Sheng, et al. "Learning from incomplete ratings using non-negative matrix factorization." Proceedings of the 2006 SIAM international conference on data mining. Society for Industrial and Applied Mathematics, 2006. :Hyper-parameter Rank- matrices The space where each point is a matrix Only one value is missing for simplicity. In practice, many missing values are possible. :Missing value
  82. NMF with missing values Low-rank matrix completion, EM-WNMF[4] Initialize missing

    values and obtain . [4] Zhang, Sheng, et al. "Learning from incomplete ratings using non-negative matrix factorization." Proceedings of the 2006 SIAM international conference on data mining. Society for Industrial and Applied Mathematics, 2006. :Hyper-parameter Rank- matrices The space where each point is a matrix Only one value is missing for simplicity. In practice, many missing values are possible. :Missing value
  83. NMF with missing values Low-rank matrix completion, EM-WNMF[4] M step

    [4] Zhang, Sheng, et al. "Learning from incomplete ratings using non-negative matrix factorization." Proceedings of the 2006 SIAM international conference on data mining. Society for Industrial and Applied Mathematics, 2006. :Hyper-parameter Rank- matrices The space where each point is a matrix Only one value is missing for simplicity. In practice, many missing values are possible. :Missing value Do NMF on to obtain rank- matrix . Initialize missing values and obtain .
  84. NMF with missing values Low-rank matrix completion, EM-WNMF[4] M step

    E step Overwrite observations to obtain . :Observed indices [4] Zhang, Sheng, et al. "Learning from incomplete ratings using non-negative matrix factorization." Proceedings of the 2006 SIAM international conference on data mining. Society for Industrial and Applied Mathematics, 2006. :Hyper-parameter Rank- matrices The space where each point is a matrix Only one value is missing for simplicity. In practice, many missing values are possible. :Missing value Do NMF on to obtain rank- matrix . Initialize missing values and obtain .
  85. NMF with missing values [4] Zhang, Sheng, et al. "Learning

    from incomplete ratings using non-negative matrix factorization." Proceedings of the 2006 SIAM international conference on data mining. Society for Industrial and Applied Mathematics, 2006. Low-rank matrix completion, EM-WNMF[4] M step E step Repeat :Hyper-parameter Rank- matrices The space where each point is a matrix Only one value is missing for simplicity. In practice, many missing values are possible. :Observed indices :Missing value Overwrite observations to obtain . Do NMF on to obtain rank- matrix . Initialize missing values and obtain .
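A minimal sketch of this EM-style completion loop (using scikit-learn's NMF as the M step; mask marks observed entries; all names and the initialization-by-mean choice are mine, and the weighting details of the original EM-WNMF [4] are omitted):

```python
import numpy as np
from sklearn.decomposition import NMF

def em_wnmf(A_obs, mask, k, n_outer=50, seed=0):
    """Low-rank completion: the E step overwrites observed entries, the M step re-fits a rank-k NMF."""
    A_hat = A_obs.copy()
    A_hat[~mask] = A_obs[mask].mean()          # initialize missing entries, e.g., with the observed mean
    for _ in range(n_outer):
        # M step: rank-k NMF of the current completed matrix
        model = NMF(n_components=k, init="random", random_state=seed, max_iter=200)
        W = model.fit_transform(A_hat)
        A_low = W @ model.components_
        # E step: keep the low-rank prediction only where values are missing
        A_hat = np.where(mask, A_obs, A_low)
    return A_hat

# Example: a random non-negative matrix with roughly 30% of its entries missing
rng = np.random.default_rng(0)
A_true = rng.random((30, 20))
mask = rng.random(A_true.shape) > 0.3
A_obs = np.where(mask, A_true, 0.0)

A_completed = em_wnmf(A_obs, mask, k=3)
print(np.abs(A_completed - A_true)[~mask].mean())   # mean error on the missing entries
```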
  86. Issues in EM-WNMF The rank needs to be chosen appropriately. The algorithm

    depends on the initial values. Convergence is slow. [4] Zhang, Sheng, et al. "Learning from incomplete ratings using non-negative matrix factorization." Proceedings of the 2006 SIAM international conference on data mining. Society for Industrial and Applied Mathematics, 2006. The space where each point is a matrix :Hyperparameter Rank- matrices Low-rank matrix completion, EM-WNMF[4]
  87. Further topics of matrix factorization ・ Probabilistic Principal Component Analysis

    128 Each element of the matrix is sampled from a probability distribution. We assume a prior distribution. Confidence intervals can be obtained. Often used in recommendation engines. ・Binary matrix factorization, Boolean matrix factorization Decomposition under the algebra with 1×1 = 1 and 1+1 = 1. [18] Hashemi, Soheil, Hokchhay Tann, and Sherief Reda. "BLASYS: Approximate logic synthesis using Boolean matrix factorization." Proceedings of the 55th Annual Design Automation Conference. 2018. ・Autoencoder and Principal Component Analysis The error function of a linear autoencoder with parameter sharing (𝐖, 𝐖⊤) and activation f(x) = x is equivalent to PCA. The width of the middle layer corresponds to the matrix rank.
  88. Further topics of matrix factorization ・Deep matrix factorization [19] Fan,

    Jicong. "Multi-mode deep matrix and tensor factorization." ICLR, 2021. Activation function ・ Double-descent phenomena in NMF 129 ≃ 𝐗 ∈ ℝ𝑛×𝑚 𝐖 ∈ ℝ𝑛×𝑟 𝐇 ∈ ℝ𝑟×𝑚 [20] Kawakami, Y., et.al.,“Investigating Overparameterization for Non-Negative Matrix Factorization in Collaborative Filtering” RecSys 2021, Late-Breaking Results track, 2021. → Larger rank 𝑟 Over fitting
  89. 130 Overview ▪ Introduction: Why do we decompose data? ▪

    A quick review of matrix rank ▪ Singular value decomposition (SVD) and low-rank approximation ▪ Kernel subspace method and its applications (denoising, anomaly detection) ▪ Non-negative matrix factorization ▪ Tensor low-rank decomposition and many-body approximation
  90. Various data are stored in computers as tensors. 131 NASA,

    https://sdo.gsfc.nasa.gov/data/ Microarray data RGB Image Hyperspectral Image EEG Data = [14] M.Mørup, Data Mining and Knowledge Discovery 2011 [16] A.Cichocki, et al. Nonnegative Matrix and Tensor Factorizations 2009 [16] A.Cichocki, et al. Nonnegative Matrix and Tensor Factorizations 2009 Video Time series data, Signals [16] A.Cichocki, et al. Nonnegative Matrix and Tensor Factorizations 2009 [15] Kosmas Dimitropoulos, et al. Transactions on circuits and systems for video technology 2018
  91. Tensors in modern machine learning 132 Relational Learning A B

    C E D F G ? [7] e.g., Graph link prediction by RESCAL Model Express by tensors [8] Deep Learning e.g., Lightweighting Neural Networks with Tensor Decomposition [9] [7] Nickel, Maximilian, Volker Tresp, and Hans-Peter Kriegel. “A three-way model for collective learning on multi-relational data.” ICML2011 [8] T. Bezdan, N. Bačanin Džakula, International Scientific Conference on Information Technology and Data Related Research, 2019 [9] Y. Liu et al., Tensor Computation for data analysis, 2022
  92. Tensor Hierarchy: Vectors of a Tensor are also Tensors Numbers: 𝑎.

    Vectors [First-order tensors]: 𝒂 = (𝑎1 , 𝑎2 , …, 𝑎𝑛 )⊤. Matrices [Second-order tensors]: 𝐗 = (𝑎𝑖𝑗 ), with entries 𝑎11 , …, 𝑎𝑛𝑛 . Third-order tensors: 𝒯. Fourth-order tensors: 𝑈 = (𝒯(1) , 𝒯(2) , ⋯ , 𝒯(𝑛) ), a collection of third-order tensors. 133
  93. Density estimation, pattern extraction, data mining, denoising, data compression, anomaly

    detection… Extracting features from tensors. 134 Who Shop 𝒫 ≃ 𝒫̄ , minimize ‖𝒫 − 𝒫̄ ‖𝐹 We want to extract patterns or features from tensor-formatted data ▪ Tensor decomposition extracts features from data. mode-2 mode-1
  94. Difficulties in tensor factorization ▪ We need to decide both

    a structure and a rank. CP decomp. 𝒫 ≃ 𝒫̄ Tucker decomp. Decomp. with tensor networks minimize ‖𝒫 − 𝒫̄ ‖𝐹 Tensor Train decomposition Tensor ring decomposition Tensor networks: each node is a tensor, each edge is a mode. [11] Wang, Wenqi, et al. CVPR. 2018. [12] Zheng, Yu-Bang, et al. AAAI 2021. https://tensornetwork.org [13] A. Cichocki, et al., Tensor Networks for Dimensionality Reduction and Large-scale Optimization, 2016
  95. Difficulties in tensor factorization 136 ▪ We need to decide

    both a structure and a rank. CP decomp. 𝒫 ≃ 𝒫̄ ▪ Optimization is difficult. Tensor decomposition is often ill-posed or NP-hard. ・Finding the best rank-1 CP decomposition minimizing the Frobenius norm is NP-hard. Tucker decomp. Decomp. with tensor networks A convex, stable, and intuitively designable tensor decomposition is desired. minimize ‖𝒫 − 𝒫̄ ‖𝐹 is non-convex. The solution may be indeterminate. The objective function is typically non-convex. ・Dependence on initial values: no guarantee of reaching the best solution. Tensor Train decomposition Tensor ring decomposition
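For concreteness, CP and Tucker decompositions can be computed with, e.g., the TensorLy library (not referenced in the lecture; assuming a recent TensorLy version, and the rank choices below are arbitrary):

```python
import numpy as np
import tensorly as tl
from tensorly.decomposition import parafac, tucker

rng = np.random.default_rng(0)
T = tl.tensor(rng.random((10, 12, 8)))

# CP decomposition: sum of R rank-1 tensors
cp = parafac(T, rank=3)
T_cp = tl.cp_to_tensor(cp)

# Tucker decomposition: small core tensor multiplied by a factor matrix along each mode
tuck = tucker(T, rank=[3, 4, 2])
T_tucker = tl.tucker_to_tensor(tuck)

print(tl.norm(T - T_cp) / tl.norm(T))       # relative reconstruction errors
print(tl.norm(T - T_tucker) / tl.norm(T))
```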
  96. Many-body approximation for non-negative tensors[10] 139 Control relation between mode-k

    and mode-l. Control relation among mode-j, -k and -l. Natural parameter of exponential distribution family. Energy function
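Schematically, and in my notation following the general form of the energy-based model in [10] (see the paper for the precise definition), a non-negative third-order tensor is parameterized as

```latex
P_{ijk} \;\propto\; \exp\!\Big(
\underbrace{\theta^{(1)}_{i} + \theta^{(2)}_{j} + \theta^{(3)}_{k}}_{\text{one-body terms}}
\;+\;
\underbrace{\theta^{(12)}_{ij} + \theta^{(13)}_{ik} + \theta^{(23)}_{jk}}_{\text{two-body interactions}}
\;+\;
\underbrace{\theta^{(123)}_{ijk}}_{\text{three-body interaction}}
\Big),
```

where the θ are the natural parameters of the exponential family mentioned on this slide. The m-body approximation keeps only the interactions among at most m modes; keeping only the one-body terms recovers the rank-1 (mean-field) approximation.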
  97. Many-body approximation for non-negative tensors[10] 141 One-body approx. Rank-1 approximation

    (mean-field approximation) Two-body approx. Control relation between mode-k and mode-l. Larger Capability Two-body Interaction
  98. Many-body approximation for non-negative tensors[10] 142 One-body approx. Rank-1 approximation

    (mean-field approximation) Two-body approx. Two-body Interaction Three-body approx. Larger Capability The global optimal solution minimizing KL divergence from can be obtained by a convex optimization. Intuitive modeling focusing on interactions between modes Control relation among mode-j, -k and -l. Control relation between mode-k and mode-l. Three-body Interaction
  99. We regard a normalized tensor as a discrete joint probability

    distribution whose sample space is an index set. Coordinate transformation. Theoretical idea behind many-body approx. 143 Each index is a discrete random variable. 𝜽-representation: representation with the natural parameters of an exponential family. Geometry of 𝜽-space: the low-rank space is not flat and is difficult to optimize over; a flat subspace is not. We construct a flat subspace by focusing on interactions among tensor modes. Describing tensor factorization in the 𝜽-coordinate system makes it a convex problem. We use information geometry to formulate the factorization as a convex problem.
  100. Example tensor reconstruction by many-body approx. Larger capability Reconstruction for

    40×40×3×10 tensor. (width, height, colors, # images) Color depends on image index. Shape of each image. Color is uniform within each image. Intuitive model design that captures the relationship between modes. Color depends on pixel. Three-body Approx. 152
  101. Color image is decomposed into shape × color × ≃

    = Shape of each image Color of each image
  102. Data completion by many-body approx. and the em-algorithm m-step 𝐏

    ← low_rank_approx(𝐏) e-step 𝐏𝜏 ← 𝐓𝜏 𝜏 : observed indices Low-rank Tensor Completion 𝑒-projection 𝑚-projection Not flat. It causes non-uniqueness of the projection. Which structure is better for decomposition? How to choose the rank? Low-rank space 155
  103. Data completion by many-body approx. and the em-algorithm m-step 𝐏

    ← low_rank_approx(𝐏) e-step 𝐏𝜏 ← 𝐓𝜏 𝜏 : observed indices Low-rank Tensor Completion 𝑒-projection Low-body many_body_approx The low-body space is flat. It ensures uniqueness of the projection. Intuitive modeling is possible! 𝑚-projection 156
  104. Data completion by many-body approx. and the em-algorithm Missing area

    GT Prediction Interactions among correlated modes Reconstructing traffic speed data of a 28×24×12×4 tensor with missing values. (days, hours, min, lanes) 157
  105. Data completion by many-body approx. and the em-algorithm Missing area

    GT Prediction Interactions among correlated modes Interactions among non-correlated modes Fit score: 0.82 Reconstructing traffic speed data of a 28×24×12×4 tensor with missing values. (days, hours, min, lanes) 158
  106. 159 Summary: Many-body approximation for tensors https://arxiv.org/abs/2209.15338 ・More intuitive design

    than rank tuning One-body Approx. Three-body Approx. Two-body Approx. Many-body Approximation Bias (magnetic field) Weight (electron-electron interaction) ・Convex optimization always provides a unique solution
  107. 160 Report assignments 1. Prove the Eckart-Young or Eckart-Young-Mirsky Theorem.

    2. There are many algorithms for Non-negative Matrix Factorization, e.g., HALS-NMF, MU-NMF, etc. Choose one of them and derive the algorithm. You do not have to show the convergence guarantee. There are four assignments below. Choose one of them and write your report. Please indicate which assignment you have chosen. The format is flexible but should fit within 2 sheets of A4 paper. Handwriting is also acceptable. Note: There are many variations of NMF, such as: - NMF optimizing the Frobenius norm - NMF optimizing the KL divergence - NMF optimizing the Frobenius norm with a regularization term. Please indicate which kind of NMF you are discussing.
  108. 161 Report assignments 3. Visit Google Scholar (https://scholar.google.com/) and search

    for a paper related to SVD, NMF, or tensor factorization. Summarize the motivation, method, and novelty of the paper. Note: Please clearly indicate the paper title you chose. Example papers: - Lee, Daniel D., and H. Sebastian Seung. "Learning the parts of objects by non-negative matrix factorization." nature 401.6755 (1999): 788-791. - Takeuchi, Koh, et al. "Non-negative multiple matrix factorization." Twenty-third international joint conference on artificial intelligence. 2013. - Lee, Hyekyoung, and Seungjin Choi. "Group nonnegative matrix factorization for EEG classification." Artificial Intelligence and Statistics. PMLR, 2009. - Ding, Chris, Xiaofeng He, and Horst D. Simon. "On the equivalence of nonnegative matrix factorization and spectral clustering." Proceedings of the 2005 SIAM international conference on data mining. Society for Industrial and Applied Mathematics, 2005. - Lee, Daniel, and H. Sebastian Seung. "Algorithms for non-negative matrix factorization." Advances in neural information processing systems 13 (2000). - Zhang, Jun, Yong Yan, and Martin Lades. "Face recognition: eigenface, elastic matching, and neural nets." Proceedings of the IEEE 85.9 (1997): 1423-1435. Recommended keywords: - Non-negative matrix factorization, SVD, Tensor decomposition, Eigen face, weighted NMF, low-rank completion
  109. 162 Report assignments 4. Follow the steps below using Python,

    Matlab, R, C/C++, Fortran, or Julia: (1) Convert a grayscale image into a two-dimensional array (matrix). (2) Perform the rank-r approximation on it using SVD. (3) Plot a figure with rank on the horizontal axis and reconstruction error on the vertical axis. Observe how the reconstructed image changes as the rank is varied. (4) Plot a figure of the singular value distribution obtained by SVD for the image. The vertical axis of the plot is the nth largest singular value, and the horizontal axis is n. Rank Reconstruction Error n n-th singular value Note: Please paste a part of the source code you have written into the report. You do not have to paste all source code.
  110. 163 Overview ▪ Introduction: Why do we decompose data? ▪

    A quick review of matrix rank ▪ Singular value decomposition (SVD) and low-rank approximation ▪ Kernel subspace method and its applications (denoising, anomaly detection) ▪ Non-negative matrix factorization ▪ Tensor low-rank decomposition and many-body approximation
  111. For further study ▪ Textbooks for Matrix and Tensor Decomposition

    ▪ A lecture series by Dr. Steve Brunton ・Speeding up SVD ・SVD for linear regression ・SVD for face recognition ・How to choose ranks ▪ Tensor libraries [iTensor] … 166
  112. 167 References [1] Mika, Sebastian, et al. "Kernel PCA and

    de-noising in feature spaces." Advances in neural information processing systems 11 (1998). [2] Hoffmann, Heiko. "Kernel PCA for novelty detection." Pattern recognition 40.3 (2007): 863-874. [3] Lee, Daniel D., and H. Sebastian Seung. "Learning the parts of objects by non-negative matrix factorization." Nature 401.6755 (1999): 788-791. [4] Zhang, Sheng, et al. "Learning from incomplete ratings using non-negative matrix factorization." Proceedings of the 2006 SIAM international conference on data mining. Society for Industrial and Applied Mathematics, 2006. [5] Lee, Hyekyoung, and Seungjin Choi. "Group nonnegative matrix factorization for EEG classification." Artificial Intelligence and Statistics. PMLR, 2009. [6] Takeuchi, Koh, et al. "Non-negative multiple matrix factorization." Twenty-third international joint conference on artificial intelligence. 2013. [7] Nickel, Maximilian, Volker Tresp, and Hans-Peter Kriegel. "A three-way model for collective learning on multi-relational data." ICML 2011. [8] T. Bezdan, N. Bačanin Džakula, International Scientific Conference on Information Technology and Data Related Research, 2019. [9] Y. Liu et al., "Tensor Computation for data analysis", 2022. [10] Ghalamkari, Kazu, Mahito Sugiyama, and Yoshinobu Kawahara. "Many-body approximation for non-negative tensors." Advances in Neural Information Processing Systems 36 (2024). [11] Wang, Wenqi, et al. "Wide compression: Tensor ring nets." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2018. [12] Zheng, Yu-Bang, et al. "Fully-connected tensor network decomposition and its application to higher-order tensor completion." Proceedings of the AAAI conference on artificial intelligence. Vol. 35. No. 12. 2021. [13] Cichocki, Andrzej, et al. "Tensor networks for dimensionality reduction and large-scale optimization: Part 1 low-rank tensor decompositions." Foundations and Trends® in Machine Learning 9.4-5 (2016): 249-429.
  113. 168 References [14] Mørup, Morten. "Applications of tensor (multiway array)

    factorizations and decompositions in data mining." Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery 1.1 (2011): 24-40. [15] Dimitropoulos, Kosmas, et al. "Classification of multidimensional time-evolving data using histograms of grassmannian points." IEEE Transactions on Circuits and Systems for Video Technology 28.4 (2016): 892-905. [16] Andrzej Cichocki, Rafal Zdunek, Anh Huy Phan, and Shun-ichi Amari. Non-negative matrix and tensor factorizations: applications to exploratory multi-way data analysis and blind source separation. John Wiley & Sons, 2009. [17] Hamm, Jihun, and Daniel D. Lee. "Grassmann discriminant analysis: a unifying view on subspace-based learning." ICML. 2008. [18] Hashemi, Soheil, Hokchhay Tann, and Sherief Reda. "BLASYS: Approximate logic synthesis using Boolean matrix factorization." Proceedings of the 55th Annual Design Automation Conference. 2018. [19] Fan, Jicong. "Multi-mode deep matrix and tensor factorization." ICLR, 2021. [20] Kawakami, Y., et al., "Investigating Overparameterization for Non-Negative Matrix Factorization in Collaborative Filtering." RecSys 2021, Late-Breaking Results track, 2021. Acknowledgements I would like to express my gratitude to Dr. Profir-Petru Pârțachi from the National Institute of Informatics for his valuable comments on the structure and content of these lecture slides. Used datasets MNIST https://yann.lecun.com/exdb/mnist/ COIL100 https://www1.cs.columbia.edu/CAVE/software/softlib/coil-100.php IRIS https://archive.ics.uci.edu/dataset/53/iris Yale Face Database http://cvc.cs.yale.edu/cvc/projects/yalefaces/yalefaces.html