
Matrix and Tensor Factorization for Machine Learning

You can download the slides here.

Matrices and tensors, or multidimensional arrays, are highly versatile data structures. They can store data in a variety of formats: images, video, tables, and sensor data. By decomposing such matrices and tensors, we can extract insights from the data: patterns, features, etc. In this lecture, I introduce singular value decomposition (SVD) for matrices, non-negative matrix factorization (NMF) for non-negative matrices, and CP and Tucker decomposition for tensors. The objective is not only to understand the various decomposition methods individually but also to acquire the know-how to select the appropriate decomposition method according to the situation and the constraints of real-world scenarios, focusing on the properties of each decomposition. As downstream applications of the methods, the lecture covers subspace methods (CLAFIC) for classification tasks and EM-based methods for imputing missing values in the input data. In the last segment of the lecture, I cover some of the difficulties of tensor decomposition (TD). TD suffers from difficulties that do not occur in SVD, such as ill-posedness and NP-hardness in optimization, and I will discuss recent research trends for avoiding these difficulties.

Kazu Ghalamkari

July 26, 2024

Transcript

  1. Matrix and Tensor Factorization for Machine Learning Special Topics in

    Mechano-Informatics Ⅱ@ Tokyo University, 2024 Kazu Ghalamkari RIKEN AIP @KazuGhalamkari
  2. 3 Announcements Please click the raise hand button if

    you can see the slides and hear me clearly. Please refrain from redistributing the slides. Feel free to ask any questions using the chat window during the lecture. 🗎 The assignment is available in the final section. Some parts of the slides will be skipped due to time limitations.
  3. 4 Overview ▪ Introduction: Why do we decompose data? ▪

    A quick review of matrix rank ▪ Singular value decomposition (SVD) and low-rank approximation ▪ Kernel subspace method and its applications (denoising, anomaly detection) ▪ Non-negative matrix factorization ▪ Tensor low-rank decomposition and many-body approximation
  4. Diverse real-world data ▪ Purchase history ▪ Tabular data Sepal

    Length [cm] Sepal Width [cm] Petal Length [cm] Petal Width [cm] Species 5.1 3.5 1.4 0.2 setosa 7 3.2 4.7 1.4 versicolor 6.4 3.2 4.5 1.5 versicolor 4.7 3.2 1.3 0.2 setosa 4.6 3.1 1.5 0.2 setosa 6.5 2.8 4.6 1.5 versicolor 6.6 2.9 4.6 1.3 versicolor 4.9 3 1.4 0.2 setosa 5.2 2.7 3.9 1.4 versicolor ▪ Grayscale image Decomposing the data as a matrix is beneficial ▪ Spectrum information Image from https://www.mathworks.com/help/images/image-types-in-the-toolbox_ja_JP.html Image from https://sigview.com/help/Time-FFTSpectrogram.html Image from Mithy, S. A., et al. "Classification of Iris Flower Dataset using Different Algorithms." Int. J. Sci. Res. In (2022).
  5. What is a good decomposition? ・Is the decomposed representation interpretable?

    ・Is the decomposition scalable to large data? ・Is the decomposed representation unique? ・Can the decomposition be done even with missing values? There are many decomposition methods; choose the appropriate one by considering such real-world constraints.
  6. 11 Overview ▪ Introduction: Why do we decompose data? ▪

    A quick review of matrix rank ▪ Singular value decomposition (SVD) and low-rank approximation ▪ Kernel subspace method and its applications (denoising, anomaly detection) ▪ Non-negative matrix factorization ▪ Tensor low-rank decomposition and many-body approximation
  7. Rank in linear algebra Each column is a constant multiple

    of vector . We call such a matrix a rank-1 matrix. 6-dimensional vector space 12
  8. Rank in linear algebra Each row is a constant multiple

    of vector . We call such a matrix a rank-1 matrix. 6-dimensional vector space 13
  9. Rank in linear algebra Each column of this matrix is

    written as a linear combination of the basis . A rank-2 matrix is a matrix whose columns are linear combinations of two linearly independent vectors. Each column vector lies in a 2-dimensional plane in the 6-dimensional vector space. Plane spanned by 6-dimensional vector space 15
  10. Rank in linear algebra When the column vector of A

    can be written as a linear combination of 𝑟 linearly independent vectors (a basis), the smallest such integer 𝑟 is the rank of the matrix A. A matrix whose rank is 𝑟 is called a rank-𝑟 matrix. Def. 16
  11. Properties of the matrix rank Basis A is full rank

    when . A is rank deficient if . Rank Property 18
  12. 21 Summary: matrix rank Basis Rank Properties When the column

    vector of A can be written as a linear combination of 𝑟 linearly independent vectors (a basis), the rank of matrix A is the smallest such integer 𝑟. Def.
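A quick numerical check of this definition, assuming only NumPy (the vectors and coefficients below are arbitrary): a matrix whose columns are all linear combinations of two linearly independent vectors has rank 2.

```python
import numpy as np

# Two linearly independent 6-dimensional basis vectors
u1 = np.array([1., 0., 2., 0., 1., 3.])
u2 = np.array([0., 1., 1., 2., 0., 1.])

# Each column of A is a linear combination of u1 and u2,
# so A is (at most) a rank-2 matrix.
A = np.outer(u1, [1., 2., 0., 1.]) + np.outer(u2, [3., 0., 1., 2.])

print(A.shape)                   # (6, 4)
print(np.linalg.matrix_rank(A))  # 2
```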
  13. 22 Overview ▪ Introduction: Why do we decompose data? ▪

    A quick review of matrix rank ▪ Singular value decomposition (SVD) and low-rank approximation ▪ Kernel subspace method and its applications (denoising, anomaly detection) ▪ Non-negative matrix factorization ▪ Tensor low-rank decomposition and many-body approximation
  14. Singular value decomposition, SVD So far, we assumed that

    the basis could be any set of linearly independent vectors. SVD imposes orthonormality on the basis. 23
  15. Singular value decomposition, SVD Most significant rank-1 factor Least significant

    rank-1 factor Important term Unimportant term Imposes orthonormality on : weight (importance) of the following rank-1 matrix Singular value
  16. Singular value decomposition, SVD ▪ Any matrix A can be

    decomposed into the product of U, V, and a diagonal matrix Σ It can be a complex, singular, or rectangular matrix 26
  17. Singular value decomposition, SVD ▪ Any matrix A can be

    decomposed into the product of U, V, and a diagonal matrix Σ 27 It can be a complex, singular, or rectangular matrix
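In symbols, the (compact) SVD of a rank-r matrix A can be written as follows; this is the standard statement, using the same U, Σ, and V as on the slides:

```latex
A = U \Sigma V^{\top} = \sum_{i=1}^{r} \sigma_i \,\boldsymbol{u}_i \boldsymbol{v}_i^{\top},
\qquad U^{\top}U = I, \quad V^{\top}V = I, \quad
\Sigma = \operatorname{diag}(\sigma_1,\dots,\sigma_r), \quad \sigma_1 \ge \dots \ge \sigma_r > 0 .
```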
  18. Singular value decomposition, SVD Rank and Singular Value 0 Unnecessary

    The rank of matrix A is the number of non-zero singular values of A ▪ A is a rank- matrix 29
  19. Low-rank approximation by SVD They are not as important, so

    let's set them to 0 and ignore them. Important term Unimportant term ≒ 0 ▪ Decomposing rank-𝑟 matrix 𝐀 into orthogonal 𝐔 and 𝐕 and diagonal matrix ∑. Singular value : weight of the following rank-1 matrix 30
  20. Low-rank approximation by SVD ≒ 0 ▪ Decomposing rank-𝑟 matrix

    𝐀 into orthogonal 𝐔 and 𝐕 and diagonal matrix ∑. Approximating a matrix by a smaller rank matrix Low-rank approximation They are not as important, so let's set them to 0 and ignore them. Singular value : weight of the following rank-1 matrix 31
  21. The BEST Low-rank approximation by SVD ▪ SVD provides the

    best low-rank approximation minimizing Frobenius norm. is the best rank-k approximation for 𝐀, which minimizes Frobenius norm. Evaluate how close they are via the Frobenius norm. Frobenius norm Eckart-Young Theorem (1936) 32
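Written out (a standard formulation of the theorem, with A_k denoting the SVD truncated to the k largest singular values):

```latex
A_k = \sum_{i=1}^{k} \sigma_i \,\boldsymbol{u}_i \boldsymbol{v}_i^{\top}
\quad\text{achieves}\quad
\|A - A_k\|_F = \min_{\operatorname{rank}(B)\le k} \|A - B\|_F
= \sqrt{\sigma_{k+1}^{2} + \dots + \sigma_{r}^{2}} .
```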
  22. The BEST Low-rank approximation by SVD ▪ SVD is the

    best low-rank approximation minimizing any unitarily invariant norm. is the best rank-r matrix for 𝐀, which minimizes any unitarily invariant norm. Evaluate how similar they are via a unitarily invariant norm. Eckart-Young-Mirsky Theorem (1960) * A unitarily invariant norm ‖⋅‖∗ satisfies ‖𝐏‖∗ = ‖𝐗𝐏𝐘‖∗ for any matrix P and any unitary matrices X and Y. 33
  23. Low-rank approximation of matrices saves memory. Required memory before the

    approximation Required memory sizes after approximation Low-rank approximation reduces required memory storage Example: Assuming is sufficiently smaller than . 34
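As a concrete instance of the storage argument (the numbers match the 2000×1500 image example used later; the snippet is illustrative only): storing A directly needs nm values, while a rank-k factorization needs about k(n+m).

```python
n, m, k = 2000, 1500, 20

full_storage = n * m              # store every entry of the n×m matrix A
low_rank_storage = k * (n + m)    # store k left vectors (length n) and k right vectors (length m)

print(full_storage)                           # 3000000
print(low_rank_storage)                       # 70000
print(100 * low_rank_storage / full_storage)  # ≈ 2.33 (% of the original storage)
```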
  24. Singular values and eigenvalues Put into The 𝑖-th singular value

    𝜎𝑖 of matrix 𝐀 is the square root of the 𝑖-th eigenvalue λ𝑖 of matrix 𝐀𝐀⊤ 35
  25. Singular values and eigenvalues Each column vector is orthonormal Put

    into The 𝑖-th singular value 𝜎𝑖 of matrix 𝐀 is the square root of the 𝑖-th eigenvalue λ𝑖 of matrix 𝐀𝐀⊤ 36
  26. Singular values and eigenvalues Each column vector is orthonormal Put

    into The 𝑖-th singular value 𝜎𝑖 of matrix 𝐀 is the square root of the 𝑖-th eigenvalue λ𝑖 of matrix 𝐀𝐀⊤ 37
  27. Singular values and eigenvalues Each column vector is orthonormal Put

    into The 𝑖-th singular value 𝜎𝑖 of matrix 𝐀 is the square root of the 𝑖-th eigenvalue λ𝑖 of matrix 𝐀𝐀⊤ 38
  28. The 𝑖-th singular value 𝜎𝑖 of matrix 𝐀 is the

    square root of the 𝑖-th eigenvalue λ𝑖 of matrix 𝐀𝐀⊤ Eigenvalue of Eigenvector of Singular values and eigenvalues Each column vector is orthonormal Put into 𝑙 -th column of U 𝑙 -th column of V (𝑙, 𝑙)-entry of ∑
  29. Singular values and eigenvalues Eigenvector of Put into Eigenvalue of

    Eigenvector of The 𝑖-th singular value 𝜎𝑖 of matrix 𝐀 is the square root of the 𝑖-th eigenvalue λ𝑖 of matrix 𝐀𝐀⊤ Put into 𝑙 -th column of U 𝑙 -th column of V (𝑙, 𝑙)-entry of ∑
  30. Singular values and eigenvalues (Left singular vector) The 𝑖-th singular

    value 𝜎𝑖 of matrix 𝐀 is the square root of the 𝑖-th eigenvalue λ𝑖 of matrix 𝐀𝐀⊤ SVD of A is an eigenvalue decomposition of 𝐀𝐀⊤ and 𝐀⊤𝐀 The rank of matrix 𝐀 equals the number of non-zero singular values of 𝐀. Each column vector of 𝐕 is an eigenvector of 𝐀⊤𝐀. Each column vector of 𝐔 is an eigenvector of 𝐀𝐀⊤. (Right singular vector)
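A quick numerical check of this relationship on a random matrix (NumPy assumed; a 5×3 matrix has at most 3 non-zero singular values):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((5, 3))

# Singular values of A (returned in descending order)
sigma = np.linalg.svd(A, compute_uv=False)

# Eigenvalues of A A^T; eigvalsh returns them in ascending order, so reverse
lam = np.linalg.eigvalsh(A @ A.T)[::-1]

print(sigma)             # singular values of A
print(np.sqrt(lam[:3]))  # square roots of the leading eigenvalues: the same values
```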
  31. Example of low-rank approximation by SVD in Python Smaller rank

    𝑘 worsens the approximation performance. 46
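The Python code shown on this slide is not reproduced in the transcript; the following is a minimal sketch of what such an example typically looks like (NumPy only, random data, my own function name):

```python
import numpy as np

def low_rank_approx(A, k):
    """Best rank-k approximation of A in the Frobenius norm (Eckart-Young)."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return (U[:, :k] * s[:k]) @ Vt[:k, :]

rng = np.random.default_rng(0)
A = rng.standard_normal((100, 80))

for k in (1, 5, 20, 50):
    A_k = low_rank_approx(A, k)
    err = np.linalg.norm(A - A_k, "fro") / np.linalg.norm(A, "fro")
    print(f"rank {k:2d}: relative Frobenius error = {err:.3f}")
```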
  32. Image reconstruction by SVD How many ranks are needed for

    reconstruction? 2000×1500 (2000+1500)×5 (2000+1500)×20 (2000+1500)×100 𝑘=100, 11.67% storage 𝑘=20, 2.33% storage 𝑘=5, 0.57% storage Image from Steven L. Brunton, J. Nathan Kutz, “Data-Driven Science and Engineering: Machine Learning, Dynamical Systems, and Control” 47
  33. Hyperparameter tuning in SVD Accurate Rough Increased memory Slow Memory

    saving High speed (2000+1500)×5 (2000+1500)×20 (2000+1500)×100 Selecting the appropriate rank is a trade-off problem. It requires repeated trial and error. A typical hyperparameter tuning problem 𝑘=100, 11.67% storage 𝑘=20, 2.33% storage 𝑘=5, 0.57% storage Image from Steven L. Brunton, J. Nathan Kutz, “Data-Driven Science and Engineering: Machine Learning, Dynamical Systems, and Control” 48
  34. Data are represented by a low-rank structure plus noise. Most

    of the data in this world has a low-rank structure. Real data Low-rank matrix Noise Low-rank approximation is good enough for many datasets. ▪ Low-rank approximation can approximate real-world data well. 49
  35. 50 Singular values Summary: Low-rank approximation and SVD ▪ SVD

    can be applied to any matrix. ▪ SVD finds the best low-rank matrix minimizing the Frobenius norm. are orthonormal. SVD provides the best rank-k approximation that minimizes the Frobenius norm. Eckart − Young Theorem ▪ Low-rank approximation reduces data memory requirements.
  36. 52 Overview ▪ Introduction: Why do we decompose data? ▪

    A quick review of matrix rank ▪ Singular value decomposition (SVD) and low-rank approximation ▪ Kernel subspace method and its applications (denoising, anomaly detection) ▪ Non-negative matrix factorization ▪ Tensor low-rank decomposition and many-body approximation
  37. Classification of Iris Data Sepal Petal Sepal Petal ▪ Iris

    Dataset[1] [1] Fisher, Ronald A. "The use of multiple measurements in taxonomic problems." Annals of eugenics 7.2 (1936): 179-188. We estimate the species of iris data-points based on their sepal and petal lengths. Sepal Length [cm] Sepal Width [cm] Petal Length [cm] Petal Width [cm] Species 5.1 3.5 1.4 0.2 setosa 7 3.2 4.7 1.4 versicolor 6.4 3.2 4.5 1.5 versicolor 4.7 3.2 1.3 0.2 setosa 4.6 3.1 1.5 0.2 setosa 6.5 2.8 4.6 1.5 versicolor 6.3 3.3 4.7 1.6 versicolor 6.6 2.9 4.6 1.3 versicolor 4.9 3 1.4 0.2 setosa 5.2 2.7 3.9 1.4 versicolor 5.9 3 4.2 1.5 ??? 5.6 3 4.5 1.5 ??? 4.7 3.2 1.6 0.2 ??? … … Image from Mithy, S. A., et al. "Classification of Iris Flower Dataset using Different Algorithms." Int. J. Sci. Res. In (2022). 53
  38. Classification by the CLAFIC method Samples of class A Samples

    of class B Estimate class of each sample Classification Training data setosa versicolor Sepal Length [cm] Sepal Width [cm] Petal Length [cm] Petal Width [cm] Species 5.1 3.5 1.4 0.2 setosa 7 3.2 4.7 1.4 versicolor 6.4 3.2 4.5 1.5 versicolor 4.7 3.2 1.3 0.2 setosa 4.6 3.1 1.5 0.2 setosa 6.5 2.8 4.6 1.5 versicolor 6.3 3.3 4.7 1.6 versicolor 6.6 2.9 4.6 1.3 versicolor 4.9 3 1.4 0.2 setosa 5.2 2.7 3.9 1.4 versicolor 5.9 3 4.2 1.5 ??? 5.6 3 4.5 1.5 ??? 4.7 3.2 1.6 0.2 ??? … … 54
  39. Classification by the CLAFIC method Sepal Length [cm] Petal Width

    [cm] Sepal Width [cm] Samples of class A Samples of class B Estimate class of each sample Classification Training data setosa versicolor 55
  40. Classification by the CLAFIC method Samples of class A Samples

    of class B Estimate class of each sample Classification Training data setosa versicolor Sepal Length [cm] Petal Width [cm] Sepal Width [cm] 56
  41. Classification by the CLAFIC method Samples of class A Samples

    of class B Estimate class of each sample Classification Training data setosa versicolor 57
  42. Classification by the CLAFIC method Samples of class A Samples

    of class B Estimate class of each sample Classification Training data setosa versicolor 58
  43. Classification by the CLAFIC method Samples of class A Samples

    of class B Estimate class of each sample Classification Training data setosa versicolor 59
  44. Classification by the CLAFIC method Samples of class A Samples

    of class B Estimate class of each sample Classification Training data setosa versicolor 60
  45. Classification by the CLAFIC method (We need to normalize the data

    so that both subspaces pass through the origin) Samples of class A Samples of class B Estimate class of each sample Classification Training data setosa versicolor
  46. Classification by the CLAFIC method Orthogonal basis Orthogonal basis :hyper-parameter

    Samples of class A Samples of class B Estimate class of each sample Classification Training data setosa versicolor (We need to normalize the data so that both subspaces pass through the origin)
  47. Classification by the CLAFIC method ▪ Inference Find the closest

    subspace. ▪ Learning Perform SVD on the samples of each class. Samples of class A Samples of class B Estimate class of each sample Classification Training data setosa versicolor (We need to normalize the data so that both subspaces pass through the origin) If 𝑑𝑎 < 𝑑𝑏 , 𝑐 belongs to class A. If 𝑑𝑎 > 𝑑𝑏 , 𝑐 belongs to class B.
  48. 1-Nearest Neighbor and CLAFIC 1-NN Classification by the nearest training data

    Classification by the nearest subspace CLAFIC If 𝑑𝑎 < 𝑑𝑏 , 𝑐 belongs to class A. If 𝑑𝑎 > 𝑑𝑏 , 𝑐 belongs to class B. 64
  49. 65 Summary: Classification by subspace method ▪ Learning SVD for

    each class. ▪ Inference Find the closest subspace. (We need to normalize the data so that both subspaces pass through the origin) If 𝑑𝑎 < 𝑑𝑏 , 𝑐 belongs to class A. If 𝑑𝑎 > 𝑑𝑏 , 𝑐 belongs to class B.
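A minimal sketch of CLAFIC as summarized above (NumPy only; the subspace dimension r is the hyperparameter mentioned on the slides; all function names are mine, and the samples are assumed to be already normalized so that the subspaces pass through the origin):

```python
import numpy as np

def fit_subspace(X, r):
    """Orthonormal basis (d, r) of the class subspace via SVD of the class samples (rows)."""
    _, _, Vt = np.linalg.svd(X, full_matrices=False)
    return Vt[:r].T                 # columns span the class subspace

def dist_to_subspace(c, U):
    """Distance from a test sample c to the subspace spanned by the columns of U."""
    proj = U @ (U.T @ c)
    return np.linalg.norm(c - proj)

def classify(c, U_a, U_b):
    return "A" if dist_to_subspace(c, U_a) < dist_to_subspace(c, U_b) else "B"

# X_a, X_b: training samples of classes A and B (rows = samples), c: a test sample
# U_a, U_b = fit_subspace(X_a, r=2), fit_subspace(X_b, r=2)
# print(classify(c, U_a, U_b))
```

Learning is one SVD per class; inference only requires projecting the test sample onto each class subspace and comparing distances.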
  50. Datasets that cannot be linearly separated ▪ There is no

    plane (straight line) that reduces the dimension properly. ▪ CLAFIC is ineffective for such a dataset. Two-dimensional data space Project the data to a higher dimensional space, making the data linearly separable. 66
  51. Kernel CLAFIC, Learning Projection onto the higher dimensional space makes

    the data linearly separable. 𝐿-dimensional data space 𝑀(≫ 𝐿)-dimensional space 69
  52. Kernel CLAFIC, Learning Perform SVD for each class in high-dimensional

    space . 𝐿-dimensional data space 𝑀(≫ 𝐿)-dimensional space 70
  53. Kernel CLAFIC, Inference Project the test data 𝒄 to the

    high-dimensional space and predict the class by the closest subspace. 77 𝐿-dimensional data space 𝑀(≫ 𝐿)-dimensional space If 𝑑𝑎 < 𝑑𝑏 , 𝑐 belongs to class A. If 𝑑𝑎 > 𝑑𝑏 , 𝑐 belongs to class B.
  54. No need to seek explicitly. Kernel trick 78 Large The

    distance can be written in terms of the eigenvalues and eigenvectors of the kernel matrix: 𝑀(≫ 𝐿)-dimensional space
  55. No need to seek explicitly. Kernel trick 79 Large The

    distance can be written in terms of the eigenvalues and eigenvectors of the kernel matrix: A typical example of an inner product (RBF kernel) We need the similarity between the samples. (Inner product) 𝑀(≫ 𝐿)-dimensional space
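A small sketch of the quantities involved in the kernel trick (RBF kernel, NumPy only; the bandwidth gamma is a hyperparameter I introduce for illustration). The high-dimensional feature map is never computed explicitly; only pairwise similarities are needed.

```python
import numpy as np

def rbf_kernel(X, Y, gamma=1.0):
    """K[i, j] = exp(-gamma * ||x_i - y_j||^2): inner products in the feature space."""
    sq = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * sq)

rng = np.random.default_rng(0)
X = rng.standard_normal((30, 4))   # 30 training samples in L = 4 dimensions

K = rbf_kernel(X, X)               # 30x30 kernel (Gram) matrix
lam, V = np.linalg.eigh(K)         # eigenvalues/eigenvectors of K

# Distances to the kernel subspace are expressed with lam and V only.
print(K.shape, lam[-3:])           # three largest eigenvalues
```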
  56. Denoising by Kernel PCA[1] [1] Mika, Sebastian, et al. "Kernel

    PCA and de-noising in feature spaces." Advances in neural information processing systems 11 (1998). ▪ Denoising : Remove noise from noisy data … 28 28 … 28×28=784
  57. Denoising by Kernel PCA[1] [1] Mika, Sebastian, et al. "Kernel

    PCA and de-noising in feature spaces." Advances in neural information processing systems 11 (1998). (784) 𝐿-dimensional data space
  58. Denoising by Kernel PCA[1] Project the noisy data into a

    low-dimensional space in . 𝐿-dimensional data space 𝑀(≫ 𝐿)-dimensional space
  59. The denoised image is obtained by the inverse mapping .

    Denoising by Kernel PCA[1] 𝐿-dimensional data space 𝑀(≫ 𝐿)-dimensional space
  60. Anomaly detection by Kernel PCA [2] ▪ Detect anomalous samples

    automatically. Normal Normal Anomaly Normal Normal Normal Normal [2] Hoffmann, Heiko. "Kernel PCA for novelty detection." Pattern recognition 40.3 (2007): 863-874. Anomaly 89
  61. Anomaly detection by Kernel PCA [2] [2] Hoffmann, Heiko. "Kernel

    PCA for novelty detection." Pattern recognition 40.3 (2007): 863-874. ▪ Normal samples (Training data) ▪ Normal and anomalous samples (Test data) Anomaly Normal Normal Normal Normal Normal Normal Anomaly Determine whether each sample is normal or anomalous. 90
  62. Anomaly detection by Kernel PCA [2] (784) Perform the Kernel

    PCA with normal samples only. 93 𝐿-dimensional data space 𝑀(≫ 𝐿)-dimensional space
  63. Anomaly detection by Kernel PCA [2] Next, we project the

    test data to . 94 (784) 𝐿-dimensional data space 𝑀(≫ 𝐿)-dimensional space
  64. Anomaly detection by Kernel PCA [2] 95 (784) 𝐿-dimensional data

    space Next, we project the test data to . 𝑀(≫ 𝐿)-dimensional space
  65. Anomaly detection by Kernel PCA [2] :Hyperparameter (784) 𝐿-dimensional

    data space 𝑀(≫ 𝐿)-dimensional space If , is predicted to be an anomalous sample.
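A simplified sketch of this pipeline with scikit-learn's KernelPCA. It scores anomalies by reconstruction error in the input space (via the learned inverse mapping), which is a common simplification rather than exactly Hoffmann's feature-space potential [2]; the threshold rule and all parameter values are illustrative.

```python
import numpy as np
from sklearn.decomposition import KernelPCA

rng = np.random.default_rng(0)
X_normal = rng.random((200, 784))   # training data: normal samples only
X_test = rng.random((10, 784))      # test data: may contain anomalies

# Kernel PCA fitted on normal samples only; fit_inverse_transform=True
# learns an approximate inverse mapping back to the data space.
kpca = KernelPCA(n_components=20, kernel="rbf", gamma=1e-3, fit_inverse_transform=True)
kpca.fit(X_normal)

def score(X):
    """Anomaly score: reconstruction error after projecting onto the 'normal' subspace."""
    return np.linalg.norm(X - kpca.inverse_transform(kpca.transform(X)), axis=1)

threshold = np.quantile(score(X_normal), 0.95)   # hyperparameter: 95th percentile of normal scores
print(score(X_test) > threshold)                 # True = predicted anomalous
```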
  66. 97 Summary: Kernel CLAFIC and its various applications Kernel CLAFIC

    The data become linearly separable after projection to a higher-dimensional space. Application 2 : Anomaly detection Back to data space with the inverse mapping . Anomaly detection by the distance to the subspace. Application 1 : Denoising Data Space Feature space
  67. Grassmannian Learning span 𝒖1 𝐴, … , 𝒖𝑟 𝐴 span

    𝒖1 𝐵, … , 𝒖𝑟 𝐵 , span 𝒖1 𝐶, … , 𝒖𝑟 𝐶 . Classify subspaces by the distance between subspaces. Face images from Yale Face Database http://cvc.cs.yale.edu/cvc/projects/yalefaces/yalefaces.html 𝒀𝐴 , 𝒀𝐵 , 𝒀𝐶 : Training data. Each point is a linear subspace of ℝ𝑁, where N is the size of each image. The principal angles 𝜃𝑖 are obtained by SVD of 𝒀𝐴 ⊤𝒀𝐵 . Projection metric: 𝑑(𝒀𝐴 , 𝒀𝐵 ) = (1/√2)||𝒀𝐴 𝒀𝐴 ⊤ − 𝒀𝐵 𝒀𝐵 ⊤||𝐹 = √(Σ𝑖 sin²𝜃𝑖 ). Binet-Cauchy metric: 𝑑(𝒀𝐴 , 𝒀𝐵 ) = √(1 − ∏𝑖 cos²𝜃𝑖 ). Grassmannian kernel method: Projection kernel 𝑘(𝒀𝐴 , 𝒀𝐵 ) = ||𝒀𝐴 ⊤𝒀𝐵 ||𝐹 ², Binet-Cauchy kernel 𝑘(𝒀𝐴 , 𝒀𝐵 ) = det(𝒀𝐴 ⊤𝒀𝐵 )². Useful when the number of data points used for inference is not fixed, e.g., classification that estimates a label from multiple facial images. [17] Hamm, Jihun, and Daniel D. Lee. "Grassmann discriminant analysis: a unifying view on subspace-based learning." ICML. 2008.
  68. 99 Overview ▪ Introduction: Why do we decompose data? ▪

    A quick review of matrix rank ▪ Singular value decomposition (SVD) and low-rank approximation ▪ Kernel subspace method and its applications (denoising, anomaly detection) ▪ Non-negative matrix factorization ▪ Tensor low-rank decomposition and many-body approximation
  69. Matrix factorization with non-negative constraints ▪ SVD includes negative values

    in the representation ▪ Non-negativity improves interpretability SVD imposes that the factors are orthogonal. NMF imposes that the factors are non-negative. 100
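A small scikit-learn sketch contrasting the two constraints on random non-negative data (the rank and solver settings are illustrative):

```python
import numpy as np
from sklearn.decomposition import NMF

rng = np.random.default_rng(0)
A = rng.random((100, 50))                 # non-negative data matrix

# SVD: factors are orthogonal but may contain negative entries.
U, s, Vt = np.linalg.svd(A, full_matrices=False)
print((U < 0).any(), (Vt < 0).any())      # True True

# NMF: W and H are non-negative, which eases interpretation.
model = NMF(n_components=10, init="random", random_state=0, max_iter=500)
W = model.fit_transform(A)
H = model.components_
print((W >= 0).all(), (H >= 0).all())     # True True
```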
  70. Example of non-negative matrix factorization [3] Lee, Daniel D., and

    H. Sebastian Seung. "Learning the parts of objects by non-negative matrix factorization." Nature 401.6755 (1999): 788-791. … : hyperparameter 101 1 2 3 2429 19 19 in this experiment, 𝑘 = 49.
  71. Example of non-negative matrix factorization [3] Lee, Daniel D., and

    H. Sebastian Seung. "Learning the parts of objects by non-negative matrix factorization." Nature 401.6755 (1999): 788-791. … : hyperparameter 102 1 2 3 2429 19 19 in this experiment, 𝑘 = 49. 𝑛 = 19 ∗ 19, 𝑚 = 2429
  72. The basis consists of facial parts such as eyes, mouth,

    and nose. The face is reconstructed by the addition of the parts. PCA and NMF [3] Lee, Daniel D., and H. Sebastian Seung. "Learning the parts of objects by non-negative matrix factorization." Nature 401.6755 (1999): 788-791. (Positive values are drawn in black, negative values in red) Each basis is face-like. The face is reconstructed by adding up the faces. 103 Representation by NMF Representation by SVD(PCA)
  73. The basis consists of facial parts such as eyes, mouth,

    and nose. The face is reconstructed by the addition of the parts. PCA and NMF [3] Lee, Daniel D., and H. Sebastian Seung. "Learning the parts of objects by non-negative matrix factorization." Nature 401.6755 (1999): 788-791. (Positive values are drawn in black, negative values in red) Each basis is face-like. The face is reconstructed by adding up the faces. 104 Representation by NMF Representation by SVD(PCA)
  74. Challenges of NMF Finding the best decomposition is an NP-hard problem. ※

    in terms of minimizing the Frobenius error. Gradient-based optimization of non-convex functions. (The approximate solution depends on the initial values.) The hyperparameter needs to be tuned. Convex Function Non-Convex Function NMF for a matrix is non-convex. The error :Hyperparameter 105
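To make the non-convexity concrete, here is a sketch of the classical multiplicative updates for the Frobenius objective (in the spirit of Lee & Seung; the small constant eps is mine, added for numerical safety). Running it from different random initializations typically ends at different local optima:

```python
import numpy as np

def nmf_mu(A, k, n_iter=300, seed=0, eps=1e-9):
    """NMF by multiplicative updates for the Frobenius objective."""
    rng = np.random.default_rng(seed)
    n, m = A.shape
    W, H = rng.random((n, k)), rng.random((k, m))
    for _ in range(n_iter):
        H *= (W.T @ A) / (W.T @ W @ H + eps)
        W *= (A @ H.T) / (W @ H @ H.T + eps)
    return W, H

rng = np.random.default_rng(42)
A = rng.random((60, 40))

# The objective is non-convex: different initializations reach different local optima.
for seed in range(3):
    W, H = nmf_mu(A, k=5, seed=seed)
    print(seed, np.linalg.norm(A - W @ H, "fro"))
```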
  75. Nonnegative multiple matrix factorization[6] 111 ▪ Extract patterns from playback

    history for a music recommendation system User Singer User Tag : Who listened to which singer? : Tags for artists (Pop, rock, Jazz...) : Friendship between users. [6] Takeuchi, Koh, et al. "Non-negative multiple matrix factorization." Twenty-third international joint conference on artificial intelligence. 2013. If user i and user j are friends, then . KL divergence ・Prevents over-training, ・Robust to outliers, Weights of auxiliary information
  76. Nonnegative multiple matrix factorization[6] 112 ▪ Extract patterns from playback

    history for a music recommendation system [6] Takeuchi, Koh, et al. "Non-negative multiple matrix factorization." Twenty-third international joint conference on artificial intelligence. 2013. Patterns - between users and singers - between users and genres are simultaneously extracted.
  77. 113 Summary: Non-negative matrix factorization ▪ Non-negativity improves interpretability ▪

    Many task-specific NMFs have been developed. ▪ The best solution cannot be obtained. Finding the best decomposition is an NP-hard problem ※ in terms of minimizing the Frobenius error. Gradient-based optimization of non-convex functions (the approximate solution depends on the initial values). PCA NMF Group NMF[5] NMMF[6] [5] Lee, Hyekyoung, and Seungjin Choi. "Group nonnegative matrix factorization for EEG classification." Artificial Intelligence and Statistics. PMLR, 2009. [6] Takeuchi, Koh, et al. "Non-negative multiple matrix factorization." Twenty-third international joint conference on artificial intelligence. 2013.
  78. 114 Overview ▪ Introduction: Why do we decompose data? ▪

    A quick review of matrix rank ▪ Singular value decomposition (SVD) and low-rank approximation ▪ Kernel subspace method and its applications (denoising, anomaly detection) ▪ Non-negative matrix factorization ▪ Tensor low-rank decomposition and many-body approximation
  79. NMF with missing values ▪ Data often includes missing values

    The approach of simply replacing missing values with averages or zeros doesn't work well. Estimates missing values using the assumption that the data have a low-rank structure. :Missing value Only one value is missing for simplicity. In practice, many missing values are possible. 118
  80. :Missing value NMF with missing values The space where each

    point is a matrix ・The dimension of the subspace is the number of missing values. ・Assuming A has a low-rank structure plus noise Low-rank matrix completion, EM-WNMF[4] Only one value is missing for simplicity. In practice, many missing values are possible. [4] Zhang, Sheng, et al. "Learning from incomplete ratings using non-negative matrix factorization." Proceedings of the 2006 SIAM international conference on data mining. Society for Industrial and Applied Mathematics, 2006.
  81. NMF with missing values Low-rank matrix completion, EM-WNMF[4] [4] Zhang,

    Sheng, et al. "Learning from incomplete ratings using non-negative matrix factorization." Proceedings of the 2006 SIAM international conference on data mining. Society for Industrial and Applied Mathematics, 2006. :Hyper-parameter Rank- matrices The space where each point is a matrix Only one value is missing for simplicity. In practice, many missing values are possible. :Missing value
  82. NMF with missing values Low-rank matrix completion, EM-WNMF[4] Initialize missing

    values and obtain . [4] Zhang, Sheng, et al. "Learning from incomplete ratings using non-negative matrix factorization." Proceedings of the 2006 SIAM international conference on data mining. Society for Industrial and Applied Mathematics, 2006. :Hyper-parameter Rank- matrices The space where each point is a matrix Only one value is missing for simplicity. In practice, many missing values are possible. :Missing value
  83. NMF with missing values Low-rank matrix completion, EM-WNMF[4] M step

    [4] Zhang, Sheng, et al. "Learning from incomplete ratings using non-negative matrix factorization." Proceedings of the 2006 SIAM international conference on data mining. Society for Industrial and Applied Mathematics, 2006. :Hyper-parameter Rank- matrices The space where each point is a matrix Only one value is missing for simplicity. In practice, many missing values are possible. :Missing value Do NMF on to obtain rank- matrix . Initialize missing values and obtain .
  84. NMF with missing values Low-rank matrix completion, EM-WNMF[4] M step

    E step Overwrite observations to obtain . :Observed indices [4] Zhang, Sheng, et al. "Learning from incomplete ratings using non-negative matrix factorization." Proceedings of the 2006 SIAM international conference on data mining. Society for Industrial and Applied Mathematics, 2006. :Hyper-parameter Rank- matrices The space where each point is a matrix Only one value is missing for simplicity. In practice, many missing values are possible. :Missing value Do NMF on to obtain rank- matrix . Initialize missing values and obtain .
  85. NMF with missing values [4] Zhang, Sheng, et al. "Learning

    from incomplete ratings using non-negative matrix factorization." Proceedings of the 2006 SIAM international conference on data mining. Society for Industrial and Applied Mathematics, 2006. Low-rank matrix completion, EM-WNMF[4] M step E step Repeat :Hyper-parameter Rank- matrices The space where each point is a matrix Only one value is missing for simplicity. In practice, many missing values are possible. :Observed indices :Missing value Overwrite observations to obtain . Do NMF on to obtain rank- matrix . Initialize missing values and obtain .
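A minimal sketch of this EM-style completion loop (using scikit-learn's NMF as the M step; mask marks observed entries; all names and the initialization-by-mean choice are mine, and the weighting details of the original EM-WNMF [4] are omitted):

```python
import numpy as np
from sklearn.decomposition import NMF

def em_wnmf(A_obs, mask, k, n_outer=50, seed=0):
    """Low-rank completion: the E step overwrites observed entries, the M step re-fits a rank-k NMF."""
    A_hat = A_obs.copy()
    A_hat[~mask] = A_obs[mask].mean()          # initialize missing entries, e.g., with the observed mean
    for _ in range(n_outer):
        # M step: rank-k NMF of the current completed matrix
        model = NMF(n_components=k, init="random", random_state=seed, max_iter=200)
        W = model.fit_transform(A_hat)
        A_low = W @ model.components_
        # E step: keep the low-rank prediction only where values are missing
        A_hat = np.where(mask, A_obs, A_low)
    return A_hat

# Example: a random non-negative matrix with roughly 30% of its entries missing
rng = np.random.default_rng(0)
A_true = rng.random((30, 20))
mask = rng.random(A_true.shape) > 0.3
A_obs = np.where(mask, A_true, 0.0)

A_completed = em_wnmf(A_obs, mask, k=3)
print(np.abs(A_completed - A_true)[~mask].mean())   # mean error on the missing entries
```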
  86. Issues in EM-WNMF The rank needs to be chosen appropriately. The algorithm

    depends on the initial values. Convergence is slow. [4] Zhang, Sheng, et al. "Learning from incomplete ratings using non-negative matrix factorization." Proceedings of the 2006 SIAM international conference on data mining. Society for Industrial and Applied Mathematics, 2006. The space where each point is a matrix :Hyperparameter Rank- matrices Low-rank matrix completion, EM-WNMF[4]
  87. Further topics of matrix factorization ・ Probabilistic Principal Component Analysis

    128 Each element of the matrix is sampled from a probability distribution. We assume a prior distribution. Confidence intervals can be obtained. Often used in recommendation engines. ・Binary matrix factorization, Boolean matrix factorization Decomposition under the algebra with 1×1 = 1 and 1+1 = 1. [18] Hashemi, Soheil, Hokchhay Tann, and Sherief Reda. "BLASYS: Approximate logic synthesis using Boolean matrix factorization." Proceedings of the 55th Annual Design Automation Conference. 2018. ・Autoencoder and Principal Component Analysis The error function of a linear autoencoder with parameter sharing (𝐖, 𝐖⊤) and activation f(x) = x is equivalent to PCA. The width of the middle layer corresponds to the matrix rank.
  88. Further topics of matrix factorization ・Deep matrix factorization [19] Fan,

    Jicong. "Multi-mode deep matrix and tensor factorization." ICLR, 2021. Activation function ・ Double-descent phenomena in NMF 129 ≃ 𝐗 ∈ ℝ𝑛×𝑚 𝐖 ∈ ℝ𝑛×𝑟 𝐇 ∈ ℝ𝑟×𝑚 [20] Kawakami, Y., et.al.,“Investigating Overparameterization for Non-Negative Matrix Factorization in Collaborative Filtering” RecSys 2021, Late-Breaking Results track, 2021. → Larger rank 𝑟 Over fitting
  89. 130 Overview ▪ Introduction: Why do we decompose data? ▪

    A quick review of matrix rank ▪ Singular value decomposition (SVD) and low-rank approximation ▪ Kernel subspace method and its applications (denoising, anomaly detection) ▪ Non-negative matrix factorization ▪ Tensor low-rank decomposition and many-body approximation
  90. Various data are stored in computers as tensors. 131 NASA,

    https://sdo.gsfc.nasa.gov/data/ Microarray data RGB Image Hyperspectral Image EEG Data = [14] M.Mørup, Data Mining and Knowledge Discovery 2011 [16] A.Cichocki, et al. Nonnegative Matrix and Tensor Factorizations 2009 [16] A.Cichocki, et al. Nonnegative Matrix and Tensor Factorizations 2009 Video Time series data, Signals [16] A.Cichocki, et al. Nonnegative Matrix and Tensor Factorizations 2009 [15] Kosmas Dimitropoulos, et al. Transactions on circuits and systems for video technology 2018
  91. Tensors in modern machine learning 132 Relational Learning A B

    C E D F G ? [7] e.g., Graph link prediction by RESCAL Model Express by tensors [8] Deep Learning e.g., Lightweighting Neural Networks with Tensor Decomposition [9] [7] Nickel, Maximilian, Volker Tresp, and Hans-Peter Kriegel. “A three-way model for collective learning on multi-relational data.” ICML2011 [8] T. Bezdan, N. Bačanin Džakula, International Scientific Conference on Information Technology and Data Related Research, 2019 [9] Y. Liu et al., Tensor Computation for data analysis, 2022
  92. Tensor Hierarchy: Vectors of a Tensor are also Tensors Numbers: 𝑎.

    Vectors [First-order tensors]: 𝒂 = (𝑎1 , 𝑎2 , …, 𝑎𝑛 )⊤. Matrices [Second-order tensors]: 𝐗 = (𝑎𝑖𝑗 ), with entries 𝑎11 , …, 𝑎𝑛𝑛 . Third-order tensors: 𝒯. Fourth-order tensors: 𝑈 = (𝒯(1) , 𝒯(2) , ⋯ , 𝒯(𝑛) ), a collection of third-order tensors. 133
  93. Density estimation, pattern extraction, data mining, denoising, data compression, anomaly

    detection… Extracting features from tensors. 134 Who Shop 𝒫 ≃ 𝒫̄ , minimize ‖𝒫 − 𝒫̄ ‖𝐹 We want to extract patterns or features from tensor-formatted data ▪ Tensor decomposition extracts features from data. mode-2 mode-1
  94. Difficulties in tensor factorization ▪ We need to decide both

    a structure and a rank. CP decomp. 𝒫 ≃ 𝒫̄ Tucker decomp. Decomp. with tensor networks minimize ‖𝒫 − 𝒫̄ ‖𝐹 Tensor Train decomposition Tensor ring decomposition Tensor networks: each node is a tensor, each edge is a mode. [11] Wang, Wenqi, et al. CVPR. 2018. [12] Zheng, Yu-Bang, et al. AAAI 2021. https://tensornetwork.org [13] A. Cichocki, et al., Tensor Networks for Dimensionality Reduction and Large-scale Optimization, 2016
  95. Difficulties in tensor factorization 136 ▪ We need to decide

    both a structure and a rank. CP decomp. 𝒫 ≃ 𝒫̄ ▪ Optimization is difficult. Tensor decomposition is often ill-posed or NP-hard. ・Finding the best rank-1 CP decomposition minimizing the Frobenius norm is NP-hard. Tucker decomp. Decomp. with tensor networks A convex, stable, and intuitively designable tensor decomposition is desired. minimize ‖𝒫 − 𝒫̄ ‖𝐹 is non-convex. The solution may be indeterminate. The objective function is typically non-convex. ・Dependence on initial values: no guarantee of reaching the best solution. Tensor Train decomposition Tensor ring decomposition
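For concreteness, CP and Tucker decompositions can be computed with, e.g., the TensorLy library (not referenced in the lecture; assuming a recent TensorLy version, and the rank choices below are arbitrary):

```python
import numpy as np
import tensorly as tl
from tensorly.decomposition import parafac, tucker

rng = np.random.default_rng(0)
T = tl.tensor(rng.random((10, 12, 8)))

# CP decomposition: sum of R rank-1 tensors
cp = parafac(T, rank=3)
T_cp = tl.cp_to_tensor(cp)

# Tucker decomposition: small core tensor multiplied by a factor matrix along each mode
tuck = tucker(T, rank=[3, 4, 2])
T_tucker = tl.tucker_to_tensor(tuck)

print(tl.norm(T - T_cp) / tl.norm(T))       # relative reconstruction errors
print(tl.norm(T - T_tucker) / tl.norm(T))
```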
  96. Many-body approximation for non-negative tensors[10] 139 Control relation between mode-k

    and mode-l. Control relation among mode-j, -k and -l. Natural parameter of exponential distribution family. Energy function
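Schematically, and in my notation following the general form of the energy-based model in [10] (see the paper for the precise definition), a non-negative third-order tensor is parameterized as

```latex
P_{ijk} \;\propto\; \exp\!\Big(
\underbrace{\theta^{(1)}_{i} + \theta^{(2)}_{j} + \theta^{(3)}_{k}}_{\text{one-body terms}}
\;+\;
\underbrace{\theta^{(12)}_{ij} + \theta^{(13)}_{ik} + \theta^{(23)}_{jk}}_{\text{two-body interactions}}
\;+\;
\underbrace{\theta^{(123)}_{ijk}}_{\text{three-body interaction}}
\Big),
```

where the θ are the natural parameters of the exponential family mentioned on this slide. The m-body approximation keeps only the interactions among at most m modes; keeping only the one-body terms recovers the rank-1 (mean-field) approximation.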
  97. Many-body approximation for non-negative tensors[10] 141 One-body approx. Rank-1 approximation

    (mean-field approximation) Two-body approx. Control relation between mode-k and mode-l. Larger Capability Two-body Interaction
  98. Many-body approximation for non-negative tensors[10] 142 One-body approx. Rank-1 approximation

    (mean-field approximation) Two-body approx. Two-body Interaction Three-body approx. Larger Capability The global optimal solution minimizing KL divergence from can be obtained by a convex optimization. Intuitive modeling focusing on interactions between modes Control relation among mode-j, -k and -l. Control relation between mode-k and mode-l. Three-body Interaction
  99. We regard a normalized tensor as a discrete joint probability

    distribution whose sample space is an index set. Coordinate transformation. Theoretical idea behind many-body approx. 143 Each index is a discrete random variable. 𝜽-representation: representation with the natural parameters of an exponential family. Geometry of 𝜽-space: the low-rank space is not flat and is difficult to optimize over; a flat subspace is not. We construct a flat subspace by focusing on interactions among tensor modes. Describing tensor factorization in the 𝜽-coordinate system makes it a convex problem. We use information geometry to formulate the factorization as a convex problem.
  100. Example tensor reconstruction by many-body approx. Larger capability Reconstruction for

    40×40×3×10 tensor. (width, height, colors, # images) Color depends on image index. Shape of each image. Color is uniform within each image. Intuitive model design that captures the relationship between modes. Color depends on pixel. Three-body Approx. 152
  101. Color image is decomposed into shape × color × ≃

    = Shape of each image Color of each image
  102. Data completion by many-body approx. and the em-algorithm m-step 𝐏

    ← low_rank_approx(𝐏) e-step 𝐏𝜏 ← 𝐓𝜏 𝜏 : observed indices Low-rank Tensor Completion 𝑒-projection 𝑚-projection Not flat. It causes non-uniqueness of the projection. Which structure is better for decomposition? How to choose the rank? Low-rank space 155
  103. Data completion by many-body approx. and the em-algorithm m-step 𝐏

    ← low_rank_approx(𝐏) e-step 𝐏𝜏 ← 𝐓𝜏 𝜏 : observed indices Low-rank Tensor Completion 𝑒-projection Low-body many_body_approx The low-body space is flat. It ensures uniqueness of the projection. Intuitive modeling is possible! 𝑚-projection 156
  104. Data completion by many-body approx. and the em-algorithm Missing area

    GT Prediction Interactions among correlated modes Reconstructing traffic speed data of a 28×24×12×4 tensor with missing values. (days, hours, min, lanes) 157
  105. Data completion by many-body approx. and the em-algorithm Missing area

    GT Prediction Interactions among correlated modes Interactions among non-correlated modes Fit score: 0.82 Reconstructing traffic speed data of a 28×24×12×4 tensor with missing values. (days, hours, min, lanes) 158
  106. 159 Summary: Many-body approximation for tensors https://arxiv.org/abs/2209.15338 ・More intuitive design

    than rank tuning One-body Approx. Three-body Approx. Two-body Approx. Many-body Approximation Bias (magnetic field) Weight (electron-electron interaction) ・Convex optimization always provides a unique solution
  107. 160 Report assignments 1. Prove the Eckart-Young or Eckart-Young-Mirsky Theorem.

    2. There are many algorithms for Non-negative Matrix Factorization, e.g., HALS-NMF, MU-NMF, etc. Choose one of them and derive the algorithm. You do not have to show the convergence guarantee. There are four assignments below. Choose one of them and write your report. Please indicate which assignment you have chosen. The format is flexible but should fit within 2 sheets of A4 paper. Handwriting is also acceptable. Note: There are many variations of NMF, such as: - NMF optimizing the Frobenius norm - NMF optimizing the KL divergence - NMF optimizing the Frobenius norm with a regularization term. Please indicate which kind of NMF you are discussing.
  108. 161 Report assignments 3. Visit Google Scholar (https://scholar.google.com/) and search

    for a paper related to SVD, NMF, or tensor factorization. Summarize the motivation, method, and novelty of the paper. Note: Please clearly indicate the paper title you chose. Example papers: - Lee, Daniel D., and H. Sebastian Seung. "Learning the parts of objects by non-negative matrix factorization." nature 401.6755 (1999): 788-791. - Takeuchi, Koh, et al. "Non-negative multiple matrix factorization." Twenty-third international joint conference on artificial intelligence. 2013. - Lee, Hyekyoung, and Seungjin Choi. "Group nonnegative matrix factorization for EEG classification." Artificial Intelligence and Statistics. PMLR, 2009. - Ding, Chris, Xiaofeng He, and Horst D. Simon. "On the equivalence of nonnegative matrix factorization and spectral clustering." Proceedings of the 2005 SIAM international conference on data mining. Society for Industrial and Applied Mathematics, 2005. - Lee, Daniel, and H. Sebastian Seung. "Algorithms for non-negative matrix factorization." Advances in neural information processing systems 13 (2000). - Zhang, Jun, Yong Yan, and Martin Lades. "Face recognition: eigenface, elastic matching, and neural nets." Proceedings of the IEEE 85.9 (1997): 1423-1435. Recommended keywords: - Non-negative matrix factorization, SVD, Tensor decomposition, Eigen face, weighted NMF, low-rank completion
  109. 162 Report assignments 4. Follow the steps below using Python,

    Matlab, R, C/C++, Fortran, or Julia: (1) Convert a grayscale image into a two-dimensional array (matrix). (2) Perform the rank-r approximation on it using SVD. (3) Plot a figure with rank on the horizontal axis and reconstruction error on the vertical axis. Observe how the reconstructed image changes as the rank is varied. (4) Plot a figure of the singular value distribution obtained by SVD for the image. The vertical axis of the plot is the nth largest singular value, and the horizontal axis is n. Rank Reconstruction Error n n-th singular value Note: Please paste a part of the source code you have written into the report. You do not have to paste all source code.
  110. 163 Overview ▪ Introduction: Why do we decompose data? ▪

    A quick review of matrix rank ▪ Singular value decomposition (SVD) and low-rank approximation ▪ Kernel subspace method and its applications (denoising, anomaly detection) ▪ Non-negative matrix factorization ▪ Tensor low-rank decomposition and many-body approximation
  111. For further study ▪ Textbooks for Matrix and Tensor Decomposition

    ▪ A lecture series by Dr. Steve Brunton ・Speeding up SVD ・SVD for linear regression ・SVD for face recognition ・How to choose ranks ▪ Tensor libraries [iTensor] … 166
  112. 167 References [1] Mika, Sebastian, et al. "Kernel PCA and

    de-noising in feature spaces." Advances in neural information processing systems 11 (1998). [2] Hoffmann, Heiko. "Kernel PCA for novelty detection." Pattern recognition 40.3 (2007): 863-874. [3] Lee, Daniel D., and H. Sebastian Seung. "Learning the parts of objects by non-negative matrix factorization." Nature 401.6755 (1999): 788-791. [4] Zhang, Sheng, et al. "Learning from incomplete ratings using non-negative matrix factorization." Proceedings of the 2006 SIAM international conference on data mining. Society for Industrial and Applied Mathematics, 2006. [5] Lee, Hyekyoung, and Seungjin Choi. "Group nonnegative matrix factorization for EEG classification." Artificial Intelligence and Statistics. PMLR, 2009. [6] Takeuchi, Koh, et al. "Non-negative multiple matrix factorization." Twenty-third international joint conference on artificial intelligence. 2013. [7] Nickel, Maximilian, Volker Tresp, and Hans-Peter Kriegel. "A three-way model for collective learning on multi-relational data." ICML 2011. [8] T. Bezdan, N. Bačanin Džakula, International Scientific Conference on Information Technology and Data Related Research, 2019. [9] Y. Liu et al., "Tensor Computation for data analysis", 2022. [10] Ghalamkari, Kazu, Mahito Sugiyama, and Yoshinobu Kawahara. "Many-body approximation for non-negative tensors." Advances in Neural Information Processing Systems 36 (2024). [11] Wang, Wenqi, et al. "Wide compression: Tensor ring nets." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2018. [12] Zheng, Yu-Bang, et al. "Fully-connected tensor network decomposition and its application to higher-order tensor completion." Proceedings of the AAAI conference on artificial intelligence. Vol. 35. No. 12. 2021. [13] Cichocki, Andrzej, et al. "Tensor networks for dimensionality reduction and large-scale optimization: Part 1 low-rank tensor decompositions." Foundations and Trends® in Machine Learning 9.4-5 (2016): 249-429.
  113. 168 References [14] Mørup, Morten. "Applications of tensor (multiway array)

    factorizations and decompositions in data mining." Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery 1.1 (2011): 24-40. [15] Dimitropoulos, Kosmas, et al. "Classification of multidimensional time-evolving data using histograms of grassmannian points." IEEE Transactions on Circuits and Systems for Video Technology 28.4 (2016): 892-905. [16] Andrzej Cichocki, Rafal Zdunek, Anh Huy Phan, and Shun-ichi Amari. Non-negative matrix and tensor factorizations: applications to exploratory multi-way data analysis and blind source separation. John Wiley & Sons, 2009. [17] Hamm, Jihun, and Daniel D. Lee. "Grassmann discriminant analysis: a unifying view on subspace-based learning." ICML. 2008. [18] Hashemi, Soheil, Hokchhay Tann, and Sherief Reda. "BLASYS: Approximate logic synthesis using Boolean matrix factorization." Proceedings of the 55th Annual Design Automation Conference. 2018. [19] Fan, Jicong. "Multi-mode deep matrix and tensor factorization." ICLR, 2021. [20] Kawakami, Y., et al., "Investigating Overparameterization for Non-Negative Matrix Factorization in Collaborative Filtering." RecSys 2021, Late-Breaking Results track, 2021. Acknowledgements I would like to express my gratitude to Dr. Profir-Petru Pârțachi from the National Institute of Informatics for his valuable comments on the structure and content of these lecture slides. Used datasets MNIST https://yann.lecun.com/exdb/mnist/ COIL100 https://www1.cs.columbia.edu/CAVE/software/softlib/coil-100.php IRIS https://archive.ics.uci.edu/dataset/53/iris Yale Face Database http://cvc.cs.yale.edu/cvc/projects/yalefaces/yalefaces.html