Ghalamkari¹,², Mahito Sugiyama¹,²
International Conference on Information Geometry for Data Science (IG4DS 2022)
1: The Graduate University for Advanced Studies, SOKENDAI
2: National Institute of Informatics

Low-rank approximation approximates data with a linear combination of fewer bases (principal components) for feature extraction, memory reduction, and pattern discovery. 😀
A non-negative constraint improves interpretability.
Low-rank approximations with non-negative constraints are based on gradient methods. → Appropriate settings for the stopping criterion, learning rate, and initial values are necessary. 😢
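To make the dependence on tuning choices concrete, here is a minimal sketch of an iterative NMF under the KL divergence. It uses the standard multiplicative updates of Lee and Seung, which are chosen here only for illustration and are not necessarily the baseline meant on this slide; they have no explicit learning rate, but the result still depends on the initial values and the stopping rule. The names `kl_nmf`, `n_iter`, `tol`, and `seed` are hypothetical.

```python
import numpy as np

def gen_kl(X, Y):
    """Generalized KL divergence D(X || Y) for strictly positive arrays."""
    return float(np.sum(X * np.log(X / Y) - X + Y))

def kl_nmf(X, rank, n_iter=200, tol=1e-8, seed=0):
    """Iterative KL-NMF; the output depends on seed, n_iter, and tol."""
    rng = np.random.default_rng(seed)
    W = rng.random((X.shape[0], rank)) + 0.1   # initial values (a free choice)
    H = rng.random((rank, X.shape[1])) + 0.1
    history = [gen_kl(X, W @ H)]
    for _ in range(n_iter):
        # Multiplicative updates for the KL cost.
        H *= (W.T @ (X / (W @ H))) / W.sum(axis=0)[:, None]
        W *= ((X / (W @ H)) @ H.T) / H.sum(axis=1)[None, :]
        history.append(gen_kl(X, W @ H))
        if history[-2] - history[-1] < tol:    # stopping criterion (a free choice)
            break
    return W, H, history

X = np.random.default_rng(1).random((6, 5)) + 0.1
W, H, history = kl_nmf(X, rank=2)
```

Changing `seed`, `n_iter`, or `tol` changes the factors that come out, which is the kind of sensitivity the slide complains about.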

Rank-1 missing NMF: Solve the task as a coupled NMF. Find the most dominant factor rapidly.
No worries about initial values, stopping criterion, and learning rate. 😄
Rank-1 = rank (1,1,1)
Information-geometric analysis using distributions on DAGs that correspond to data structures.

Log-linear model on Directed Acyclic Graph (DAG)
□ DAG (poset): For all 𝑠1, 𝑠2, 𝑠3 ∈ 𝑆, the following three properties are satisfied.
(1) Reflexivity: 𝑠1 ≤ 𝑠1
(2) Antisymmetry: 𝑠1 ≤ 𝑠2, 𝑠2 ≤ 𝑠1 ⇒ 𝑠1 = 𝑠2
(3) Transitivity: 𝑠1 ≤ 𝑠2, 𝑠2 ≤ 𝑠3 ⇒ 𝑠1 ≤ 𝑠3
□ Log-linear model on DAG: We define the log-linear model on a DAG as a mapping 𝑝: 𝑆 → (0,1). Natural parameters 𝜽 describe the model (𝜃-space). We can also describe the model by expectation parameters 𝜼 (𝜂-space), using the Möbius function.
Mahito Sugiyama, Hiroyuki Nakahara and Koji Tsuda, "Tensor balancing on statistical manifold", ICML 2017.
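As a rough illustration of the 𝜃-space, the sketch below computes natural parameters for a positive tensor. It assumes the poset is the set of tensor indices ordered componentwise, so that log 𝑝(𝑠) is the sum of 𝜃(𝑠′) over all 𝑠′ ≤ 𝑠; that choice of order and the function names are assumptions for illustration.

```python
import numpy as np

def theta_from_tensor(P):
    """Natural parameters: log P[s] = sum of theta[s'] over all s' <= s."""
    T = np.log(P)
    for axis in range(P.ndim):
        # Undoing a cumulative sum along an axis is a first difference.
        T = np.diff(T, axis=axis, prepend=0.0)
    return T

def tensor_from_theta(T):
    """Inverse map: cumulative sums of theta along every axis, then exp."""
    L = T
    for axis in range(T.ndim):
        L = np.cumsum(L, axis=axis)
    return np.exp(L)

rng = np.random.default_rng(0)
P = rng.random((3, 4, 2)) + 0.1
P /= P.sum()                       # positive tensor normalized to sum 1
theta = theta_from_tensor(P)
```

With this order, `theta[0, 0, 0]` plays the role of the normalizer, and the round trip `tensor_from_theta(theta)` returns `P`.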

Relation between distribution and tensor: (𝑖, 𝑗, 𝑘) are the indices of the tensor and range over the index set; the probability values are the tensor values 𝒫𝑖𝑗𝑘.
Möbius inversion formula
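The correspondence between tensor values and 𝜂-parameters can be sketched as follows. The code assumes the componentwise order on indices, with 𝜂(𝑠) defined as the sum of 𝒫 over all 𝑠′ ≥ 𝑠; under that assumption the Möbius inversion reduces to differences of neighbouring 𝜂 values along each axis. Function names are hypothetical.

```python
import numpy as np

def eta_from_tensor(P):
    """Expectation parameters: eta[s] = sum of P[s'] over all s' >= s."""
    E = P
    for axis in range(P.ndim):
        E = np.flip(np.cumsum(np.flip(E, axis=axis), axis=axis), axis=axis)
    return E

def tensor_from_eta(E):
    """Möbius inversion for this order: eta[i] - eta[i+1] along each axis."""
    P = E
    for axis in range(E.ndim):
        # np.diff gives eta[i+1] - eta[i]; append 0 past the last index, flip the sign.
        P = -np.diff(P, axis=axis, append=0.0)
    return P

rng = np.random.default_rng(0)
P = rng.random((3, 4, 2))
P /= P.sum()
eta = eta_from_tensor(P)
```

For a normalized tensor `eta[0, 0, 0]` equals 1, and `tensor_from_eta(eta)` recovers `P`.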

Rank-1 subspace
Rank-1 condition (𝜽-representation): All of its many-body 𝜃-parameters are 0. (One-body parameter / many-body parameter)
The rank-1 subspace is e-flat. The projection is unique.
We can find the projection destination by a gradient method. But gradient methods require appropriate settings for the stopping criterion, learning rate, and initial values. 😢
… condition with the 𝜂-parameter.

… the best rank-1 approximation
Rank-1 subspace (one-body parameter / many-body parameter)
Rank-1 condition (𝜽-representation): All of its many-body 𝜃-parameters are 0.
Rank-1 condition (𝜼-representation): 𝜂𝑖𝑗𝑘 = 𝜂𝑖11 𝜂1𝑗1 𝜂11𝑘
All 𝜼-parameters after the projection are identified. Using the Möbius inversion formula, we find the projection destination.
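The route described here (fix the 𝜂-parameters, then invert) can be sketched in a few lines. The sketch assumes a normalized tensor, the componentwise order on indices, and that the one-body 𝜂-parameters are unchanged by the projection, as a later slide states for the m-projection. Function names are hypothetical.

```python
import numpy as np

def eta_from_tensor(P):
    E = P
    for axis in range(P.ndim):
        E = np.flip(np.cumsum(np.flip(E, axis=axis), axis=axis), axis=axis)
    return E

def tensor_from_eta(E):
    P = E
    for axis in range(E.ndim):
        P = -np.diff(P, axis=axis, append=0.0)   # Möbius inversion along one axis
    return P

def rank1_projection_via_eta(P):
    """Projection onto the rank-1 subspace, computed in the eta-coordinates."""
    E = eta_from_tensor(P)
    # One-body eta-parameters are kept as they are.
    e1, e2, e3 = E[:, 0, 0], E[0, :, 0], E[0, 0, :]
    # Rank-1 condition: eta_ijk = eta_i11 * eta_1j1 * eta_11k.
    E_bar = np.einsum("i,j,k->ijk", e1, e2, e3)
    return tensor_from_eta(E_bar)

rng = np.random.default_rng(0)
P = rng.random((3, 4, 2))
P /= P.sum()
Q = rank1_projection_via_eta(P)
```

No iteration, learning rate, or initial value appears anywhere, and `Q` coincides with the product of the three marginal distributions of `P`.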

Mean-field approximation and rank-1 approximation
Best rank-1 tensor formula for minimizing KL divergence (𝑑 = 3): the best rank-1 tensor, which minimizes the KL divergence from 𝒫, is given as a product of three normalized vectors that depend only on 𝑖, only on 𝑗, and only on 𝑘, respectively.
We reproduce the result in K. Huang, et al. "Kullback-Leibler principal component for tensors is not NP-hard." ACSSC 2017.
By the way, Frobenius error minimization is NP-hard.
A tensor with 𝑑 indices is a joint distribution with 𝑑 random variables. A vector with only 1 index is an independent distribution with only one random variable.
Rank-1 approximation approximates a joint distribution by a product of independent distributions.
Mean-field approximation: a methodology in physics for reducing a many-body problem to a one-body problem.
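A small numerical check of the closed form, written as a sketch: the candidate is the product of the three marginal sums of 𝒫, rescaled so that the total sum is kept, and it is compared against random rank-1 tensors under the generalized KL divergence. The exact normalization is derived here by setting the gradient to zero and is an assumption about how the slide writes the formula; all names are hypothetical.

```python
import numpy as np

def best_rank1_kl(P):
    """Product of the three marginal sums, rescaled to keep the total sum of P."""
    total = P.sum()
    m1 = P.sum(axis=(1, 2))   # depends only on i
    m2 = P.sum(axis=(0, 2))   # depends only on j
    m3 = P.sum(axis=(0, 1))   # depends only on k
    return np.einsum("i,j,k->ijk", m1, m2, m3) / total**2

def gen_kl(P, Q):
    """Generalized KL divergence from P to Q for strictly positive arrays."""
    return float(np.sum(P * np.log(P / Q) - P + Q))

rng = np.random.default_rng(0)
P = rng.random((4, 3, 5)) + 0.05
Q = best_rank1_kl(P)

# Random positive rank-1 tensors used only as comparison points.
candidates = [
    np.einsum("i,j,k->ijk", rng.random(4) + 0.05, rng.random(3) + 0.05, rng.random(5) + 0.05)
    for _ in range(200)
]
best_random = min(gen_kl(P, C) for C in candidates)
```

Here `gen_kl(P, Q)` is never larger than `best_random`, and the whole computation is three sums and one outer product.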

Rank-1 condition (𝜽-representation)
rank(𝒫) = 1 ⟺ all of its many-body 𝜃-parameters are 0.
Expand the tensor by focusing on the 𝑚-th axis into a rectangular matrix 𝜃(𝑚) (mode-𝑚 expansion).
[Figure: mode-𝑚 expansions of 𝜃, with entries 𝜃211, 𝜃311, 𝜃121, 𝜃131, 𝜃112, 𝜃113 and a region marked = 0]
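The condition can be checked numerically on a small example. The sketch below builds a positive rank-1 tensor, computes 𝜃 under the assumed componentwise order on indices, and forms the mode-𝑚 expansion by moving axis 𝑚 to the front and flattening the rest. Treating a parameter as many-body when at least two of its indices differ from the first position is an interpretation of the slide; the names are hypothetical.

```python
import numpy as np

def unfold(T, mode):
    """Mode-m expansion: axis `mode` becomes the rows, all other axes the columns."""
    return np.moveaxis(T, mode, 0).reshape(T.shape[mode], -1)

def theta_from_tensor(P):
    T = np.log(P)
    for axis in range(P.ndim):
        T = np.diff(T, axis=axis, prepend=0.0)
    return T

rng = np.random.default_rng(0)
u, v, w = rng.random(3) + 0.1, rng.random(4) + 0.1, rng.random(2) + 0.1
P = np.einsum("i,j,k->ijk", u, v, w)       # a positive rank-1 tensor
theta = theta_from_tensor(P)

# Many-body positions: at least two indices are not the first index.
idx = np.indices(P.shape)
many_body = (idx > 0).sum(axis=0) >= 2
max_many_body = float(np.abs(theta[many_body]).max())
```

`max_many_body` is zero up to rounding error, and every `unfold(P, m)` has matrix rank 1.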

Example: Reduce the rank of an (8,8,3) tensor to (5,8,3) or less
STEP1: Choose a bingo location. [Figure legend: 𝜃 is zero / 𝜃 can be any value]
STEP2: Replace the bingo part with the best rank-1 tensor, using the best rank-1 approximation formula.
The shaded areas do not change their values in the projection.
The best tensor is obtained in the specified bingo space. 😄
There is no guarantee that it is the best rank-(5,8,3) approximation. 😢
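The two steps can be sketched on a random (8,8,3) tensor. The sketch assumes that a bingo location corresponds to a block of consecutive slices along one axis, which is an interpretation of the example and is not spelled out on the slide; the block `2:6` is an arbitrary, hypothetical choice, and the rank-1 replacement uses the product of marginal sums under the KL divergence.

```python
import numpy as np

def best_rank1_kl(P):
    """Product of the marginal sums, rescaled to keep the total sum of P."""
    total = P.sum()
    m1, m2, m3 = P.sum(axis=(1, 2)), P.sum(axis=(0, 2)), P.sum(axis=(0, 1))
    return np.einsum("i,j,k->ijk", m1, m2, m3) / total**2

rng = np.random.default_rng(0)
P = rng.random((8, 8, 3)) + 0.05

# STEP1 (hypothetical choice): 4 consecutive slices along the first axis.
start, stop = 2, 6

# STEP2: replace that block with its best rank-1 tensor; the rest is untouched.
Q = P.copy()
Q[start:stop] = best_rank1_kl(P[start:stop])

# Rank of the mode-1 expansion: 4 proportional rows plus 4 unchanged rows.
mode1_rank = int(np.linalg.matrix_rank(Q.reshape(8, -1)))
```

The four replaced slices are proportional to one another, so `mode1_rank` is at most 5, while entries outside the block keep their original values.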

et al., 2013)
𝚽𝑖𝑗 = 0 if 𝐗𝑖𝑗 is missing, 1 otherwise; 𝚽 is applied by the element-wise product.
□ Collect missing values in a corner of the matrix to solve the problem as a coupled NMF (equivalent).
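A minimal sketch of how the binary weight 𝚽 enters the cost: only observed entries contribute. The generalized KL divergence is assumed as the cost, and `weighted_kl` is a hypothetical helper, not a function from any library.

```python
import numpy as np

def weighted_kl(X, Phi, W, H):
    """Generalized KL cost summed over observed entries (Phi == 1) only."""
    Y = W @ H
    observed = Phi.astype(bool)
    x, y = X[observed], Y[observed]
    return float(np.sum(x * np.log(x / y) - x + y))

rng = np.random.default_rng(0)
X = rng.random((5, 4)) + 0.1
Phi = np.ones_like(X)
Phi[3:, 2:] = 0                             # missing values gathered in one corner
X_missing = np.where(Phi == 1, X, np.nan)   # missing entries stored as NaN

W = rng.random((5, 1)) + 0.1                # rank-1 factors
H = rng.random((1, 4)) + 0.1
cost = weighted_kl(X_missing, Phi, W, H)
```

Because masked entries are dropped before the sum, the NaN placeholders never reach the cost.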

𝑿, 𝒀, 𝒁 are simultaneously rank-1 decomposable. ⇔ They can be written as 𝒘 ⊗ 𝒉, 𝒂 ⊗ 𝒉, 𝒘 ⊗ 𝒃.
Simultaneous rank-1 𝜽-condition: All of its two-body 𝜃-parameters are 0. (One-body parameter / two-body parameter)
Simultaneous rank-1 𝜼-condition: 𝜂𝑖𝑗 = 𝜂𝑖1 𝜂1𝑗
The simultaneous rank-1 subspace is e-flat. The projection is unique.
The m-projection does not change the one-body 𝜂-parameters (Shun-ichi Amari, Information Geometry and Its Applications, 2008, Theorem 11.6).
All 𝜼-parameters after the projection are identified.
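As an illustration of a closed-form simultaneous rank-1 decomposition, the sketch below minimizes the summed generalized KL cost for 𝑿 ≈ 𝒘 ⊗ 𝒉, 𝒀 ≈ 𝒂 ⊗ 𝒉, 𝒁 ≈ 𝒘 ⊗ 𝒃. The expressions are obtained here by setting the gradient of that cost to zero; they are not copied from the slides, and the KL cost, the scale convention (both 𝒘 and 𝒉 sum to the square root of the total of 𝑿), and all names are assumptions.

```python
import numpy as np

def gen_kl(X, Y):
    return float(np.sum(X * np.log(X / Y) - X + Y))

def rank1_nmmf(X, Y, Z):
    """Stationary point of KL(X, w h^T) + KL(Y, a h^T) + KL(Z, w b^T).
    Shapes: X is (I, J), Y is (N, J), Z is (I, M)."""
    sx, sy, sz = X.sum(), Y.sum(), Z.sum()
    root = np.sqrt(sx)                    # fixes the scale freedom between w and h
    w = (X.sum(axis=1) + Z.sum(axis=1)) * root / (sx + sz)
    h = (X.sum(axis=0) + Y.sum(axis=0)) * root / (sx + sy)
    a = Y.sum(axis=1) / root
    b = Z.sum(axis=0) / root
    return w, h, a, b

def nmmf_cost(X, Y, Z, w, h, a, b):
    return (gen_kl(X, np.outer(w, h)) + gen_kl(Y, np.outer(a, h))
            + gen_kl(Z, np.outer(w, b)))

rng = np.random.default_rng(0)
X = rng.random((5, 4)) + 0.1
Y = rng.random((3, 4)) + 0.1
Z = rng.random((5, 2)) + 0.1
w, h, a, b = rank1_nmmf(X, Y, Z)
base = nmmf_cost(X, Y, Z, w, h, a, b)

# Costs at randomly perturbed factors, for comparison only.
perturbed = [
    nmmf_cost(X, Y, Z, *[v * np.exp(0.1 * rng.standard_normal(v.shape)) for v in (w, h, a, b)])
    for _ in range(100)
]
```

The cost is convex in the logarithms of the factors, so the stationary point is a minimizer, and none of the values in `perturbed` falls below `base`.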

Missing values tend to be in certain columns in some real datasets, e.g., a disconnected sensing device or an optional answer field in a questionnaire form.
🙅 Missing values are evenly distributed in each row and column.
🙆 Missing values are heavily distributed in certain rows and columns.

KL-WNMF
- Relative runtime < 1 means A1GM is faster than KL-WNMF.
- Relative error > 1 means the reconstruction error of A1GM is worse than that of KL-WNMF.
- Increase rate is the ratio of the number of missing values after the addition of missing values at step 1.
5 to 10 times faster!
Find the best solution. Add missing values. Accuracy decreases.

𝚽𝑖𝑗 = 0 if 𝐗𝑖𝑗 is missing, 1 otherwise
□ The rank of the weight matrix is 2 after adding missing values.
□ Can we exactly solve rank-1 NMF if rank(𝚽) = 2?
[Figure: examples of 𝐗 and 𝚽 with rank(𝚽) = 2]

… approximation of extended NMMF (equivalent)
𝚽𝑖𝑗 = 0 if 𝐗𝑖𝑗 is missing, 1 otherwise
If rank(𝚽) ≤ 2, the matrix can be transformed by permutation into the form with the missing values collected in a corner.
We can exactly solve rank-1 NMF with missing values by permutation if rank(𝚽) ≤ 2.
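The permutation idea can be sketched as follows. The code tests the corner form directly (all missing entries gathered in one rectangular block) instead of computing rank(𝚽), then solves the rank-1 problem on the three observed blocks in closed form under the generalized KL cost. The closed form is derived here from the zero-gradient condition, at least one fully observed row and column is assumed, and the names are hypothetical; this is not presented as the A1GM implementation.

```python
import numpy as np

def to_corner_form(Phi):
    """Row/column orders that gather missing entries (0) bottom-right, if possible."""
    missing = Phi == 0
    rows = np.argsort(missing.any(axis=1), kind="stable")
    cols = np.argsort(missing.any(axis=0), kind="stable")
    r = int((~missing.any(axis=1)).sum())    # rows without missing values
    c = int((~missing.any(axis=0)).sum())    # columns without missing values
    expected = np.zeros_like(missing)
    expected[r:, c:] = True
    ok = bool(np.array_equal(missing[np.ix_(rows, cols)], expected))
    return rows, cols, r, c, ok

def rank1_missing_nmf(X, Phi):
    rows, cols, r, c, ok = to_corner_form(Phi)
    if not ok:
        raise ValueError("missing entries cannot be gathered in one corner")
    Xp = X[np.ix_(rows, cols)]
    A, B, C = Xp[:r, :c], Xp[:r, c:], Xp[r:, :c]   # the three observed blocks
    root = np.sqrt(A.sum())
    w, h = np.empty(X.shape[0]), np.empty(X.shape[1])
    w[rows[:r]] = (A.sum(axis=1) + B.sum(axis=1)) * root / (A.sum() + B.sum())
    w[rows[r:]] = C.sum(axis=1) / root
    h[cols[:c]] = (A.sum(axis=0) + C.sum(axis=0)) * root / (A.sum() + C.sum())
    h[cols[c:]] = B.sum(axis=0) / root
    return w, h

def masked_kl(X, Phi, w, h):
    Y, m = np.outer(w, h), Phi == 1
    return float(np.sum(X[m] * np.log(X[m] / Y[m]) - X[m] + Y[m]))

rng = np.random.default_rng(0)
X = rng.random((6, 5)) + 0.1
Phi = np.ones((6, 5), dtype=int)
Phi[np.ix_([1, 4], [0, 3])] = 0     # missing entries confined to two rows and two columns
w, h = rank1_missing_nmf(X, Phi)
base = masked_kl(X, Phi, w, h)
perturbed = [
    masked_kl(X, Phi, w * np.exp(0.1 * rng.standard_normal(6)),
              h * np.exp(0.1 * rng.standard_normal(5)))
    for _ in range(100)
]
```

A mask such as the 2×2 identity cannot be brought into corner form, and `to_corner_form` reports that through its last return value.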

Conclusion
Rank-1 condition (𝜼-representation): 𝜂̄𝑖𝑗𝑘 = 𝜂̄𝑖11 𝜂̄1𝑗1 𝜂̄11𝑘
Rank-1 condition (𝜽-representation): All many-body 𝜃̄𝑖𝑗𝑘 are 0.
□ Closed formula of the best rank-1 NMMF
□ A1GM: Faster rank-1 NMF with missing values
The best rank-1 approximation for NMMF: data structure, DAG, information geometry.