Information Geometry of Operator Scaling PART I: survey of scaling problems

Information Geometry of Operator Scaling PART I: survey of scaling
problems Π Π k k+ Takeru Matsuda (UTokyo, RIKEN) Tasuku Soma (UTokyo) June , /

Talk overview Part I: survey of scaling problems • What
is scaling problem? • Matrix scaling and operator scaling with applications Part II: information geometry and scaling • My recent work (with Takeru Matsuda) • Operator scaling and quantum information geometry /

Summary of Part I • Scaling problems are linear algeraic
problems arising in surprisingly many ﬁelds. • Matrix scaling: optimal transport, statistics, machine learning, combinatorial optimization, ... • Operator scaling: combinatorial optimization, computational complexity, noncommutative algebra, analysis, ... • Both can be solved by simple alternating algorithm (Sinkhorn) /

Matrix scaling /

Matrix scaling Input: nonnegative matrix A ∈ Rm×n + Output:
positive diagonal matrices L ∈ Rm ++ , R ∈ Rn ++ s.t. (LAR) n = m m and (LAR) m = n n Applications • Markov chain estimation [Richard Sinkhorn, 6 ] • Contingency table analysis [Morioka and Tsuda, ] • Optimal transport [Peyré and Cuturi, ] /

Sinkhorn theorem Theorem (Richard Sinkhorn ( 6 )) For positive
matrix A ∈ Rm×n, there exists a solution (L, R) for matrix scaling. Can we ﬁnd it by efﬁcient algorithm? 6 /

Sinkhorn algorithm [Richard Sinkhorn, 6 ] W.l.o.g. assume that A
m = n/n. A( ) = A A( k+ ) = m Diag(A( k) n)− A( k), A( k+ ) = n A( k+ ) Diag((A( k+ ) m)− . Theorem (Richard Sinkhorn ( 6 )) If A is a positive matrix, then there exists a solution and A(k) converges to a solution. /

Example A( ) = . . . . . .
. . . . . . . 8 . . . 6 . . . . 6 . 6 . . . 8 /

Example A( ) = . . . 6 . 8
. . 86 . . 8 . . . 6 . . . . . . 6 . . . . 6 . . 66 . 8 /

Example A( ) = . . . . . .
8 . . . 8 . . 6 . . 6 . . . 8 6 . . 6 . 88 . . 6 8 . . . 8 /

Example A( ) = . . 6 . 6 .
. . 8 . . . . . . . 6 . 6 8 . . 8 . . 6 . 886 . . 6 . . . 8 /

Example A( 8) = . . . . . .
. . 6 . . . . 6 . 6 . 68 . . 8 6 . . . 88 . . 6 8 . 8 . 8 . 8 /

Matrix scaling: History s Kruithof telephone forecast s Deming, Stephen
statistcs 6 s Stone economics (RAS method) Sinkhorn formulation of “matrix scaling” Knopp Sinkhorn algorithm s Csiszár information theory s Wigderson et al. computer science s Cuturi machine learning /

Application: Transportation problem / / / / / / cost:
Cij i j /

Application: Transportation problem / / / / / / /6
/6 /6 /6 / cost = (C + C + C + C ) · 6 + C · . /

Application: Transportation problem / / / / / / /6
/6 /6 /6 / cost = (C + C + C + C ) · 6 + C · . Find min cost transportation /

Application: Transportation problem C ∈ Rm×n + : cost matrix
min P≥O C, P s.t. P n = m m , P m = n n /m /n /m /n /m /n Cij i j /

Application: Transportation problem C ∈ Rm×n + : cost matrix
min P≥O C, P s.t. P n = m m , P m = n n /m /n /m /n /m /n Cij i j Entropic regularization [Wilson, 6 ] min P≥O C, P − i,j Pij log(Pij) s.t. P n = m m , P m = n n −→ The optimal solution is unique and in the form of P = LAR, where L, R: nonnegative diagonal and Aij = exp(−Cij/ ) matrix scaling! • Heavily used in ML (Wasserstein distance) [Peyré and Cuturi, ] /

Approximate scaling Input: nonnegative matrix A ∈ Rm×n + ,
> Find: positive diagonal matrices L ∈ Rm ++ , R ∈ Rn ++ s.t. (L AR ) n − m m < and (L AR ) m − n n < • A is approx scalable def ⇐⇒ has solution for ∀ > . /

Approximate scaling and bipartite matching A =   
   . . . .8 . .       a b c support graph a b c perfect matching Theorem (R. Sinkhorn and Knopp ( 6 )) A is approx scalable ⇐⇒ support graph has perfect mathing Linear algebra Graph theory /

Operator scaling /

Operator scaling Noncommutative/quantum generalization of matrix scaling. Applications • Edmonds
problem [Gurvits, ; Ivanyos, Qiao, and Subrahmanyam, ; Ivanyos, Qiao, and Subrahmanyam, 8; Garg et al., ] • Brascamp–Lieb inequalities [Garg et al., 8] • Quantum Schrödinger bridge [Georgiou and Pavon, ] • Multivariate scatter estimation [Franks and Moitra, ] • Computational invariant theory [Allen-Zhu et al., 8]. /

Operator scaling • A linear map Φ : Cn×n →
Cm×m is completely positive (CP) if Φ(X) = k i= AiXA† i for some A , ... , Ak ∈ Cm×n. • The dual map of the above CP map is Φ∗(X) = k i= A† i XAi • For nonsingular Hermitian matrices L, R, the scaled map ΦL,R is ΦL,R(X) = LΦ(R†XR)L† 6 /

Operator scaling Input: CP map Φ : Cn×n → Cm×m
Output: nonsingular Hermitian matrices L, R s.t. ΦL,R(In) = Im m and Φ∗ L,R (Im) = In n . Note: Changed constants from Gurvits’ original “doubly stochastic” formulation /

Approximate scaling Input: CP map Φ : Cn×n → Cm×m,
> Find: nonsingular Hermitian matrices L , R s.t. ΦL ,R (In) − Im m < and Φ∗ L ,R (Im) − In n < . • Φ is approx scalable def ⇐⇒ has solution for ∀ > . 8 /

Matrix scaling ⊆ Operator scaling For A ∈ Rm×n +
, deﬁne Aij = √ aij ei e† j for i, j and CP map Φ(X) = i,j AijXA† ij = i,j aij ei e† j Xej e† i . Then, Φ(I) = i,j aij ei e† j ej e† i = i j aij ei e† i = Diag(A n) Φ∗(I) = i,j aij ej e† i ei e† j = j i aij ej e† j = Diag(A m) /

Operator Sinkhorn algorithm [Gurvits, ] W.l.o.g. assume that Φ∗(Im) =
In/n. Φ( ) = Φ Φ( k+ ) = Φ( k) L,In where L = √ m Φ( k) (In)− / , Φ( k+ ) = Φ( k+ ) Im,R where R = √ n (Φ( k+ ))∗(Im)− / . Theorem (Gurvits ( )) If Φ is approximately scalable, Φ(k) convergences to a solution. /

Application: Edmonds problem Given: A = x A + ·
· · + xk Ak (A , ... , Ak : matrices, x , ... , xk : scalar variables) Determine: det(A) = ? • If one can use randomness, it is easy! Can we do deterministically? −→ P vs BPP (major open problem in complexity theory) • Deep connection to combinatorial optimization and complexity theory [Edmonds, 6 ; Lovász, 8 ; Murota, ] • For some wide class of A (and over C), one can solve it by operator scaling [Gurvits, ] /

Application: Edmonds problem A =     
 x x x x x x       a b c support graph a b c perfect matching Theorem (Edmonds ( 6 )) Assume that A , ... , Ak have only one nonzero entries. Then, det(A) ≠ ⇐⇒ support graph has perfect mathing Linear algebra Graph theory /

Application: Edmonds problem Theorem (Lovász’s weak duality) For A =
x A + · · · + xk Ak , rank A ≤ min {n − dim U + dim( i AiU) : U ≤ Cn subspace} • If A , ... , Ak are rank-one, the equality holds. Furthermore, one can compute rank A and minimizer U by linear matroid intersection. • Lovász ( 8 ) also gave several families of A s.t. one can compute rank A via combinatorial optimization. Linear algebra Combinatorial optimization /

Application: Edmonds problem Theorem (Gurvits ( )) For A =
x A + · · · + xk Ak , consider CP map Φ(X) = k i= Ak XA† k . Then, [∃X O s.t. rank Φ(X) < rank X] =⇒ det A = Furtheremore, the former condition can be efﬁciently checked by operator scaling. • Gurvits ( ) gave several classes of A s.t. the converse is also true. /

Noncommutative rank Lovász’s weak duality has deep connections to algebra...
• noncommutative rank (nc-rank) is the rank of A = A x + · · · + Ak xk where xi are noncommutative variables (i.e., xixj ≠ xjxi) • Highly nontrivial to deﬁne! (needs deep algebra, e.g. free-skew ﬁelds, matrix ring, ...) Theorem (Amitsur ( 66), Cohn ( ), Gurvits ( ), and Garg et al. ( )) nc-rank of A = min{n − dim U + dim( i AiU) : U ≤ Cn subspace} Furthermore, nc-rank can be computed via operator scaling. Noncommutative Edmonds problem is in P!! /

Application: Brascamp-Lieb inequalities B = (B , ... , Bk
): tuple of matrices, Bi: ni × n, p = (p , ... , pk ) ∈ Rk ++ Theorem (Brascamp and Lieb ( 6)) There exists C ∈ ( , +∞] s.t. ∫ Rn k i= (fi(Bix))pi dx ≤ C · k i= ∫ Rni (fi(Bixi))dxi pi for all integrable nonnegative functions f , ... , fk . Includes Hölder, Loomis–Whitney, etc. 6 /

BL datum and operator scaling Given B = (B ,
... , Bk ) and p, let BL(B, p) be the smallest C s.t. BL-inequality holds. Theorem (Lieb ( )) BL(B, p) = sup X ,...,Xk O k i= det(Xi) det( m i= piB† i XiBi) / BL(B, p) can be computed by operator scaling [Garg et al., 8] /

Summary 8 /

Summary of Part I • Scaling problems are linear algeraic
problems arising in surprisingly many ﬁelds. • Matrix scaling: optimal transport, statistics, machine learning, combinatorial optimization, ... • Operator scaling: combinatorial optimization, computational complexity, noncommutative algebra, analysis, ... • Both can be solved by simple alternating algorithm (Sinkhorn) /

Information Geometry of Operator Scaling PART I...

Information Geometry of Operator Scaling PART I: survey of scaling problems

More Decks by Tasuku Soma

Featured

Transcript