Information Geometry of Operator Scaling PART I: survey of scaling problems

Tasuku Soma

June 17, 2020
Transcript

  1. Information Geometry of Operator Scaling PART I: survey of scaling problems. Takeru Matsuda (UTokyo, RIKEN), Tasuku Soma (UTokyo). June 17, 2020.
  2. Talk overview. Part I: survey of scaling problems • What is a scaling problem? • Matrix scaling and operator scaling, with applications. Part II: information geometry and scaling • My recent work (with Takeru Matsuda) • Operator scaling and quantum information geometry.
  3. Summary of Part I • Scaling problems are linear algebraic problems arising in surprisingly many fields. • Matrix scaling: optimal transport, statistics, machine learning, combinatorial optimization, ... • Operator scaling: combinatorial optimization, computational complexity, noncommutative algebra, analysis, ... • Both can be solved by a simple alternating algorithm (Sinkhorn).
  4. Matrix scaling. Input: nonnegative matrix A ∈ R_+^{m×n}. Output: positive diagonal matrices L ∈ R_{++}^{m×m}, R ∈ R_{++}^{n×n} s.t. (LAR)𝟙_n = 𝟙_m/m and (LAR)^T 𝟙_m = 𝟙_n/n. Applications • Markov chain estimation [Sinkhorn, 1964] • Contingency table analysis [Morioka and Tsuda] • Optimal transport [Peyré and Cuturi, 2019]
  5. Sinkhorn theorem. Theorem (Sinkhorn (1964)): for a positive matrix A ∈ R^{m×n}, there exists a solution (L, R) for matrix scaling. Can we find it by an efficient algorithm?
  6. Sinkhorn algorithm [Sinkhorn, 1964]. W.l.o.g. assume that A^T 𝟙_m = 𝟙_n/n. Iterate A^(0) = A, A^(2k+1) = (1/m) Diag(A^(2k) 𝟙_n)^{−1} A^(2k), A^(2k+2) = (1/n) A^(2k+1) Diag((A^(2k+1))^T 𝟙_m)^{−1}. Theorem (Sinkhorn (1964)): if A is a positive matrix, then there exists a solution and A^(k) converges to a solution.
  7.–11. Example: successive Sinkhorn iterates of a small positive matrix, shown up to A^(8); the row and column sums visibly approach the target marginals. (Numerical entries omitted.)
  12. Matrix scaling: History. 1930s Kruithof: telephone traffic forecasting. 1940s Deming and Stephan: statistics. 1960s Stone: economics (RAS method); Sinkhorn: formulation of “matrix scaling”; Sinkhorn and Knopp: Sinkhorn algorithm. 1970s Csiszár: information theory. 2000s Wigderson et al.: computer science. 2010s Cuturi: machine learning.
  13. Application: Transportation problem. (Figure: bipartite transport diagram with uniform supplies and demands given as fractions; the cost of the drawn plan is a weighted sum of entries of C.)
  14. Application: Transportation problem. (Same figure.) Goal: find the minimum-cost transportation.
  15. Application: Transportation problem. C ∈ R_+^{m×n}: cost matrix. min_{P≥O} ⟨C, P⟩ s.t. P𝟙_n = 𝟙_m/m, P^T 𝟙_m = 𝟙_n/n. (Figure: bipartite graph with supplies 1/m, demands 1/n, and edge cost C_ij between source i and sink j.)
  16. Application: Transportation problem. C ∈ R_+^{m×n}: cost matrix. min_{P≥O} ⟨C, P⟩ s.t. P𝟙_n = 𝟙_m/m, P^T 𝟙_m = 𝟙_n/n. Entropic regularization [Wilson, 1969]: min_{P≥O} ⟨C, P⟩ − ε H(P), where H(P) = −Σ_{i,j} P_ij log(P_ij), s.t. P𝟙_n = 𝟙_m/m, P^T 𝟙_m = 𝟙_n/n. −→ The optimal solution is unique and of the form P = LAR, where L, R are nonnegative diagonal and A_ij = exp(−C_ij/ε): matrix scaling! • Heavily used in ML (Wasserstein distance) [Peyré and Cuturi, 2019]
  17. Approximate scaling. Input: nonnegative matrix A ∈ R_+^{m×n}, ε > 0. Find: positive diagonal matrices L_ε ∈ R_{++}^{m×m}, R_ε ∈ R_{++}^{n×n} s.t. ‖(L_ε A R_ε)𝟙_n − 𝟙_m/m‖ < ε and ‖(L_ε A R_ε)^T 𝟙_m − 𝟙_n/n‖ < ε. • A is approx scalable ⇐⇒(def) a solution exists for every ε > 0.
  18. Approximate scaling and bipartite matching. (Example: a 3×3 nonnegative matrix with zeros, its support graph on rows/columns a, b, c, and a perfect matching in that graph; entries omitted.) Theorem (Sinkhorn and Knopp (1967)): A is approx scalable ⇐⇒ the support graph has a perfect matching. Linear algebra ↔ Graph theory.
  19. Operator scaling. Noncommutative/quantum generalization of matrix scaling. Applications • Edmonds problem [Gurvits, 2004; Ivanyos, Qiao, and Subrahmanyam, 2018; Garg et al.] • Brascamp–Lieb inequalities [Garg et al., 2018] • Quantum Schrödinger bridge [Georgiou and Pavon] • Multivariate scatter estimation [Franks and Moitra] • Computational invariant theory [Allen-Zhu et al., 2018].
  20. Operator scaling • A linear map Φ : C^{n×n} → C^{m×m} is completely positive (CP) if Φ(X) = Σ_{i=1}^k A_i X A_i^† for some A_1, ..., A_k ∈ C^{m×n}. • The dual map of the above CP map is Φ*(X) = Σ_{i=1}^k A_i^† X A_i. • For nonsingular Hermitian matrices L, R, the scaled map Φ_{L,R} is Φ_{L,R}(X) = L Φ(R^† X R) L^†.
  21. Operator scaling. Input: CP map Φ : C^{n×n} → C^{m×m}. Output: nonsingular Hermitian matrices L, R s.t. Φ_{L,R}(I_n) = I_m/m and Φ*_{L,R}(I_m) = I_n/n. Note: constants changed from Gurvits’ original “doubly stochastic” formulation.
  22. Approximate scaling. Input: CP map Φ : C^{n×n} → C^{m×m}, ε > 0. Find: nonsingular Hermitian matrices L_ε, R_ε s.t. ‖Φ_{L_ε,R_ε}(I_n) − I_m/m‖ < ε and ‖Φ*_{L_ε,R_ε}(I_m) − I_n/n‖ < ε. • Φ is approx scalable ⇐⇒(def) a solution exists for every ε > 0.
  23. Matrix scaling ⊆ Operator scaling. For A ∈ R_+^{m×n}, define A_ij = √(a_ij) e_i e_j^† for each i, j, and the CP map Φ(X) = Σ_{i,j} A_ij X A_ij^† = Σ_{i,j} a_ij e_i e_j^† X e_j e_i^†. Then Φ(I) = Σ_{i,j} a_ij e_i e_j^† e_j e_i^† = Σ_i (Σ_j a_ij) e_i e_i^† = Diag(A𝟙_n) and Φ*(I) = Σ_{i,j} a_ij e_j e_i^† e_i e_j^† = Σ_j (Σ_i a_ij) e_j e_j^† = Diag(A^T 𝟙_m).
  24. Operator Sinkhorn algorithm [Gurvits, 2004]. W.l.o.g. assume that Φ*(I_m) = I_n/n. Iterate Φ^(0) = Φ, Φ^(2k+1) = (Φ^(2k))_{L,I_n} where L = (1/√m) Φ^(2k)(I_n)^{−1/2}, Φ^(2k+2) = (Φ^(2k+1))_{I_m,R} where R = (1/√n) ((Φ^(2k+1))*(I_m))^{−1/2}. Theorem (Gurvits (2004)): if Φ is approximately scalable, Φ^(k) converges to a solution.
  25. Application: Edmonds problem. Given: A = x_1 A_1 + · · · + x_k A_k (A_1, ..., A_k: matrices, x_1, ..., x_k: scalar variables). Determine: det(A) ≠ 0? • If one can use randomness, it is easy! Can we do it deterministically? −→ P vs BPP (major open problem in complexity theory) • Deep connection to combinatorial optimization and complexity theory [Edmonds, 1967; Lovász, 1989; Murota] • For some wide class of A (and over C), one can solve it by operator scaling [Gurvits, 2004]
  26. Application: Edmonds problem. (Example: a 3×3 matrix A whose entries are single variables x_i or zero, its support graph on a, b, c, and a perfect matching.) Theorem (Edmonds (1967)): assume that A_1, ..., A_k each have only one nonzero entry. Then det(A) ≠ 0 ⇐⇒ the support graph has a perfect matching. Linear algebra ↔ Graph theory.
  27. Application: Edmonds problem. Theorem (Lovász’s weak duality): for A = x_1 A_1 + · · · + x_k A_k, rank A ≤ min {n − dim U + dim(Σ_i A_i U) : U ≤ C^n subspace}. • If A_1, ..., A_k are rank-one, equality holds. Furthermore, one can compute rank A and a minimizer U by linear matroid intersection. • Lovász (1989) also gave several families of A s.t. one can compute rank A via combinatorial optimization. Linear algebra ↔ Combinatorial optimization.
  28. Application: Edmonds problem. Theorem (Gurvits (2004)): for A = x_1 A_1 + · · · + x_k A_k, consider the CP map Φ(X) = Σ_{i=1}^k A_i X A_i^†. Then [∃X ⪰ O s.t. rank Φ(X) < rank X] =⇒ det A = 0. Furthermore, the former condition can be efficiently checked by operator scaling. • Gurvits (2004) gave several classes of A s.t. the converse is also true.
  29. Noncommutative rank. Lovász’s weak duality has deep connections to algebra... • The noncommutative rank (nc-rank) is the rank of A = A_1 x_1 + · · · + A_k x_k where the x_i are noncommutative variables (i.e., x_i x_j ≠ x_j x_i) • Highly nontrivial to define! (needs deep algebra, e.g. free skew fields, matrix rings, ...) Theorem (Amitsur (1966), Cohn, Gurvits, and Garg et al.): nc-rank of A = min{n − dim U + dim(Σ_i A_i U) : U ≤ C^n subspace}. Furthermore, nc-rank can be computed via operator scaling. The noncommutative Edmonds problem is in P!!
  30. Application: Brascamp–Lieb inequalities. B = (B_1, ..., B_k): tuple of matrices, B_i: n_i × n, p = (p_1, ..., p_k) ∈ R_{++}^k. Theorem (Brascamp and Lieb (1976)): there exists C ∈ (0, +∞] s.t. ∫_{R^n} Π_{i=1}^k (f_i(B_i x))^{p_i} dx ≤ C · Π_{i=1}^k ( ∫_{R^{n_i}} f_i(x_i) dx_i )^{p_i} for all integrable nonnegative functions f_1, ..., f_k. Includes Hölder, Loomis–Whitney, etc.
  31. BL datum and operator scaling. Given B = (B_1, ..., B_k) and p, let BL(B, p) be the smallest C s.t. the BL inequality holds. Theorem (Lieb (1990)): BL(B, p) = sup_{X_1,...,X_k ≻ O} ( Π_{i=1}^k det(X_i)^{p_i} / det(Σ_{i=1}^k p_i B_i^† X_i B_i) )^{1/2}. BL(B, p) can be computed by operator scaling [Garg et al., 2018].
  32. Summary of Part I • Scaling problems are linear algebraic problems arising in surprisingly many fields. • Matrix scaling: optimal transport, statistics, machine learning, combinatorial optimization, ... • Operator scaling: combinatorial optimization, computational complexity, noncommutative algebra, analysis, ... • Both can be solved by a simple alternating algorithm (Sinkhorn).