Information Geometry of Operator Scaling PART I: survey of scaling problems


Tasuku Soma

June 17, 2020

Transcript

  1. Information Geometry of Operator Scaling PART I: survey of scaling

    problems. Takeru Matsuda (UTokyo, RIKEN), Tasuku Soma (UTokyo). June 17, 2020.
  2. Talk overview Part I: survey of scaling problems • What

    is a scaling problem? • Matrix scaling and operator scaling, with applications Part II: information geometry and scaling • My recent work (with Takeru Matsuda) • Operator scaling and quantum information geometry
  3. Summary of Part I • Scaling problems are linear-algebraic

    problems arising in surprisingly many fields. • Matrix scaling: optimal transport, statistics, machine learning, combinatorial optimization, ... • Operator scaling: combinatorial optimization, computational complexity, noncommutative algebra, analysis, ... • Both can be solved by a simple alternating algorithm (Sinkhorn)
  4. Matrix scaling Input: nonnegative matrix A ∈ R^{m×n}_+ Output:

    positive diagonal matrices L ∈ R^{m×m}_{++}, R ∈ R^{n×n}_{++} s.t. (LAR) 1_n = (1/m) 1_m and (LAR)^⊤ 1_m = (1/n) 1_n Applications • Markov chain estimation [Sinkhorn, 1964] • Contingency table analysis [Morioka and Tsuda] • Optimal transport [Peyré and Cuturi, 2019]
  5. Sinkhorn theorem Theorem (Sinkhorn (1964)) For any positive

    matrix A ∈ R^{m×n}, there exists a solution (L, R) to the matrix scaling problem. Can we find it with an efficient algorithm?
  6. Sinkhorn algorithm [Sinkhorn, 1964] W.l.o.g. assume that A^⊤ 1_m

    = (1/n) 1_n. A^(0) = A, A^(2k+1) = (1/m) Diag(A^(2k) 1_n)^{−1} A^(2k), A^(2k+2) = (1/n) A^(2k+1) Diag((A^(2k+1))^⊤ 1_m)^{−1}. Theorem (Sinkhorn (1964)) If A is a positive matrix, then a solution exists and A^(k) converges to a solution.
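The alternating update above can be sketched in NumPy. This is a minimal illustration, assuming a dense positive matrix and a fixed iteration count instead of a convergence test; the function name `sinkhorn` and the random example are ours, not from the slides.

```python
import numpy as np

def sinkhorn(A, iters=500):
    """Sinkhorn's alternating normalization: scale a positive matrix A
    so that every row sums to 1/m and every column sums to 1/n."""
    m, n = A.shape
    ell = np.ones(m)  # diagonal entries of L
    r = np.ones(n)    # diagonal entries of R
    for _ in range(iters):
        S = ell[:, None] * A * r[None, :]
        ell /= m * S.sum(axis=1)   # row step: force row sums to 1/m
        S = ell[:, None] * A * r[None, :]
        r /= n * S.sum(axis=0)     # column step: force column sums to 1/n
    return np.diag(ell), np.diag(r)

rng = np.random.default_rng(0)
A = rng.uniform(0.5, 2.0, size=(3, 4))
L, R = sinkhorn(A)
S = L @ A @ R  # approximately scaled matrix
```

After the final column step the column sums are exact up to floating point, while the row sums converge geometrically for positive matrices.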
  7. Example (slides 7–11): successive numerical iterates of the Sinkhorn

    algorithm, ending at A^(8); the matrix entries did not survive transcription.
  12. Matrix scaling: History 1930s Kruithof, telephone traffic forecasting; 1940s

    Deming and Stephan, statistics; 1960s Stone, economics (RAS method); 1964 Sinkhorn, formulation of “matrix scaling”; 1967 Sinkhorn–Knopp algorithm; 1970s Csiszár, information theory; 2000s Wigderson et al., computer science; 2010s Cuturi, machine learning
  13. Application: Transportation problem (slides 13–14): a figure showing

    supplies 1/m, demands 1/n, per-edge costs C_ij, and the cost Σ_{i,j} C_ij P_ij of one transport plan (the numerical values did not survive transcription). Goal: find the minimum-cost transportation.
  15. Application: Transportation problem C ∈ R^{m×n}_+: cost matrix

    min_{P ≥ O} ⟨C, P⟩ s.t. P 1_n = (1/m) 1_m, P^⊤ 1_m = (1/n) 1_n (figure: bipartite graph with supplies 1/m, demands 1/n, edge costs C_ij)
  16. Application: Transportation problem C ∈ R^{m×n}_+: cost matrix

    min_{P ≥ O} ⟨C, P⟩ s.t. P 1_n = (1/m) 1_m, P^⊤ 1_m = (1/n) 1_n Entropic regularization [Wilson, 1969] min_{P ≥ O} ⟨C, P⟩ − ε Σ_{i,j} P_ij log(P_ij) s.t. P 1_n = (1/m) 1_m, P^⊤ 1_m = (1/n) 1_n −→ The optimal solution is unique and of the form P = LAR, where L, R are positive diagonal and A_ij = exp(−C_ij/ε): matrix scaling! • Heavily used in ML (Wasserstein distance) [Peyré and Cuturi, 2019]
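The reduction to matrix scaling can be tried numerically. This is a sketch assuming uniform marginals 1_m/m and 1_n/n and a squared-distance cost between points on a line; the function name, the regularization strength, and the example cost are ours, not from the slides.

```python
import numpy as np

def entropic_transport(C, eps=0.5, iters=1000):
    """Entropy-regularized optimal transport with uniform marginals,
    solved by Sinkhorn iterations on the Gibbs kernel exp(-C/eps)."""
    m, n = C.shape
    A = np.exp(-C / eps)   # the matrix being scaled
    u = np.ones(m)
    v = np.ones(n)
    for _ in range(iters):
        u = (1.0 / m) / (A @ v)     # enforce row marginal 1/m
        v = (1.0 / n) / (A.T @ u)   # enforce column marginal 1/n
    return u[:, None] * A * v[None, :]

# hypothetical cost: squared distances between points on [0, 1]
x = np.linspace(0.0, 1.0, 4)
y = np.linspace(0.0, 1.0, 5)
C = (x[:, None] - y[None, :]) ** 2
P = entropic_transport(C)
cost = (C * P).sum()  # regularized transport cost
```

Smaller `eps` gives a plan closer to the unregularized optimum but slows Sinkhorn convergence, since the kernel becomes more ill-conditioned.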
  17. Approximate scaling Input: nonnegative matrix A ∈ R^{m×n}_+,

    ε > 0 Find: positive diagonal matrices L_ε ∈ R^{m×m}_{++}, R_ε ∈ R^{n×n}_{++} s.t. ‖(L_ε A R_ε) 1_n − (1/m) 1_m‖ < ε and ‖(L_ε A R_ε)^⊤ 1_m − (1/n) 1_n‖ < ε • A is approx scalable (by definition) ⇐⇒ a solution exists for every ε > 0.
  18. Approximate scaling and bipartite matching A = (a 3×3 nonnegative

    matrix whose entries did not survive transcription), shown with its support graph on rows and columns {a, b, c} and a perfect matching. Theorem (Sinkhorn and Knopp (1967)) A is approx scalable ⇐⇒ the support graph has a perfect matching. Linear algebra ↔ Graph theory
  19. Operator scaling Noncommutative/quantum generalization of matrix scaling. Applications • Edmonds

    problem [Gurvits, 2004; Ivanyos, Qiao, and Subrahmanyam, 2017; Ivanyos, Qiao, and Subrahmanyam, 2018; Garg et al., 2016] • Brascamp–Lieb inequalities [Garg et al., 2018] • Quantum Schrödinger bridge [Georgiou and Pavon, 2015] • Multivariate scatter estimation [Franks and Moitra, 2020] • Computational invariant theory [Allen-Zhu et al., 2018]
  20. Operator scaling • A linear map Φ : C^{n×n} →

    C^{m×m} is completely positive (CP) if Φ(X) = Σ_{i=1}^k A_i X A_i† for some A_1, ... , A_k ∈ C^{m×n}. • The dual map of the above CP map is Φ*(X) = Σ_{i=1}^k A_i† X A_i • For nonsingular Hermitian matrices L, R, the scaled map Φ_{L,R} is Φ_{L,R}(X) = L Φ(R† X R) L†
  21. Operator scaling Input: CP map Φ : C^{n×n} → C^{m×m}

    Output: nonsingular Hermitian matrices L, R s.t. Φ_{L,R}(I_n) = (1/m) I_m and Φ*_{L,R}(I_m) = (1/n) I_n. Note: constants changed from Gurvits’ original “doubly stochastic” formulation
  22. Approximate scaling Input: CP map Φ : C^{n×n} → C^{m×m},

    ε > 0 Find: nonsingular Hermitian matrices L_ε, R_ε s.t. ‖Φ_{L_ε,R_ε}(I_n) − (1/m) I_m‖ < ε and ‖Φ*_{L_ε,R_ε}(I_m) − (1/n) I_n‖ < ε. • Φ is approx scalable (by definition) ⇐⇒ a solution exists for every ε > 0.
  23. Matrix scaling ⊆ Operator scaling For A ∈ R^{m×n}_+,

    define A_ij = √(a_ij) e_i e_j† for all i, j and the CP map Φ(X) = Σ_{i,j} A_ij X A_ij† = Σ_{i,j} a_ij e_i e_j† X e_j e_i†. Then, Φ(I) = Σ_{i,j} a_ij e_i e_j† e_j e_i† = Σ_i (Σ_j a_ij) e_i e_i† = Diag(A 1_n) and Φ*(I) = Σ_{i,j} a_ij e_j e_i† e_i e_j† = Σ_j (Σ_i a_ij) e_j e_j† = Diag(A^⊤ 1_m)
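The embedding can be verified numerically. The sketch below builds the Kraus operators A_ij = √(a_ij) e_i e_j† and checks the two identities above; the function names are ours, not from the slides.

```python
import numpy as np

def embed(A):
    """Kraus operators A_ij = sqrt(a_ij) e_i e_j^T embedding matrix
    scaling of A into operator scaling."""
    m, n = A.shape
    ops = []
    for i in range(m):
        for j in range(n):
            K = np.zeros((m, n))
            K[i, j] = np.sqrt(A[i, j])
            ops.append(K)
    return ops

def Phi(ops, X):
    """CP map: sum_i K X K^dagger."""
    return sum(K @ X @ K.conj().T for K in ops)

def Phi_dual(ops, X):
    """Dual map: sum_i K^dagger X K."""
    return sum(K.conj().T @ X @ K for K in ops)

A = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
ops = embed(A)
```

Applying Phi to I_n recovers the diagonal matrix of row sums of A, and Phi_dual applied to I_m recovers the column sums, exactly as on the slide.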
  24. Operator Sinkhorn algorithm [Gurvits, 2004] W.l.o.g. assume that Φ*(I_m) =

    (1/n) I_n. Φ^(0) = Φ, Φ^(2k+1) = (Φ^(2k))_{L, I_n} where L = (1/√m) (Φ^(2k)(I_n))^{−1/2}, Φ^(2k+2) = (Φ^(2k+1))_{I_m, R} where R = (1/√n) ((Φ^(2k+1))*(I_m))^{−1/2}. Theorem (Gurvits (2004)) If Φ is approximately scalable, then Φ^(k) converges to a solution.
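The operator Sinkhorn iteration can be sketched by rescaling the Kraus operators directly. This is a minimal version assuming a generic (hence scalable) random CP map and a fixed iteration count; all names are ours, not from the talk.

```python
import numpy as np

def inv_sqrt(X):
    """Inverse square root of a positive definite Hermitian matrix."""
    w, V = np.linalg.eigh(X)
    return (V / np.sqrt(w)) @ V.conj().T

def Phi(ops, X):
    return sum(K @ X @ K.conj().T for K in ops)

def Phi_dual(ops, X):
    return sum(K.conj().T @ X @ K for K in ops)

def operator_sinkhorn(ops, iters=1000):
    """Alternately normalize Phi(I_n) -> I_m/m (left step) and
    Phi*(I_m) -> I_n/n (right step) by scaling the Kraus operators
    K -> L K and K -> K R."""
    m, n = ops[0].shape
    for _ in range(iters):
        L = inv_sqrt(m * Phi(ops, np.eye(n)))        # left scaling
        ops = [L @ K for K in ops]
        R = inv_sqrt(n * Phi_dual(ops, np.eye(m)))   # right scaling
        ops = [K @ R for K in ops]
    return ops

rng = np.random.default_rng(1)
kraus = [rng.normal(size=(3, 3)) for _ in range(4)]  # a generic CP map
kraus = operator_sinkhorn(kraus)
```

Each left step makes Φ(I_n) exactly (1/m) I_m and each right step makes Φ*(I_m) exactly (1/n) I_n; for scalable maps the two conditions hold simultaneously in the limit.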
  25. Application: Edmonds problem Given: A = x_1 A_1 + ·

    · · + x_k A_k (A_1, ... , A_k: matrices, x_1, ... , x_k: scalar variables) Determine: det(A) = 0? • With randomness this is easy! Can we do it deterministically? −→ P vs BPP (a major open problem in complexity theory) • Deep connections to combinatorial optimization and complexity theory [Edmonds, 1967; Lovász, 1989; Murota, 2000] • For some wide classes of A (over C), one can solve it by operator scaling [Gurvits, 2004]
  26. Application: Edmonds problem A = (a 3×3 matrix whose entries are

    single variables x_1, ... , x_k or zeros; the exact entries did not survive transcription), shown with its support graph and a perfect matching. Theorem (Edmonds (1967)) Assume that each of A_1, ... , A_k has only one nonzero entry. Then, det(A) ≠ 0 ⇐⇒ the support graph has a perfect matching. Linear algebra ↔ Graph theory
  27. Application: Edmonds problem Theorem (Lovász’s weak duality) For A =

    x_1 A_1 + · · · + x_k A_k, rank A ≤ min {n − dim U + dim(Σ_i A_i U) : U ≤ C^n subspace} • If A_1, ... , A_k are rank-one, equality holds. Furthermore, one can compute rank A and a minimizer U by linear matroid intersection. • Lovász (1989) also gave several families of A for which rank A can be computed via combinatorial optimization. Linear algebra ↔ Combinatorial optimization
  28. Application: Edmonds problem Theorem (Gurvits (2004)) For A =

    x_1 A_1 + · · · + x_k A_k, consider the CP map Φ(X) = Σ_{i=1}^k A_i X A_i†. Then, [∃X ⪰ O s.t. rank Φ(X) < rank X] =⇒ det A = 0 Furthermore, the former condition can be efficiently checked by operator scaling. • Gurvits (2004) gave several classes of A for which the converse also holds.
  29. Noncommutative rank Lovász’s weak duality has deep connections to algebra...

    • The noncommutative rank (nc-rank) is the rank of A = A_1 x_1 + · · · + A_k x_k where the x_i are noncommutative variables (i.e., x_i x_j ≠ x_j x_i) • Highly nontrivial to define! (needs deep algebra, e.g. free skew fields, matrix rings, ...) Theorem (Amitsur (1966), Cohn, Gurvits (2004), and Garg et al. (2016)) nc-rank of A = min{n − dim U + dim(Σ_i A_i U) : U ≤ C^n subspace} Furthermore, nc-rank can be computed via operator scaling. The noncommutative Edmonds problem is in P!!
  30. Application: Brascamp–Lieb inequalities B = (B_1, ... , B_k):

    tuple of matrices, B_i: n_i × n, p = (p_1, ... , p_k) ∈ R^k_{++} Theorem (Brascamp and Lieb (1976)) There exists C ∈ (0, +∞] s.t. ∫_{R^n} Π_{i=1}^k (f_i(B_i x))^{p_i} dx ≤ C · Π_{i=1}^k ( ∫_{R^{n_i}} f_i(x_i) dx_i )^{p_i} for all integrable nonnegative functions f_1, ... , f_k. Includes Hölder, Loomis–Whitney, etc.
  31. BL datum and operator scaling Given B = (B_1,

    ... , B_k) and p, let BL(B, p) be the smallest C s.t. the BL inequality holds. Theorem (Lieb (1990)) BL(B, p) = sup_{X_1,...,X_k ≻ O} ( Π_{i=1}^k det(X_i)^{p_i} / det(Σ_{i=1}^k p_i B_i† X_i B_i) )^{1/2} BL(B, p) can be computed by operator scaling [Garg et al., 2018]
  32. Summary of Part I • Scaling problems are linear-algebraic

    problems arising in surprisingly many fields. • Matrix scaling: optimal transport, statistics, machine learning, combinatorial optimization, ... • Operator scaling: combinatorial optimization, computational complexity, noncommutative algebra, analysis, ... • Both can be solved by a simple alternating algorithm (Sinkhorn)