Information Geometry of Operator Scaling Part II: information geometry and scaling

Tasuku Soma

June 17, 2020
Transcript

  1. Information Geometry of Operator Scaling Part II: information geometry and scaling
     Takeru Matsuda (UTokyo, RIKEN), Tasuku Soma (UTokyo). July 8
  2. Matrix scaling
     Input: nonnegative matrix A ∈ R^{m×n}_+
     Output: positive diagonal matrices L ∈ R^{m×m}_{++}, R ∈ R^{n×n}_{++} s.t. (LAR) 1_n = (1/m) 1_m and (LAR)^T 1_m = (1/n) 1_n
     Applications
     • Markov chain estimation [Sinkhorn, 1964]
     • Contingency table analysis [Morioka and Tsuda]
     • Optimal transport [Peyré and Cuturi, 2019]
  3. Sinkhorn algorithm [Sinkhorn, 1964]
     W.l.o.g. assume that A^T 1_m = (1/n) 1_n.
     A^(0) = A,
     A^(2k+1) = (1/m) Diag(A^(2k) 1_n)^{-1} A^(2k),
     A^(2k+2) = (1/n) A^(2k+1) Diag((A^(2k+1))^T 1_m)^{-1}.
     Theorem (Sinkhorn (1964)). If A is a positive matrix, then a solution exists and A^(k) converges to a solution.
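A minimal NumPy sketch of this iteration (the function name and the iteration count are illustrative choices, not from the slides):

```python
import numpy as np

def sinkhorn(A, iters=500):
    """Alternately scale rows to sum 1/m and columns to sum 1/n."""
    m, n = A.shape
    B = A.astype(float).copy()
    for _ in range(iters):
        B = B / (m * B.sum(axis=1, keepdims=True))   # A^(2k+1)
        B = B / (n * B.sum(axis=0, keepdims=True))   # A^(2k+2)
    return B

rng = np.random.default_rng(0)
A = rng.random((3, 4)) + 0.1        # positive, so Sinkhorn's theorem applies
B = sinkhorn(A)
print(np.allclose(B.sum(axis=1), 1/3, atol=1e-8),
      np.allclose(B.sum(axis=0), 1/4, atol=1e-8))
```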
  4. Sinkhorn = alternating e-projection
     Kullback–Leibler divergence: D_KL(B ‖ A) = Σ_{i,j} B_ij log(B_ij / A_ij)
     Theorem (Csiszár (1975)). Sinkhorn's iterates A^(k) satisfy
     A^(2k+1) = argmin{ D_KL(B ‖ A^(2k)) : B 1_n = (1/m) 1_m },
     A^(2k+2) = argmin{ D_KL(B ‖ A^(2k+1)) : B^T 1_m = (1/n) 1_n }.
     In information geometry, this is alternating e-projection w.r.t. the Fisher metric.
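A quick numerical check of the first argmin characterization (a sketch; the Dirichlet sampling of feasible matrices is an illustrative choice): row normalization should beat any other B with row sums 1/m.

```python
import numpy as np

def d_kl(B, A):
    return float(np.sum(B * np.log(B / A)))

rng = np.random.default_rng(1)
m, n = 3, 4
A = rng.random((m, n)) + 0.1
A1 = A / (m * A.sum(axis=1, keepdims=True))   # one Sinkhorn step: B 1_n = (1/m) 1_m
for _ in range(100):                          # random feasible B with row sums 1/m
    B = np.stack([rng.dirichlet(np.ones(n)) / m for _ in range(m)])
    assert d_kl(B, A) >= d_kl(A1, A) - 1e-12
print("row normalization minimizes D_KL over the row-constraint set")
```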
  5. Operator scaling
     • A linear map Φ : C^{n×n} → C^{m×m} is completely positive (CP) if Φ(X) = Σ_{i=1}^k A_i X A_i† for some A_1, ..., A_k ∈ C^{m×n}.
     • The dual map of the above CP map is Φ*(X) = Σ_{i=1}^k A_i† X A_i.
     • For nonsingular Hermitian matrices L, R, the scaled map Φ_{L,R} is Φ_{L,R}(X) = L Φ(R† X R) L†.
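In code, a CP map, its dual, and the scaled map are all determined by the Kraus operators A_1, ..., A_k (a sketch; note that the Kraus operators of Φ_{L,R} are L A_i R†):

```python
import numpy as np

def cp_map(kraus, X):
    """Phi(X) = sum_i A_i X A_i^dag : C^{n x n} -> C^{m x m}."""
    return sum(A @ X @ A.conj().T for A in kraus)

def cp_dual(kraus, X):
    """Phi*(X) = sum_i A_i^dag X A_i : C^{m x m} -> C^{n x n}."""
    return sum(A.conj().T @ X @ A for A in kraus)

def scaled_map(kraus, L, R):
    """Kraus operators of Phi_{L,R}(X) = L Phi(R^dag X R) L^dag."""
    return [L @ A @ R.conj().T for A in kraus]
```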
  6. Operator scaling
     Input: CP map Φ : C^{n×n} → C^{m×m}
     Output: nonsingular Hermitian matrices L, R s.t. Φ_{L,R}(I_n) = (1/m) I_m and Φ*_{L,R}(I_m) = (1/n) I_n.
     Note: constants changed from Gurvits' original "doubly stochastic" formulation.
  7. Operator Sinkhorn algorithm [Gurvits, 2004]
     W.l.o.g. assume that Φ*(I_m) = (1/n) I_n.
     Φ^(0) = Φ,
     Φ^(2k+1) = (Φ^(2k))_{L, I_n} where L = (1/√m) (Φ^(2k)(I_n))^{-1/2},
     Φ^(2k+2) = (Φ^(2k+1))_{I_m, R} where R = (1/√n) ((Φ^(2k+1))*(I_m))^{-1/2}.
     Under reasonable conditions, Φ^(k) converges to a solution [Gurvits, 2004].
     Can we view this as "alternating e-projection"?
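A sketch of this iteration on Kraus operators, reusing cp_map, cp_dual, and scaled_map from above (fractional_matrix_power computes the inverse square roots; the iteration count and tolerance are illustrative):

```python
import numpy as np
from scipy.linalg import fractional_matrix_power

def operator_sinkhorn(kraus, iters=300):
    m, n = kraus[0].shape
    for _ in range(iters):
        L = fractional_matrix_power(cp_map(kraus, np.eye(n)), -0.5) / np.sqrt(m)
        kraus = scaled_map(kraus, L, np.eye(n))      # Phi^(2k+1)
        R = fractional_matrix_power(cp_dual(kraus, np.eye(m)), -0.5) / np.sqrt(n)
        kraus = scaled_map(kraus, np.eye(m), R)      # Phi^(2k+2)
    return kraus

rng = np.random.default_rng(2)
ops = [rng.standard_normal((3, 3)) for _ in range(4)]
ops = operator_sinkhorn(ops)
print(np.allclose(cp_map(ops, np.eye(3)), np.eye(3) / 3, atol=1e-6),
      np.allclose(cp_dual(ops, np.eye(3)), np.eye(3) / 3, atol=1e-6))
```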
  8. Our result
     Theorem (Matsuda and S., 2020). Operator Sinkhorn is alternating e-projection w.r.t. the symmetric logarithmic derivative (SLD) metric on positive definite matrices.
     • Quantum generalization of [Csiszár, 1975]
  9. Information geometry
     Statistical theory using differential geometry [Amari and Nagaoka, 2000].
     Nonmetrical dual connections play a central role (i.e., different from the Levi-Civita connection).
     Key components: metric tensor g and two affine connections ∇^(m), ∇^(e) −→ induce two geodesics (m/e-geodesics).
  10. Information geometry on S_{n−1}
      S_{n−1} = { p = (p_1, ..., p_n) : p_k > 0, Σ_{k=1}^n p_k = 1 } ⊂ R^n_{++}
      Probability simplex of dimension n − 1.
      We will introduce a Riemannian structure on S_{n−1} with metric tensor g and affine connections ∇^(m), ∇^(e).
  11. Fisher metric
      Two coordinate systems on S_{n−1}:
      m-coordinate (mixture): (p_1, ..., p_{n−1})
      e-coordinate (exponential): (log(p_1/p_n), ..., log(p_{n−1}/p_n))
      Fisher metric:
      g_p(X, Y) = Σ_i X(log p_i) Y(p_i) = Σ_i X_i^(e) Y_i^(m)   (X, Y ∈ T_p(S_{n−1}): tangent vectors),
      where X = Σ_i X_i^(e) (∂_i^(e))_p and Y = Σ_i Y_i^(m) (∂_i^(m))_p are the expansions in the e- and m-coordinate bases.
      Note: the Fisher metric is the unique metric satisfying natural statistical invariance (Cencov's theorem).
  12. Dual connections
      Take ∇^(m), ∇^(e) s.t. the Christoffel symbols in m-/e-coordinates vanish, respectively.
      m-geodesic: γ(t) = (1 − t)p + tq.
      e-geodesic: γ(t) ∝ exp((1 − t) log p + t log q).
      They are dual connections:
      Z g(X, Y) = g(∇^(m)_Z X, Y) + g(X, ∇^(e)_Z Y).
      Cf. the Levi-Civita connection ∇^LC is self-dual:
      Z g(X, Y) = g(∇^LC_Z X, Y) + g(X, ∇^LC_Z Y).
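The two geodesics are easy to compare numerically (a minimal sketch on the simplex):

```python
import numpy as np

def m_geodesic(p, q, t):
    return (1 - t) * p + t * q

def e_geodesic(p, q, t):
    r = np.exp((1 - t) * np.log(p) + t * np.log(q))
    return r / r.sum()              # renormalize onto the simplex

p = np.array([0.7, 0.2, 0.1]); q = np.array([0.1, 0.3, 0.6])
print(m_geodesic(p, q, 0.5))        # midpoint of the m-geodesic
print(e_geodesic(p, q, 0.5))        # midpoint of the e-geodesic (differs)
```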
  13. Sinkhorn as alternating e-projection
      Now consider S_{N−1} (N = mn) and the submanifolds
      Π_1 = { A ∈ R^{m×n}_{++} : A 1_n = (1/m) 1_m },
      Π_2 = { A ∈ R^{m×n}_{++} : A^T 1_m = (1/n) 1_n }.
      (These submanifolds are m-autoparallel.)
      Theorem. If A ∈ R^{m×n}_{++}, then the iterates of the Sinkhorn algorithm are e-projections: the e-geodesic from A^(2k) to A^(2k+1) (from A^(2k+1) to A^(2k+2)) is orthogonal to Π_1 (Π_2) w.r.t. the Fisher metric.
  14. Dually flat structure of S_{n−1} and KL-divergence
      Let Ψ(p) = Σ_i (p_i log p_i − p_i) be the negative entropy. Then
      θ^(e) = ∇Ψ(θ^(m)), θ^(m) = ∇Ψ*(θ^(e))   (Legendre transform),
      g = Hess(Ψ)   (Hessian).
      One can define the canonical divergence
      D(p ‖ q) = Ψ(p) − Ψ(q) − ⟨grad Ψ(q), p − q⟩
      (in our case, it is KL).
      Fact: e-projection onto m-autoparallel submanifolds can be done via canonical divergence minimization.
      −→ information-geometric proof of Csiszár (1975)
  15. KL-divergence and capacity
      Consider the case m = n.
      Capacity [Gurvits and Samorodnitsky; Idel, 2016]:
      cap(A) = inf_{x > 0} ( Π_{i=1}^n (Ax)_i / Π_{i=1}^n x_i )^{1/n}
      Capacity can be used as a "potential" for Sinkhorn:
      − log cap(A) − log n = min_{B ∈ Π_1 ∩ Π_2} D_KL(B ‖ A)
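A numerical check of this identity under the definitions above (a sketch; the capacity infimum is approximated by Nelder–Mead over u = log x, and the KL minimizer by a long Sinkhorn run):

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(3)
n = 4
A = rng.random((n, n)) + 0.1

def neg_log_cap(A):
    # -log cap(A) with cap(A) = inf_{x>0} (prod_i (Ax)_i / prod_i x_i)^(1/n),
    # approximated over u = log x (the objective is convex in u)
    n = A.shape[0]
    f = lambda u: (np.sum(np.log(A @ np.exp(u))) - np.sum(u)) / n
    res = minimize(f, np.zeros(n), method="Nelder-Mead",
                   options={"xatol": 1e-12, "fatol": 1e-14,
                            "maxiter": 50000, "maxfev": 50000})
    return -res.fun

B = A.copy()
for _ in range(1000):            # Sinkhorn limit = e-projection onto Pi_1 ∩ Pi_2
    B = B / (n * B.sum(axis=1, keepdims=True))
    B = B / (n * B.sum(axis=0, keepdims=True))

print(neg_log_cap(A) - np.log(n))           # -log cap(A) - log n
print(np.sum(B * np.log(B / A)))            # min_B D_KL(B || A); the two agree
```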
  16. Convergence of Sinkhorn algorithm
      Generalized Pythagorean theorem. If the m-geodesic from A_1 to A_2 and the e-geodesic from A_2 to A_3 are orthogonal at A_2 w.r.t. the Fisher metric, then
      D_KL(A_1 ‖ A_3) = D_KL(A_1 ‖ A_2) + D_KL(A_2 ‖ A_3).
      Theorem (Csiszár (1975)). The Sinkhorn algorithm converges to the e-projection A* of A onto Π_1 ∩ Π_2:
      D_KL(A* ‖ A) = min_{B ∈ Π_1 ∩ Π_2} D_KL(B ‖ A),
      D_KL(A* ‖ A) = D_KL(A* ‖ A^(K)) + Σ_{k=1}^K D_KL(A^(k) ‖ A^(k−1)).
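The telescoping identity can be checked directly along a Sinkhorn run (a minimal sketch; the limit A* is approximated by a long run):

```python
import numpy as np

rng = np.random.default_rng(4)
n = 4
A = rng.random((n, n)) + 0.1

def kl(B, C):
    return float(np.sum(B * np.log(B / C)))

iterates = [A.copy()]
for k in range(400):             # alternate row and column normalization
    B = iterates[-1].copy()
    if k % 2 == 0:
        B = B / (n * B.sum(axis=1, keepdims=True))
    else:
        B = B / (n * B.sum(axis=0, keepdims=True))
    iterates.append(B)

A_star = iterates[-1]            # proxy for the e-projection A*
K = 50
lhs = kl(A_star, A)
rhs = kl(A_star, iterates[K]) + sum(kl(iterates[k], iterates[k - 1])
                                    for k in range(1, K + 1))
print(lhs, rhs)                  # equal, by repeated use of the Pythagorean theorem
```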
  17. Information geometry of operator scaling
      Idea: using the Choi representation, we move to the manifold of PD matrices, then apply quantum information geometry on the PD manifold.

                      matrix scaling                       operator scaling
      manifold        p ∈ R^N_{++}: p_i > 0, Σ p_i = 1     ρ ∈ C^{N×N}: ρ ⪰ O, tr ρ = 1
      metric          Fisher                               SLD
      divergence      KL                                   ???
      dually flat?    YES                                  NO
  18. Partial trace
      For a partitioned matrix A = [A_ij]_{i,j=1}^n with blocks A_ij ∈ C^{m×m}, the partial traces are defined as
      tr_1 A = Σ_{i=1}^n A_ii ∈ C^{m×m},
      tr_2 A = [tr A_ij]_{i,j=1}^n ∈ C^{n×n}.
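Both partial traces can be written in NumPy by reshaping the nm × nm matrix into an (n, m, n, m) tensor (a sketch):

```python
import numpy as np

def tr1(X, n, m):
    """Sum of the n diagonal m x m blocks (an m x m matrix)."""
    return np.einsum("iaib->ab", X.reshape(n, m, n, m))

def tr2(X, n, m):
    """Matrix of traces of the m x m blocks (an n x n matrix)."""
    return np.einsum("iaja->ij", X.reshape(n, m, n, m))

# sanity check on a Kronecker product: tr_1(B ⊗ C) = tr(B) C, tr_2(B ⊗ C) = tr(C) B
B = np.arange(4.0).reshape(2, 2); C = np.arange(9.0).reshape(3, 3)
X = np.kron(B, C)
print(np.allclose(tr1(X, 2, 3), np.trace(B) * C),
      np.allclose(tr2(X, 2, 3), np.trace(C) * B))
```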
  19. Choi representation [Choi, 1975]
      CH(Φ) = Σ_{i,j=1}^n E_ij ⊗ Φ(E_ij) = [Φ(E_ij)]_{i,j=1}^n
      Facts:
      • Φ ↦ CH(Φ) is an isomorphism between linear maps and Hermitian matrices
      • CH(Φ) ⪰ O ⇐⇒ Φ is CP
      • CH(Φ_{L,R}) = (R† ⊗ L) CH(Φ) (R ⊗ L†)
      • tr_1 CH(Φ) = Φ(I_n), tr_2 CH(Φ) = Φ*(I_m)
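The Choi matrix and the partial-trace identities, numerically (a sketch reusing cp_map, cp_dual, tr1, tr2 from above; real Kraus operators are used to sidestep transpose conventions, and six of them so that the Choi matrix is generically PD, as assumed on the next slide):

```python
import numpy as np

def choi(kraus, n):
    """CH(Phi) = sum_{ij} E_ij ⊗ Phi(E_ij), assembled block by block."""
    m = kraus[0].shape[0]
    C = np.zeros((n * m, n * m))
    for i in range(n):
        for j in range(n):
            E = np.zeros((n, n)); E[i, j] = 1.0
            C[i*m:(i+1)*m, j*m:(j+1)*m] = cp_map(kraus, E)
    return C

rng = np.random.default_rng(5)
ops_cp = [rng.standard_normal((2, 3)) for _ in range(6)]      # m = 2, n = 3
C = choi(ops_cp, 3)
print(np.allclose(tr1(C, 3, 2), cp_map(ops_cp, np.eye(3))),   # tr_1 CH = Phi(I_n)
      np.allclose(tr2(C, 3, 2), cp_dual(ops_cp, np.eye(2))))  # tr_2 CH = Phi*(I_m)
```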
  20. Operator Sinkhorn in Choi representation
      We assume that CH(Φ) is PD. Consider
      S(C^{mn}) = { ρ ∈ C^{mn×mn} : ρ ⪰ O, tr ρ = 1 }   (density matrices)
      and
      Π_1 = { ρ ≻ O : tr_1(ρ) = I/m } ⊂ S(C^{mn}),
      Π_2 = { ρ ≻ O : tr_2(ρ) = I/n } ⊂ S(C^{mn}).
      Putting ρ_k := CH(Φ^(k)), the iterates of operator Sinkhorn are
      ρ_{2k+1} = (1/m) (I ⊗ Φ^(2k)(I)^{−1/2}) ρ_{2k} (I ⊗ Φ^(2k)(I)^{−1/2}) ∈ Π_1,
      ρ_{2k+2} = (1/n) ((Φ^(2k+1))*(I)^{−1/2} ⊗ I) ρ_{2k+1} ((Φ^(2k+1))*(I)^{−1/2} ⊗ I) ∈ Π_2.
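The same iteration written directly on the density matrix (a sketch on real symmetric data, reusing tr1/tr2 and the Choi matrix C from above; the 1/√m and 1/√n factors in the scalings keep the iterates inside Π_1 and Π_2):

```python
import numpy as np
from scipy.linalg import fractional_matrix_power

def choi_sinkhorn(rho, n, m, iters=300):
    for _ in range(iters):
        S = fractional_matrix_power(tr1(rho, n, m), -0.5) / np.sqrt(m)
        K = np.kron(np.eye(n), S)
        rho = K @ rho @ K                 # rho_{2k+1}: now tr_1(rho) = I/m
        T = fractional_matrix_power(tr2(rho, n, m), -0.5) / np.sqrt(n)
        K = np.kron(T, np.eye(m))
        rho = K @ rho @ K                 # rho_{2k+2}: now tr_2(rho) = I/n
    return rho

rho = C / np.trace(C)                     # Choi matrix from the previous sketch
rho = choi_sinkhorn(rho, 3, 2)
print(np.allclose(tr1(rho, 3, 2), np.eye(2) / 2, atol=1e-6),
      np.allclose(tr2(rho, 3, 2), np.eye(3) / 3, atol=1e-6))
```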
  21. Symmetric Logarithmic Derivative (SLD) metric
      • In classical information geometry, the Fisher metric is the only monotone metric (Cencov's theorem).
      • However, in quantum information geometry, monotone metrics are not unique.
      • Monotone metrics are characterized by operator monotone functions [Petz, 1996].
      • Each monotone metric induces its own e-connection.
      Symmetric Logarithmic Derivative (SLD) metric:
      g^S_ρ(X, Y) = tr(L^S_X Y), where X = (1/2)(ρ L^S_X + L^S_X ρ)   (Lyapunov equation)
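The SLD L^S_X is the solution of a Lyapunov (Sylvester) equation, so SciPy can compute it directly (a minimal sketch; the tangent vector here is a traceless symmetric matrix):

```python
import numpy as np
from scipy.linalg import solve_sylvester

def sld(rho, X):
    """Solve X = (rho L + L rho)/2 for the symmetric logarithmic derivative L."""
    return solve_sylvester(rho / 2, rho / 2, X)

def sld_metric(rho, X, Y):
    """g^S_rho(X, Y) = tr(L^S_X Y)."""
    return np.trace(sld(rho, X) @ Y).real

rng = np.random.default_rng(6)
M = rng.standard_normal((4, 4)); rho = M @ M.T + np.eye(4); rho /= np.trace(rho)
V = rng.standard_normal((4, 4)); X = (V + V.T) / 2
X -= np.trace(X) / 4 * np.eye(4)             # traceless tangent direction
print(sld_metric(rho, X, X) > 0)             # the metric is positive definite
```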
  22. Operator Sinkhorn = alternating e-projection
      One can introduce m/e-connections s.t.
      m-geodesic: γ(t) = (1 − t)ρ_0 + tρ_1,
      e-geodesic: γ(t) ∝ K^t ρ_0 K^t, where K = ρ_0^{−1} # ρ_1 and # is the matrix geometric mean.
      Theorem. If ρ ≻ O, then the iterates of the operator Sinkhorn algorithm are the unique e-projections w.r.t. the SLD metric: the e-geodesic from ρ_{2k} to ρ_{2k+1} (from ρ_{2k+1} to ρ_{2k+2}) is orthogonal to Π_1 (Π_2) w.r.t. the SLD metric.
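A sketch of the e-geodesic between density matrices, with the geometric mean # computed as defined on slide 29 (matrix powers via SciPy; endpoint checks included):

```python
import numpy as np
from scipy.linalg import fractional_matrix_power as mpow, logm, expm

def geo_mean(A, B):
    """A # B = A^(1/2) (A^(-1/2) B A^(-1/2))^(1/2) A^(1/2)."""
    Ah, Ahi = mpow(A, 0.5), mpow(A, -0.5)
    return Ah @ mpow(Ahi @ B @ Ahi, 0.5) @ Ah

def e_geodesic_pd(rho0, rho1, t):
    K = geo_mean(np.linalg.inv(rho0), rho1)   # K = rho0^{-1} # rho1
    Kt = expm(t * logm(K))                    # K^t
    g = Kt @ rho0 @ Kt
    return g / np.trace(g)                    # gamma(t) ∝ K^t rho0 K^t

rng = np.random.default_rng(7)
M0, M1 = rng.standard_normal((3, 3)), rng.standard_normal((3, 3))
r0 = M0 @ M0.T + np.eye(3); r0 /= np.trace(r0)
r1 = M1 @ M1.T + np.eye(3); r1 /= np.trace(r1)
print(np.allclose(e_geodesic_pd(r0, r1, 0.0), r0),   # endpoints are recovered
      np.allclose(e_geodesic_pd(r0, r1, 1.0), r1))
```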
  23. Proof sketch
      • The e-geodesic from ρ_{2k} to ρ_{2k+1} is γ(t) = K^t ρ_{2k} K^t, where K = ρ_{2k}^{−1} # ρ_{2k+1} = (1/√m) I ⊗ Φ^(2k)(I)^{−1/2}.
      • The e-representation L^S of γ̇(1) satisfies the Lyapunov equation
        (1/2)(L^S ρ_{2k+1} + ρ_{2k+1} L^S) = γ̇(1) = (log K) ρ_{2k+1} + ρ_{2k+1} (log K).
      • Since the solution of the Lyapunov equation is unique,
        L^S = 2 log K = −I ⊗ log Φ^(2k)(I) − (log m) I.
      • Therefore, γ(t) is orthogonal to Π_1 w.r.t. the SLD metric.
      • Uniqueness is shown similarly (not from the generalized Pythagorean theorem, but from the uniqueness of solutions of matrix equations).
  24. Is there a divergence for capacity?
      Consider the case m = n.
      Capacity [Gurvits, 2004]: cap(Φ) = inf_{X ≻ O} ( det Φ(X) / det X )^{1/n}
      Key tool for studying operator scaling [Gurvits, 2004; Garg et al., 2016; Allen-Zhu et al., 2018]
      Q. Is there a "divergence" D s.t.
      − log cap(Φ) − log n = min_{ρ ∈ Π_1 ∩ Π_2} D(ρ ‖ CH(Φ))
      as in matrix scaling?
  25. Is there a divergence for capacity?
      Naive idea: Umegaki relative entropy D(ρ ‖ σ) = tr[ρ (log ρ − log σ)]
      • arises from the dually flat structure with the Bogoliubov–Kubo–Mori (BKM) metric, which corresponds to Ψ(ρ) = tr(ρ log ρ − ρ).
      However, numerical experiments show that this does not coincide with the operator Sinkhorn iteration...
      Actually, the SLD metric is NOT dually flat!
      Still open!
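The mismatch can be reproduced along these lines (a sketch reusing choi and choi_sinkhorn from earlier; the capacity is approximated by a local search over X = expm(U) with U symmetric, under the capacity normalization of slide 24):

```python
import numpy as np
from scipy.linalg import expm, logm
from scipy.optimize import minimize

rng = np.random.default_rng(8)
ops_sq = [rng.standard_normal((2, 2)) for _ in range(4)]   # m = n = 2

def umegaki(r, s):
    return np.trace(r @ (logm(r) - logm(s))).real

def neg_log_cap_op(kraus, n=2):
    def f(u):                     # -log of (det Phi(X) / det X)^(1/n) at X = expm(U)
        U = np.array([[u[0], u[1]], [u[1], u[2]]])
        X = expm(U)
        PhiX = sum(A @ X @ A.T for A in kraus)
        return (np.log(np.linalg.det(PhiX)) - np.log(np.linalg.det(X))) / n
    res = minimize(f, np.zeros(3), method="Nelder-Mead",
                   options={"xatol": 1e-12, "fatol": 1e-14,
                            "maxiter": 50000, "maxfev": 50000})
    return -res.fun

rho = choi(ops_sq, 2); rho /= np.trace(rho)
rho_star = choi_sinkhorn(rho, 2, 2, iters=500)
print(neg_log_cap_op(ops_sq) - np.log(2))   # -log cap(Phi) - log n
print(umegaki(rho_star, rho))               # D(rho* || rho): the numbers differ
```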
  26. Numerical example
      Generated random density matrices and compared
      • − log cap
      • D(ρ* ‖ ρ)   (ρ*: limit of Sinkhorn)
      [Scatter plot of Umegaki relative entropy D(ρ* ‖ ρ) against − log cap: the two quantities do not coincide.]
  27. Summary
      Operator Sinkhorn is alternating e-projection w.r.t. the SLD metric on PD matrices.
      Future work
      • Divergence characterization?
      • Convergence analysis based on the SLD metric? Optimization algorithms?
      • Applications of operator scaling in quantum information theory?
  28. Thank you!
  29. Matrix geometric mean
      For PD matrices A, B ≻ O, the matrix geometric mean is defined as
      A # B = A^{1/2} (A^{−1/2} B A^{−1/2})^{1/2} A^{1/2}.
      Properties
      • A # B ≻ O
      • A # B = B # A
      • A # B is the unique PD solution of the algebraic Riccati equation X A^{−1} X = B.
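These properties are easy to check numerically (a sketch reusing geo_mean from the slide-22 sketch):

```python
import numpy as np

rng = np.random.default_rng(9)
MA, MB = rng.standard_normal((3, 3)), rng.standard_normal((3, 3))
A = MA @ MA.T + np.eye(3); B = MB @ MB.T + np.eye(3)
G = geo_mean(A, B)

print(np.all(np.linalg.eigvalsh(G) > 0),          # A # B is positive definite
      np.allclose(G, geo_mean(B, A)),             # symmetry: A # B = B # A
      np.allclose(G @ np.linalg.inv(A) @ G, B))   # Riccati: X A^{-1} X = B
```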