We formulate closed-form Hessian distances of information entropies in one-dimensional probability density space embedded with the $L^2$-Wasserstein metric. Some analytical examples are provided.
consider a divergence function between them by $D\colon \mathbb{R}_+ \times \mathbb{R}_+ \to \mathbb{R}_+$. Several examples are given below. Squared Euclidean distance: $D(X\|Y) = (X - Y)^2$; Kullback-Leibler (KL) divergence: $D(X\|Y) = X \log\frac{X}{Y}$; squared Hellinger distance: $D(X\|Y) = 4(\sqrt{X} - \sqrt{Y})^2$. We briefly review them in $L^2$ space. And we plan to build their counterparts in optimal transport (Wasserstein) space.
KL divergence: $D_{\mathrm{KL}}(p\|q) = \int_\Omega p(x) \log\frac{p(x)}{q(x)}\,dx$. The KL divergence has several useful properties. Nonsymmetry: $D_{\mathrm{KL}}(p\|q) \neq D_{\mathrm{KL}}(q\|p)$ in general; separability; convexity in both variables $p$ and $q$.
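the KL divergence. Observe that $D_{\mathrm{KL}}(q + \dot q\,\|\,q) = \frac{1}{2}\, g_q(\dot q, \dot q) + o(\|\dot q\|_{L^2}^2)$, where the notation $g_q(\dot q, \dot q) = \int_\Omega \frac{|\dot q(x)|^2}{q(x)}\,dx$ represents the Hessian operator of the negative entropy $\int_\Omega q(x) \log q(x)\,dx$ in $L^2$ space, and the perturbation satisfies $\int_\Omega \dot q(x)\,dx = 0$. Here $g_q(\cdot, \cdot)$ is a Hessian metric, also named the Fisher-Rao information metric.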
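The second-order expansion of the KL divergence can be checked numerically. The sketch below is only an illustration under assumptions that are not part of the slides: the domain is taken to be $\Omega = [0, 1]$, the density `q` and the zero-mean perturbation `direction` are arbitrary choices, and integrals are approximated by midpoint sums. On this example the ratio of the divergence to the quadratic form $\int_\Omega |\dot q|^2 / q\,dx$ approaches $1/2$ as the perturbation shrinks, which is the usual factor in a second-order Taylor expansion.

```python
import numpy as np

def kl_divergence(p, q, dx):
    """Midpoint-sum approximation of int p log(p/q) dx."""
    return float(np.sum(p * np.log(p / q)) * dx)

def fisher_rao_form(q, q_dot, dx):
    """Midpoint-sum approximation of int |q_dot|^2 / q dx."""
    return float(np.sum(q_dot**2 / q) * dx)

def expansion_ratio(eps, n=2000):
    """Ratio D_KL(q + eps*direction || q) / g_q(eps*direction, eps*direction)."""
    dx = 1.0 / n
    x = (np.arange(n) + 0.5) * dx              # midpoint grid on the assumed domain [0, 1]
    q = 1.0 + 0.5 * np.cos(2 * np.pi * x)      # illustrative positive density, integrates to 1
    direction = np.sin(2 * np.pi * x)          # illustrative zero-mean perturbation
    q_dot = eps * direction
    return kl_divergence(q + q_dot, q, dx) / fisher_rao_form(q, q_dot, dx)

for eps in (1e-1, 1e-2, 1e-3):
    print(f"eps={eps:g}  D_KL / g_q = {expansion_ratio(eps):.6f}")
```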
transport the mountain with shape X and density $q(x)$ to another shape Y with density $p(y)$? That is, $\mathrm{Dist}_T(p, q)^2 = \inf_{T\colon \Omega \to \Omega} \left\{ \int_\Omega \|T(x) - x\|^2 q(x)\,dx \;:\; T_\# q = p \right\}$. The problem was first introduced by Monge in 1781 and relaxed by Kantorovich in 1942. It introduces a metric function on the set of probability densities, named the optimal transport distance, Wasserstein metric, or Earth Mover's distance (Ambrosio, Gangbo, McCann, Benamou, Brenier, Villani, Otto, Figalli, et al.). Nowadays, optimal transport distances have been shown to be useful in inference problems and inverse problems (Poggio, Peyré, Yunan, Engquist, Arjovsky, Osher, et al.).
in Wasserstein space. Natural questions: (i) What are Hessian distances in Wasserstein space? (ii) What is the "Hellinger" distance in Wasserstein space? Related studies: Amari, Karakida, Oizumi, Cuturi; Guo, Hong, Yang; Leonard Wong, Yang, Zhang; Ay, Felice.
distance has the following closed-form formulations. $\mathrm{Dist}_T(p, q)^2 = \int_\Omega |T(x) - x|^2 q(x)\,dx$, where $T$ is a monotone mapping function such that $p(T(x))\,T'(x) = q(x)$. Changing variables to $y = F_q(x)$, so that $T(x) = F_p^{-1}(y)$ and $dy = q(x)\,dx$, gives $\mathrm{Dist}_T(p, q)^2 = \int_0^1 |F_p^{-1}(y) - F_q^{-1}(y)|^2\,dy$, where $F_p$, $F_q$ are the cumulative distribution functions of $p$, $q$, respectively. From now on, we call $F_p^{-1}$ the transport coordinates.
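The inverse-CDF formula lends itself to a direct numerical check. The following is a minimal sketch, assuming two Gaussian densities (an illustrative choice, not taken from the slides) whose quantile functions are available through Python's standard `statistics.NormalDist`. The helper name `wasserstein2_squared` is hypothetical. For Gaussians the squared distance has the well-known closed form $(\mu_p - \mu_q)^2 + (\sigma_p - \sigma_q)^2$, which the midpoint rule should reproduce approximately.

```python
import numpy as np
from statistics import NormalDist

def wasserstein2_squared(quantile_p, quantile_q, n=20000):
    """Midpoint-rule approximation of int_0^1 |F_p^{-1}(y) - F_q^{-1}(y)|^2 dy."""
    y = (np.arange(n) + 0.5) / n                       # midpoints avoid y = 0 and y = 1
    xp = np.array([quantile_p(float(t)) for t in y])   # transport coordinates of p
    xq = np.array([quantile_q(float(t)) for t in y])   # transport coordinates of q
    return float(np.mean((xp - xq) ** 2))

# Illustrative Gaussian densities (an assumption of this sketch).
p = NormalDist(mu=1.0, sigma=2.0)
q = NormalDist(mu=0.0, sigma=1.0)

numeric = wasserstein2_squared(p.inv_cdf, q.inv_cdf)
closed_form = (p.mean - q.mean) ** 2 + (p.stdev - q.stdev) ** 2
print(f"numeric = {numeric:.4f}, closed form = {closed_form:.4f}")
```

The midpoint rule is used because the quantile functions diverge at the endpoints of $[0, 1]$; the agreement is approximate and improves as `n` grows.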
by $\mathcal{F}(p) = \int_\Omega f(p(x))\,dx$. The Hessian metric of the $f$-entropy in optimal transport space satisfies $g^{T}_p(\dot p, \dot p) = \int_\Omega f''(p)\,|\nabla^2 \varphi|^2\, p(x)^2\,dx$, where $\dot p = -\nabla \cdot (p \nabla \varphi)$.
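$\to \mathbb{R}$ by $h(y) = \int_1^y \sqrt{f''\!\left(\tfrac{1}{z}\right)}\, \frac{1}{z^{3/2}}\,dz$. Theorem. The squared transport Hessian distance of the $f$-entropy has the following formulations. (i) Inverse CDF formulation: $\mathrm{Dist}_{TH}(p, q)^2 = \int_0^1 \left| h(\nabla_y F_p^{-1}(y)) - h(\nabla_y F_q^{-1}(y)) \right|^2 dy$. (ii) Mapping formulation: $\mathrm{Dist}_{TH}(p, q)^2 = \int_\Omega \left| h\!\left(\frac{\nabla_x T(x)}{q(x)}\right) - h\!\left(\frac{1}{q(x)}\right) \right|^2 q(x)\,dx$, where $T$ is an optimal transport mapping function such that $T_\# q = p$ and $T(x) = F_p^{-1}(F_q(x))$.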
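As a worked instance of the inverse CDF formulation, take the negative Boltzmann-Shannon entropy $f(p) = p \log p$. Then $f''(1/z) = z$, so the integrand defining $h$ is $1/z$ and $h(y) = \log y$. The sketch below evaluates the distance for this choice of $f$, assuming two Gaussian densities (an illustrative choice, not from the slides) and using $\nabla_y F^{-1}(y) = 1 / p(F^{-1}(y))$. The helper names are hypothetical. For Gaussians the quantile derivatives differ by the constant factor $\sigma_p / \sigma_q$, so the result should be close to $(\log \sigma_p - \log \sigma_q)^2$.

```python
import math
import numpy as np
from statistics import NormalDist

def quantile_derivative(dist, y):
    """d/dy F^{-1}(y) = 1 / density(F^{-1}(y)), by the inverse function theorem."""
    return 1.0 / dist.pdf(dist.inv_cdf(y))

def transport_hessian_dist_sq(dist_p, dist_q, h=math.log, n=20000):
    """Midpoint-rule approximation of int_0^1 |h(dF_p^{-1}) - h(dF_q^{-1})|^2 dy.

    The default h = log corresponds to f(p) = p log p (an assumption of this sketch).
    """
    y = (np.arange(n) + 0.5) / n
    diffs = [
        h(quantile_derivative(dist_p, float(t))) - h(quantile_derivative(dist_q, float(t)))
        for t in y
    ]
    return float(np.mean(np.square(diffs)))

# Illustrative Gaussian densities (an assumption of this sketch).
p = NormalDist(mu=1.0, sigma=2.0)
q = NormalDist(mu=0.0, sigma=1.0)

numeric = transport_hessian_dist_sq(p, q)
expected = (math.log(p.stdev) - math.log(q.stdev)) ** 2
print(f"numeric = {numeric:.6f}, (log sigma_p - log sigma_q)^2 = {expected:.6f}")
```

Other entropies can be tried by passing a different `h`, computed from the integral formula for $h$.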
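the study of transport Hessian distances to transport Bregman divergences. Transport KL divergence: $D_{\mathrm{TKL}}(p\|q) := \int_0^1 \left( \frac{\nabla_y F_p^{-1}(y)}{\nabla_y F_q^{-1}(y)} - \log \frac{\nabla_y F_p^{-1}(y)}{\nabla_y F_q^{-1}(y)} - 1 \right) dy$. KL divergence: $D_{\mathrm{KL}}(p\|q) = \int_\Omega \nabla_x F_p(x) \log \frac{\nabla_x F_p(x)}{\nabla_x F_q(x)}\,dx$. Here $F_p(x) = \int^x p(s)\,ds$ and $F_q(x) = \int^x q(s)\,ds$ are the cumulative distribution functions of the probability densities $p$, $q$, respectively.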
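The transport KL divergence can be evaluated in the same way as the distances above, by working with derivatives of the quantile functions. The sketch below is illustrative only: it assumes two Gaussian densities (not taken from the slides), and the helper names are hypothetical. For Gaussians the ratio of quantile derivatives is the constant $r = \sigma_p / \sigma_q$, so the divergence should be close to $r - \log r - 1$.

```python
import math
import numpy as np
from statistics import NormalDist

def quantile_derivative(dist, y):
    """d/dy F^{-1}(y) = 1 / density(F^{-1}(y)), by the inverse function theorem."""
    return 1.0 / dist.pdf(dist.inv_cdf(y))

def transport_kl(dist_p, dist_q, n=20000):
    """Midpoint-rule approximation of int_0^1 (r - log r - 1) dy,
    where r(y) is the ratio of the quantile derivatives of p and q."""
    y = (np.arange(n) + 0.5) / n
    ratio = np.array([
        quantile_derivative(dist_p, float(t)) / quantile_derivative(dist_q, float(t))
        for t in y
    ])
    return float(np.mean(ratio - np.log(ratio) - 1.0))

# Illustrative Gaussian densities (an assumption of this sketch).
p = NormalDist(mu=1.0, sigma=2.0)
q = NormalDist(mu=0.0, sigma=1.0)

r = p.stdev / q.stdev
numeric = transport_kl(p, q)
expected = r - math.log(r) - 1.0
print(f"numeric = {numeric:.6f}, r - log r - 1 = {expected:.6f}")
```

In this Gaussian example the value does not depend on the means, since shifting a density leaves the derivative of its quantile function unchanged.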