
Interpolating between Optimal Transport and KL regularized Optimal Transport using Rényi divergences

Regularized optimal transport (OT) has received much attention in recent years, starting with Cuturi's paper on Kullback-Leibler (KL) divergence regularized OT. In this paper, we propose to regularize the OT problem using the family of α-Rényi divergences for α ∈ (0, 1). Rényi divergences are neither f-divergences nor Bregman distances, but they recover the KL divergence in the limit α ↗ 1. The advantage of introducing the additional parameter α is that for α ↘ 0 we obtain convergence to the unregularized OT problem. For the KL regularized OT problem, this was previously achieved by letting the regularization parameter tend to zero, which causes numerical instabilities. We present two different ways to obtain premetrics on probability measures, namely by Rényi divergence constraints and by penalization. The latter premetric interpolates between the unregularized and the KL regularized OT problem, with weak convergence of the minimizer, generalizing the interpolating property of KL regularized OT. We use a nested mirror descent algorithm to solve the primal formulation. On both real and synthetic data sets, Rényi regularized OT plans outperform their KL and Tsallis counterparts: they are closer to the unregularized transport plans and recover the ground truth better in inference tasks.

This is joint work with Jonas Bresch (TU Berlin) available at https://arxiv.org/abs/2404.18834.

Viktor Stein

September 12, 2024

Transcript

  1. Interpolating between Optimal Transport & KL regularized Optimal Transport with Rényi Divergences. Joint work with Jonas Bresch (TU Berlin). Graduate Colloquium, University of South Carolina, Columbia, 12.09.2024 (Alec Helm, Jonah Klein).
  2. Motivation - Deficiencies of KL regularized OT. Optimal transport (OT) ⇝ distance on probability measures via a transport plan. Problem: O(N³) cost for N samples. Solution: entropic OT (Cuturi, NeurIPS '13): add ε times a KL regularizer to the OT problem, for ε > 0. Sinkhorn algorithm ⇝ O(N^{1+1/d} ln(N)). Problem in practice: ε must be very small to get an accurate plan, but small ε ⇝ numerical instabilities. Our solution: instead add ε times a different (α-Rényi) regularizer and let α ↘ 0 instead of ε ↘ 0.
  3. Outline: 1. Tsallis divergence and α-Rényi divergence. 2. Optimal transport and its regularization. 3. Rényi regularized OT. 4. Interpolation properties. 5. Solving Rényi regularized OT with mirror descent. 6. Numerical results.
    [Section photos: https://upload.wikimedia.org/wikipedia/commons/thumb/e/e2/Constantino_Tsallis_February_2010.jpg/800px-Constantino_Tsallis_February_2010.jpg, https://repository.aip.org/islandora/object/nbla:310720; illustration from dsweber2.github.io/Optimal-Transport-Information-Geometry/. Thumbnail heat maps of KL-, Rényi-, Tsallis-regularized and EMD plans omitted.]
  4. q-Tsallis divergence and α-Rényi divergence.
    Definition (α-Rényi divergence). The α-Rényi divergence of order α ∈ (0, 1) is
    R_α : P(X) × P(X) → [0, ∞], (µ | ν) ↦ 1/(α − 1) · ln( ∫_X (ρ_µ(x) / ρ_ν(x))^α dν(x) ),
    where for σ ∈ P(X), ρ_σ denotes the density of σ w.r.t. ½(µ + ν), and ln(0) := −∞.
    Muzellec et al. (AAAI 2017) examine Tsallis-regularized OT.
    Definition (q-Tsallis divergence). The q-Tsallis divergence of order q > 0, q ≠ 1, is
    T_q = 1/(q − 1) [exp((q − 1) R_q) − 1] : P(X) × P(X) → [0, ∞], (µ | ν) ↦ 1/(q − 1) ( ∫_X (ρ_µ(x) / ρ_ν(x))^q dν(x) − 1 ).
    Tsallis is the first-order approximation of Rényi, since ln(y) ≈ y − 1 (first-order Taylor expansion at y = 1).
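
For discrete measures on a common finite support, both divergences reduce to finite sums over the points where both weights are positive. A minimal NumPy sketch of the two definitions above (function names are illustrative, not from the paper):

```python
import numpy as np

def renyi_divergence(p, q, alpha):
    """Discrete alpha-Renyi divergence R_alpha(p | q) for alpha in (0, 1).

    On the common support, (rho_p / rho_q)^alpha dq reduces to
    p_i^alpha * q_i^(1 - alpha); the convention ln(0) = -inf is handled
    by restricting the sum to entries where both weights are positive.
    """
    mask = (p > 0) & (q > 0)
    s = np.sum(p[mask] ** alpha * q[mask] ** (1.0 - alpha))
    return np.log(s) / (alpha - 1.0)

def tsallis_divergence(p, q, q_order):
    """Discrete q-Tsallis divergence T_q(p | q) for q_order > 0, != 1,
    via T_q = (exp((q - 1) R_q) - 1) / (q - 1)."""
    mask = (p > 0) & (q > 0)
    s = np.sum(p[mask] ** q_order * q[mask] ** (1.0 - q_order))
    return (s - 1.0) / (q_order - 1.0)

p = np.array([0.25, 0.75])
q = np.array([0.5, 0.5])
print(renyi_divergence(p, q, alpha=0.5))      # ~0.069, nonnegative
print(tsallis_divergence(p, q, q_order=0.5))  # ~0.068, first-order close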
  5. Rényi versus Tsallis. [Figure: the Rényi divergence and the Tsallis divergence from (p, 1 − p) to (0.25, 0.75), plotted as functions of p ∈ [0, 1] for several orders.]
    Theorem (Properties of the Rényi divergence).
    • Divergence property: R_α(µ | ν) ≥ 0, and R_α(µ | ν) = 0 if and only if µ = ν.
    • R_α is nondecreasing and continuous in α ∈ [0, 1], with lim_{α↗1} R_α = KL pointwise.
    • R_α is jointly convex and jointly weakly lower semicontinuous for α ∈ (0, 1].
  6. [Outline slide repeated as a section divider; contents as on slide 3.]
  7. Wasserstein-p metric space. Let (X, d) be a metric space with d lower semicontinuous, let p ∈ [1, ∞), and let P(X) denote the set of probability measures on X. Define
    P_p(X) := { µ ∈ P(X) : ∫_X d(x, x₀)^p dµ(x) < ∞ }, x₀ ∈ X.
    On P_p(X), the Wasserstein-p metric is given by
    OT(µ, ν)^p = min_{π ∈ Π(µ,ν)} ∫_{X×X} d(x, y)^p dπ(x, y), µ, ν ∈ P_p(X),
    where the transport polytope is
    Π(µ, ν) := { π ∈ P(X × X) : π(A × X) = µ(A), π(X × A) = ν(A) for all measurable A }.
    The product measure µ ⊗ ν lies in Π(µ, ν). Notation: ⟨f, µ⟩ := ∫_X f(x) dµ(x), so we can write OT(µ, ν)^p = min{ ⟨d^p, π⟩ : π ∈ Π(µ, ν) }.
    [Illustration: https://dsweber2.github.io/Optimal-Transport-Information-Geometry/]
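
For finitely supported marginals, this minimum is a linear program over the transport polytope. A self-contained SciPy sketch for small N, using the marginal vectors r, c and cost matrix M introduced on the discretization slide below (the helper name discrete_ot is an assumption, not from the talk):

```python
import numpy as np
from scipy.optimize import linprog

def discrete_ot(r, c, M):
    """Solve min_P <M, P> over Pi(r, c) as a linear program.

    r, c: marginal weights in the simplex; M: (N x N) cost matrix,
    e.g. M[i, j] = d(x_i, x_j)**p. Returns the optimal plan P.
    Builds dense constraint matrices, so only suitable for small N.
    """
    N = len(r)
    # Marginal constraints P @ 1 = r and P.T @ 1 = c, with P flattened row-major.
    A_rows = np.kron(np.eye(N), np.ones(N))   # row-sum constraints
    A_cols = np.kron(np.ones(N), np.eye(N))   # column-sum constraints
    A_eq = np.vstack([A_rows, A_cols])[:-1]   # drop one redundant constraint
    b_eq = np.concatenate([r, c])[:-1]
    res = linprog(M.ravel(), A_eq=A_eq, b_eq=b_eq, bounds=(0, None))
    return res.x.reshape(N, N)
```

For larger problems, a dedicated network-flow solver (such as the EMD solver in the POT library) is far faster than a generic LP; this cubic-type cost is exactly the motivation for entropic regularization on the next slide.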
  8. Cuturi's Entropic Optimal Transport. Regularizer: the Kullback-Leibler divergence
    KL(· | µ ⊗ ν) : Π(µ, ν) → [0, ∞), π ↦ ∫_{X×X} ln( dπ / d(µ ⊗ ν) (x, y) ) dπ(x, y).
    KL-regularized OT:
    OT_ε(µ, ν) := min_{π ∈ Π(µ,ν)} ⟨d^p, π⟩ + ε KL(π | µ ⊗ ν) = max_{f,g ∈ C(X)} ⟨ f ⊕ g − ε exp( (f ⊕ g − d^p) / ε ), µ ⊗ ν ⟩,
    where (f ⊕ g)(x, y) := f(x) + g(y) for f, g ∈ C(X). Primal-dual relation: π̂_ε = exp( (f̂ ⊕ ĝ − d^p) / ε ) · (µ ⊗ ν). Here c = d^p. [Illustration © G. Peyré, M. Cuturi, 2019.]
    Limits: as ε ↘ 0, π̂_ε → argmin{ KL(π | µ ⊗ ν) : ⟨d^p, π⟩ = OT(µ, ν) } and OT_ε(µ, ν) → OT(µ, ν); as ε → ∞, π̂_ε → µ ⊗ ν and OT_ε(µ, ν) → ⟨d^p, µ ⊗ ν⟩.
  9. Discretization and Sinkhorn algorithm. Discretize X ≈ (x_i)_{i=1}^N, so that µ, ν ∈ P(X) become vectors r := (µ(x_i))_{i=1}^N, c := (ν(x_i))_{i=1}^N ∈ Σ_N, where Σ_N := { x ∈ [0, 1]^N : Σ_{i=1}^N x_i = 1 }.
    Cost matrix: M := (d(x_i, x_j)^p)_{i,j=1}^N. Transport polytope: Π(r, c) := { P ∈ Σ_{N×N} : P 1_N = r, P^T 1_N = c }. [Illustration © T. Vayer.]
    The optimal transport plan is the KL projection of the Gibbs kernel:
    P̂_ε = argmin_{P ∈ Π(r,c)} KL( P | exp(−M/ε) ).
    The Sinkhorn algorithm finds this projection via matrix scaling.
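
A minimal NumPy sketch of this matrix-scaling iteration (illustrative, not the paper's code; in practice a log-domain implementation is preferred when ε is small):

```python
import numpy as np

def sinkhorn(r, c, M, eps, n_iter=1000, tol=1e-9):
    """KL-project the Gibbs kernel exp(-M/eps) onto Pi(r, c) by matrix scaling.

    Returns P = diag(u) @ K @ diag(v) with P @ 1 = r and P.T @ 1 = c.
    """
    K = np.exp(-M / eps)          # Gibbs kernel
    u = np.ones_like(r)
    for _ in range(n_iter):
        v = c / (K.T @ u)         # rescale to match column marginals
        u_new = r / (K @ v)       # rescale to match row marginals
        if np.max(np.abs(u_new - u)) < tol:
            u = u_new
            break
        u = u_new
    return u[:, None] * K * v[None, :]
```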
  10. [Outline slide repeated as a section divider; contents as on slide 3.]
  11. Rényi-regularized OT.
    Definition (Rényi-regularized OT [Bresch, S. '24]). The Rényi-regularized OT problem is
    OT_{ε,α} : P_p(X) × P_p(X) → [0, ∞), (µ, ν) ↦ min_{π ∈ Π(µ,ν)} ⟨c, π⟩ + ε R_α(π | µ ⊗ ν).
    Theorem (OT_{ε,α} is a pre-metric [Bresch, S. '24]). P_p(X)² ∋ (µ, ν) ↦ 1_{[µ≠ν]} OT_{ε,α}(µ, ν) is a metric for α ∈ (0, 1), ε ∈ [0, ∞).
    Lemma (Monotonicity of Rényi regularized OT [Bresch, S. '24]). Let µ, ν ∈ P_p(X), α, α′ ∈ (0, 1) and ε, ε′ ≥ 0 with α > α′ and ε < ε′. Then
    OT_{ε′,α}(µ, ν) ≥ OT_{ε,α}(µ, ν) ≥ OT_{ε,α′}(µ, ν).
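
In the discrete setting of slide 9, the objective being minimized is short to write down; a sketch reusing the divergence conventions above (the helper name is an assumption):

```python
import numpy as np

def renyi_ot_objective(P, r, c, M, eps, alpha):
    """Discrete Renyi-regularized OT objective <M, P> + eps * R_alpha(P | r c^T),
    assuming the plan P has all entries positive."""
    K = np.outer(r, c)                        # product plan r c^T
    S = np.sum(P**alpha * K**(1.0 - alpha))   # sum_ij P_ij^a (r c^T)_ij^(1-a)
    return np.sum(M * P) + eps * np.log(S) / (alpha - 1.0)
```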
  12. Theorem (Interpolation properties). Values:
    • OT_{ε,α}(µ, ν) → OT(µ, ν) as α ↘ 0 or ε ↘ 0,
    • OT_{ε,α}(µ, ν) → ⟨d, µ ⊗ ν⟩ as ε → ∞,
    • OT_{ε,α}(µ, ν) → OT_ε(µ, ν) as α ↗ 1.
    Minimizers:
    • π_{ε,α}(µ, ν) → some π ∈ Π(µ, ν) with ⟨d, π⟩ = OT(µ, ν) as α ↘ 0,
    • π_{ε,α}(µ, ν) → argmin{ R_α(π | µ ⊗ ν) : π ∈ Π(µ, ν), ⟨d, π⟩ = OT(µ, ν) } as ε ↘ 0,
    • π_{ε,α}(µ, ν) → µ ⊗ ν as ε → ∞,
    • π_{ε,α}(µ, ν) → π_ε as α ↗ 1.
  13. [Outline slide repeated as a section divider; contents as on slide 3.]
  14. Mirror descent. Solve min_{x ∈ K} f(x), where K ⊂ R^n is compact, via the updates
    x^(k+1) = argmin_{y ∈ K} D_h( y, (∇h)^{−1}( ∇h(x^(k)) − η_k ∇f(x^(k)) ) ), x^(0) ∈ K, η_k > 0, (1)
    for a convex function h : R^n → R with special properties. [Illustration: https://doi.org/10.48550/arXiv.2108.09489]
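
With K = Σ_N and h the negative Shannon entropy (as on the next slide), D_h is the KL divergence and update (1) becomes an exponentiated-gradient step followed by a normalization. A minimal sketch, with a constant step size for simplicity (names illustrative):

```python
import numpy as np

def entropic_mirror_descent(grad_f, x0, n_steps=500, eta=0.1):
    """Mirror descent on the probability simplex with h = negative Shannon
    entropy, so D_h = KL and (1) is an exponentiated-gradient step."""
    x = x0.copy()
    for _ in range(n_steps):
        x = x * np.exp(-eta * grad_f(x))   # dual step: grad h(x) - eta grad f(x)
        x = x / x.sum()                    # KL projection back onto the simplex
    return x

# Example: minimize f(x) = <c, x> over the simplex; the iterates
# concentrate on the coordinate with the smallest cost.
c = np.array([0.3, 0.1, 0.5])
x = entropic_mirror_descent(lambda x: c, np.ones(3) / 3)
```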
  15. Numerical experiments - better transport plans. Choose K = Σ_N (the probability simplex) and −h = Shannon entropy, so that D_h = KL. The Rényi-regularized OT objective
    Π(r, c) → [0, ∞), P ↦ ⟨M, P⟩ + ε R_α(P | r c^T)
    is not Lipschitz continuous, but it is locally Lipschitz on
    { P ∈ Π(r, c) : P|_{supp(r c^T)} > 0 } = Π(r, c) ∩ R^{N×N}_{>0},
    which suffices for convergence of mirror descent with a special step size (η_k)_{k∈N} (You, Li, 2022). In each iteration, one KL projection onto Π(r, c) (using the Sinkhorn algorithm) is performed:
    P^(k) ← Sinkhorn( P^(k−1) ⊙ exp( −η_k M − η_k λ · (α/(α − 1)) · (r c^T ⊘ P^(k−1))^{1−α} / ⟨(P^(k−1))^α, (r c^T)^{1−α}⟩ ); r, c ), k ∈ N.
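
A compact sketch of this nested scheme, combining the exponentiated-gradient step with a Sinkhorn-style KL projection onto Π(r, c). A constant step size and fixed iteration counts stand in for the schedule of You & Li; function names are assumptions:

```python
import numpy as np

def kl_projection(Q, r, c, n_iter=200):
    """KL-project a positive matrix Q onto Pi(r, c) by Sinkhorn scaling."""
    u = np.ones_like(r)
    for _ in range(n_iter):
        v = c / (Q.T @ u)
        u = r / (Q @ v)
    return u[:, None] * Q * v[None, :]

def renyi_ot_mirror_descent(r, c, M, lam, alpha, n_steps=200, eta=0.5):
    """Nested mirror descent sketch for Renyi-regularized OT:
    exponentiated-gradient step on <M, P> + lam * R_alpha(P | r c^T),
    then a KL projection back onto the transport polytope."""
    K = r[:, None] * c[None, :]            # product plan r c^T
    P = K.copy()                           # feasible, strictly positive start
    for _ in range(n_steps):
        S = np.sum(P**alpha * K**(1 - alpha))
        grad_R = alpha / (alpha - 1) * (K / P)**(1 - alpha) / S
        Q = P * np.exp(-eta * (M + lam * grad_R))   # mirror (dual) step
        P = kl_projection(Q, r, c)                  # project onto Pi(r, c)
    return P
```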
  16. Rényi regularization yields more accurate plans. [Figure: regularized OT plans for Gaussian (top) and Poisson (bottom) marginals, with regularization parameter λ = 10, Rényi order α = 0.01, Tsallis order q = 2.]
  17. Numerical experiments - predicting voter migration (regularization parameter ε = 1).

    regularizer     abs error ± std             KL error     mean squared error
    KL              2.4221×10¹ ± 2.848×10¹      8.422×10²    9.008×10⁴
    Tsallis         9.409 ± 1.529×10¹           3.173×10²    2.063×10⁴
    OT              1.845×10¹ ± 2.358×10¹       7.655×10²    5.738×10⁴
    (3/10)-Rényi    6.611 ± 7.868               2.128×10²    6.759×10³
  18. Conclusion.
    − Contribution. Regularize the optimal transport problem using the α-Rényi divergences R_α for α ∈ (0, 1). Prove a dual formulation and interpolation properties.
    − Prior work. Regularization with KL = lim_{α↗1} R_α and with the q-Tsallis divergence.
    − Method. Solve the primal problem with mirror descent and the dual problem with subgradient descent.
    − Result. Rényi-regularized OT plans outperform KL- and Tsallis-regularized OT plans on real and synthetic data.
    − Novelty. R_α is neither an f-divergence nor a Bregman divergence, and R_α is not "separable" due to the logarithm.
  19. Thank you for your attention! I am happy to take any questions.
    Paper: https://arxiv.org/abs/2404.18834. My website: https://viktorajstein.github.io.
    [PC19, Cut13, MNPN17, vEH14, BT03, NY83, NS21, Rén61, Tsa88]
  20. References I
    [BT03] Amir Beck and Marc Teboulle, Mirror descent and nonlinear projected subgradient methods for convex optimization, Oper. Res. Lett. 31 (2003), no. 3, 167–175.
    [Cut13] Marco Cuturi, Sinkhorn distances: lightspeed computation of optimal transport, Proceedings of the 26th International Conference on Neural Information Processing Systems - Volume 2 (Red Hook, NY, USA), NIPS'13, Curran Associates Inc., 2013, pp. 2292–2300.
    [MNPN17] Boris Muzellec, Richard Nock, Giorgio Patrini, and Frank Nielsen, Tsallis regularized optimal transport and ecological inference, Proceedings of the AAAI Conference on Artificial Intelligence (San Francisco, California, USA), vol. 31, 2017.
    [NS21] Sebastian Neumayer and Gabriele Steidl, From optimal transport to discrepancy, Handbook of Mathematical Models and Algorithms in Computer Vision and Imaging: Mathematical Imaging and Vision (2021), 1–36.
    [NY83] Arkadij Semenovič Nemirovskij and David Borisovich Yudin, Problem complexity and method efficiency in optimization, Wiley, New York, 1983.
  21. References II
    [PC19] Gabriel Peyré and Marco Cuturi, Computational optimal transport, Found. Trends Mach. Learn. 11 (2019), no. 5-6, 355–607.
    [Rén61] Alfréd Rényi, On measures of entropy and information, Proceedings of the Fourth Berkeley Symposium on Mathematical Statistics and Probability, Volume 1: Contributions to the Theory of Statistics (Berkeley, California, USA), vol. 4, University of California Press, 1961, pp. 547–562.
    [Tsa88] Constantino Tsallis, Possible generalization of Boltzmann-Gibbs statistics, Journal of Statistical Physics 52 (1988), 479–487.
    [vEH14] Tim van Erven and Peter Harremos, Rényi divergence and Kullback-Leibler divergence, IEEE Trans. Inf. Theory 60 (2014), no. 7, 3797–3820.
  22. Work in progress - Rényi-Sinkhorn divergence. Since OT_{ε,α}(µ, µ) ≠ 0, we debias to obtain a valid, differentiable distance:
    D_{ε,α}(µ, ν) := OT_{ε,α}(µ, ν) − ½ OT_{ε,α}(µ, µ) − ½ OT_{ε,α}(ν, ν).
    This can be used for gradient flows. [Plot omitted.]
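
Given any solver for OT_{ε,α} (such as the mirror-descent sketch above), this debiasing is a three-call wrapper; a sketch with an assumed solver interface:

```python
def renyi_sinkhorn_divergence(ot_solver, mu, nu):
    """Debiased divergence: subtract the self-terms so that D(mu, mu) = 0.
    `ot_solver(a, b)` is assumed to return the value OT_{eps,alpha}(a, b)."""
    return (ot_solver(mu, nu)
            - 0.5 * ot_solver(mu, mu)
            - 0.5 * ot_solver(nu, nu))
```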
  23. First: a different way of regularizing. For a regularization parameter γ ∈ [0, ∞] and α ∈ (0, 1), the restricted transport polytope
    Π^α_γ(µ, ν) := { π ∈ Π(µ, ν) : R_α(π | µ ⊗ ν) ≤ γ }
    is weakly compact, since R_α(· | µ ⊗ ν) is weakly lower semicontinuous and Π(µ, ν) is weakly compact.
    Definition (Rényi-Sinkhorn distance). The Rényi-Sinkhorn distance between µ, ν ∈ P_p(X) is
    d_{γ,α} : P_p(X) × P_p(X) → R, (µ, ν) ↦ min{ ⟨d^p, π⟩^{1/p} : π ∈ Π^α_γ(µ, ν) }. (2)
    Theorem (Bresch, S. '24).
    • For (µ, ν) ∈ P_p(X)², the optimization problem (2) is convex and has a unique minimizer.
    • P_p(X)² ∋ (µ, ν) ↦ 1_{[µ≠ν]} d_{γ,α}(µ, ν) is a metric for α ∈ (0, 1), γ ∈ [0, ∞].
  24. [Diagram: the transport polytope Π(µ, ν) inside M(X × X), the restricted transport polytope Π^α_γ(µ, ν), the product measure µ ⊗ ν, and the optimal plans π_c and π^{α,γ}_c, for c = d^p. Plot inspired by (Cuturi, 2013).]
  25. The dual point of view - penalizing the constraint. Instead of restricting the problem's domain, penalize the Rényi divergence constraint in (2).
    Definition (dual Rényi-divergence-Sinkhorn distance). The dual Rényi-divergence-Sinkhorn distance for α ∈ (0, 1), ε ∈ [0, ∞) is
    d^{α,ε} : P_p(X) × P_p(X) → R, (µ, ν) ↦ ⟨d^p, π^{α,ε}(µ, ν)⟩^{1/p},
    where π^{α,ε}(µ, ν) ∈ argmin{ ⟨d^p, π⟩ + ε R_α(π | µ ⊗ ν) : π ∈ Π(µ, ν) }. (3)
    Theorem (Lagrangian point of view and pre-metric [Bresch, S. '24]). Let (µ, ν) ∈ P_p(X)².
    • The optimization problem (3) is convex and has a unique minimizer.
    • The Rényi-Sinkhorn distance d_{γ,α}(µ, ν) and the dual Rényi-Sinkhorn distance d^{α,ε}(µ, ν) are equivalent: for every γ > 0 there exists ε ∈ [0, ∞] such that ⟨d^p, π^{α,ε}(µ, ν)⟩ = d_{γ,α}(µ, ν)^p.
  26. Dual formulation, representation of π^{α,ε}. From now on: X compact. By the Riesz representation theorem, M(X), the space of all finite signed Borel measures on X, is the dual of C(X), the space of real-valued continuous functions on X. Recall (f ⊕ g)(x, y) := f(x) + g(y).
    Theorem (Dual problem, dual representation [Bresch, S. '24]). We have the strong duality
    OT_{ε,α}(µ, ν) = max_{f,g ∈ C(X), f⊕g ≤ d} ⟨f ⊕ g, µ ⊗ ν⟩ − ε ln⟨ (d − f ⊕ g)^{α/(α−1)}, µ ⊗ ν ⟩ + C_{α,ε}. (4)
    The optimal dual potentials f̂, ĝ ∈ C(X) from (4) are unique supp(µ ⊗ ν)-a.e. up to additive constants, and the unique optimal plan is
    π^{α,ε} ∝ (d − f̂ ⊕ ĝ)^{1/(α−1)} · (µ ⊗ ν).
    Proof idea: use the Fenchel-Rockafellar theorem, extending the objective to M(X) × M(X) by +∞.