WORKSHOP ON OPTIMAL TRANSPORT
FROM THEORY TO APPLICATIONS
INTERFACING DYNAMICAL SYSTEMS, OPTIMIZATION, AND MACHINE LEARNING
Venue: Humboldt University of Berlin, Dorotheenstraße 24
Italian Institute of Technology and University College London
Vladimir Kostic, Karim Lounici, Pietro Novelli
Also thanks to: Carlo Ciliberto, Antoine Chatalic, Riccardo Grazzi, Prune Inzerilli, Andreas Maurer, Giacomo Meanti, Lorenzo Rosasco, Giacomo Turri
Dynamical systems are a key part of science and engineering.
• They are mathematical models of temporally evolving phenomena, described by a state variable $x_t \in \mathcal{X}$ evolving over time $t$.
The Koopman/Transfer Operator
• The forward transfer operator $A:\mathcal{F}\to\mathcal{F}$ returns the expected value of an observable (i.e. any function of the state) one step ahead in the future: $(Af)(x) = \mathbb{E}[\,f(X_{t+1}) \mid X_t = x\,]$.
• It globally linearizes the DS within a suitable invariant subspace $\mathcal{F}$, i.e. $A[\mathcal{F}]\subseteq\mathcal{F}$.
• If $\mathcal{F} = \operatorname{span}\{\phi_1,\dots,\phi_r\}$, then $A$ is identified by an $r\times r$ matrix $G$ evolving the function coefficients: $\langle w,\phi(\cdot)\rangle \mapsto \langle Gw,\phi(\cdot)\rangle$.
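As a concrete illustration of this matrix representation (an example added here, not on the original poster): for a deterministic linear system $x_{t+1} = M x_t$ with $M\in\mathbb{R}^{d\times d}$, take $\mathcal{F}$ to be the span of the coordinate functions $\phi_i(x) = x_i$. Then $(Af)(x) = f(Mx)$, so for $f = \langle w, \phi(\cdot)\rangle$ we get $(Af)(x) = \langle w, Mx\rangle = \langle M^\top w, \phi(x)\rangle$, i.e. $G = M^\top$.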
Spectral decomposition (self-adjoint compact case): $A = \sum_{i=1}^{\infty} \lambda_i\, f_i \otimes f_i$, where the $f_i$ are eigenfunctions of $A$ with eigenvalues $\lambda_i$, so that
$\mathbb{E}[\,f(X_t)\mid X_0 = x\,] = (A^t f)(x) = \sum_i \lambda_i^t\, f_i(x)\,\langle f, f_i\rangle .$
• It disentangles an observable's forecast into temporal and static components.
• A good $\mathcal{F}$ is given by the span of the leading eigenfunctions.
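A classical instance (added here for illustration, not on the poster): for the Gaussian AR(1) process $X_{t+1} = a X_t + \sqrt{1-a^2}\,\xi_{t+1}$ with $\xi_t\sim\mathcal{N}(0,1)$ i.i.d. and $|a|<1$, the transfer operator on $L^2_\pi$ with $\pi = \mathcal{N}(0,1)$ is self-adjoint with eigenvalues $\lambda_k = a^k$ and eigenfunctions the normalized Hermite polynomials, so the forecast of any observable decays along these modes at rates $a^{kt}$.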
• We restrict to an RKHS $\mathcal{H}$ with feature map $\phi$ and look for an operator $G:\mathcal{H}\to\mathcal{H}$ such that $A_\pi\langle w,\phi(\cdot)\rangle \approx \langle Gw,\phi(\cdot)\rangle$, where $\pi$ is the invariant distribution and $A_\pi$ the transfer operator on $L^2_\pi(\mathcal{X})$.
• That is, along the state sequence $\dots \to x_t \to x_{t+1} \to \dots$ the representation $\dots \to \phi(x_t) \to \phi(x_{t+1}) \to \dots$ should evolve linearly, with loss $\|\phi(x_{t+1}) - G^*\phi(x_t)\|^2$.
• We aim to minimize the risk $R(G) = \mathbb{E}_{(X,Y)\sim\rho}\,\|\phi(Y) - G^*\phi(X)\|^2$.
Empirical Risk Minimization
• Given a sample $(x_i, y_i = x_{i+1})_{i=1}^n \sim \rho$, learn $\hat G:\mathcal{H}\to\mathcal{H}$ by minimizing
$\sum_{i=1}^n \|\phi(y_i) - G^*\phi(x_i)\|^2 + \gamma\,\|G\|_{HS}^2 .$
• The solution has the form of a finite expansion in the kernel sections $k(x_j,\cdot)$.
• Kernel Ridge Regression (KRR): full-rank solution $W_{KRR} = (K+\gamma I)^{-1}$, where $K = (k(x_i,x_j))_{i,j=1}^n$.
• Principal Component Regression (PCR): low-rank; minimizes the risk on the feature subspace spanned by the principal components.
• Reduced Rank Regression (RRR): low-rank; adds a rank constraint, leading to a generalized eigenvalue problem.
A minimal sketch of the KRR estimator is given below.
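The following NumPy sketch (added here; not the poster's or kooplearn's implementation) estimates the leading eigenvalues of the full-rank KRR estimator, assuming a Gaussian kernel and toy AR(1) data; it uses the standard push-through identity that the nonzero eigenvalues of $\hat G = (\hat C_X+\gamma I)^{-1}\hat C_{XY}$ coincide with those of the $n\times n$ matrix $K_{YX}(K_X + n\gamma I)^{-1}$. All function and variable names are illustrative.

```python
import numpy as np

def gaussian_kernel(A, B, sigma=1.0):
    # Gaussian (RBF) kernel matrix between the rows of A and B.
    sq_dists = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-sq_dists / (2.0 * sigma ** 2))

def krr_transfer_eigenvalues(X, Y, gamma=1e-3, sigma=1.0):
    # Full-rank (KRR) estimator: the nonzero eigenvalues of
    # G_hat = (C_X + gamma I)^{-1} C_XY coincide with those of the
    # n x n matrix K_YX (K_X + n*gamma*I)^{-1}.
    n = X.shape[0]
    K_X = gaussian_kernel(X, X, sigma)      # k(x_i, x_j)
    K_YX = gaussian_kernel(Y, X, sigma)     # k(y_i, x_j)
    M = K_YX @ np.linalg.inv(K_X + n * gamma * np.eye(n))
    return np.linalg.eigvals(M)

# Toy data: Gaussian AR(1) process x_{t+1} = a x_t + sqrt(1 - a^2) * noise.
rng = np.random.default_rng(0)
a, T = 0.9, 1000
traj = np.zeros((T, 1))
for t in range(T - 1):
    traj[t + 1] = a * traj[t] + np.sqrt(1 - a ** 2) * rng.standard_normal(1)
X, Y = traj[:-1], traj[1:]

eigs = krr_transfer_eigenvalues(X, Y, gamma=1e-4, sigma=0.5)
print(np.sort(np.abs(eigs))[::-1][:4])  # leading moduli, expected near 1, a, a^2, ...
```

For this process the leading eigenvalue moduli should approach $1, a, a^2, \dots$, in line with the AR(1) example above.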
• Estimation error: $E(\hat G) = \|A_\pi|_{\mathcal{H}} - \hat G\|_{\mathcal{H}\to L^2_\pi} = \sup_{\|h\|_{\mathcal{H}}\le 1}\,\|(A_\pi - \hat G)h\|_{L^2_\pi}$.
• Metric distortion: $\eta(h) = \|h\|_{\mathcal{H}} / \|h\|_{L^2_\pi}$.
• We wish to bound $E(\hat G)$ and link the spectrum of $\hat G$ to that of $A_\pi$.
• Both quantities appear in spectral error bounds: $\hat G\hat h = \hat\lambda\hat h \ \Rightarrow\ \dfrac{\|A_\pi\hat h - \hat\lambda\hat h\|_{L^2_\pi}}{\|\hat h\|_{L^2_\pi}} \le E(\hat G)\,\eta(\hat h)$.
• Even if $E(\hat G)$ is small, spectral estimation may fail!
• KRR is problematic because $\eta(\hat h)$ may deteriorate with the estimator rank.
Estimation error decomposition
$E(\hat G) \le \underbrace{\|(I - P_{\mathcal{H}})A_\pi|_{\mathcal{H}}\|}_{\text{representation error}} + \underbrace{\|P_{\mathcal{H}}A_\pi|_{\mathcal{H}} - G\|}_{\text{estimator bias}} + \underbrace{\|G - \hat G\|}_{\text{estimator variance}}$
• All operator norms are from $\mathcal{H}$ to $L^2_\pi$; $G$ is the true (population) version of $\hat G$.
• Projection operator: $P_{\mathcal{H}} f = \operatorname{argmin}_{h\in\mathcal{H}}\|f - h\|_{L^2_\pi}$, for $f\in L^2_\pi$.
• Covariance operator: $C = \mathbb{E}_{X\sim\pi}\,\phi(X)\otimes\phi(X)$.
• Metric distortion, equivalently: $\eta(h) = \|h\|_{\mathcal{H}} / \|C^{1/2}h\|_{\mathcal{H}}$.
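For completeness, a one-line check (added here) that the two expressions for the metric distortion agree: by the reproducing property, $\|h\|^2_{L^2_\pi} = \mathbb{E}_{X\sim\pi}\,h(X)^2 = \mathbb{E}_{X\sim\pi}\,\langle h, \phi(X)\rangle^2_{\mathcal{H}} = \langle h, C h\rangle_{\mathcal{H}} = \|C^{1/2}h\|^2_{\mathcal{H}}$.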
Theorem 1 (RRR): Let $\epsilon_n = n^{-\frac{\alpha}{2(\alpha+\beta)}}$. With probability at least $1-\delta$, the RRR estimator satisfies
$E(\hat G) \le \sigma_{r+1}(A_\pi|_{\mathcal{H}}) + \epsilon_n \ln\tfrac{1}{\delta}$, and
$|\lambda_i(A_\pi) - \hat\lambda_i| \lesssim \underbrace{\frac{\sigma_{r+1}(A_\pi|_{\mathcal{H}})}{\sigma_r(A_\pi|_{\mathcal{H}})}}_{\text{spectral bias}} + \epsilon_n\ln\tfrac{1}{\delta} .$
• $\beta\in[0,1]$: effective dimension of $\mathcal{H}$ in $L^2_\pi$ (spectral decay of $C$).
• $\alpha\in(0,2]$: "size" of the image of $A_\pi|_{\mathcal{H}}$ in $L^2_\pi$.
• The rate $\epsilon_n$ is minimax optimal.
Theorem (PCR): Let $\epsilon_n = n^{-\frac{\alpha}{2(\alpha+\beta)}}$. With probability at least $1-\delta$, the PCR estimator satisfies
$E(\hat G) \le \sigma^{1/2}_{r+1}(C) + \epsilon_n\ln\tfrac{1}{\delta}$, and
$|\lambda_i(A_\pi) - \hat\lambda_i| \lesssim \underbrace{\frac{2\,\sigma^{1/2}_{r+1}(C)}{\bigl[\sigma_r(A_\pi|_{\mathcal{H}}) - \sigma^{\alpha/2}_{r+1}(C)\bigr]_+}}_{\text{spectral bias}} + \epsilon_n\ln\tfrac{1}{\delta} .$
• $\alpha$ and $\beta$ are as in Theorem 1.
• There is an irreducible bias even if $A_\pi|_{\mathcal{H}}$ has rank $r$.
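To read the rate (an illustrative instance added here): in the most favorable regime $\alpha = 2$, $\beta = 0$ one gets $\epsilon_n = n^{-2/(2\cdot 2)} = n^{-1/2}$, the parametric rate, while e.g. $\alpha = 1$, $\beta = 1$ gives the slower $\epsilon_n = n^{-1/4}$.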
• The eigenfunctions' estimation error is linked to the eigenvalues' error and to the spectral gap:
$\|h_i - \hat h_i\|_{L^2_\pi} \le \frac{2\,|\lambda_i - \hat\lambda_i|}{\bigl[\operatorname{gap}_i(A_\pi) - |\lambda_i - \hat\lambda_i|\bigr]_+}, \qquad \operatorname{gap}_i(A_\pi) = \min_{j\neq i}\,|\lambda_i(A_\pi) - \lambda_j(A_\pi)| .$
• Experiment (1D quadruple-well potential): RRR estimates the dominant eigenfunctions better than PCR.
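For intuition (a numerical instance added here): with $\operatorname{gap}_i(A_\pi) = 0.1$ and $|\lambda_i - \hat\lambda_i| = 0.01$, the bound gives $\|h_i - \hat h_i\|_{L^2_\pi} \le 0.02/0.09 \approx 0.22$; it becomes vacuous once the eigenvalue error reaches the gap.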
• The estimators can also be used to forecast distributions: if $X_{t-1}\sim\mu_{t-1}$ with density $q_{t-1}\in L^2_\pi(\mathcal{X})$ w.r.t. $\pi$, then $q_t = A^*_\pi q_{t-1}$.
• Further, if $\mathcal{H}$ is invariant and $K_{\mu_t} = \mathbb{E}_{X\sim\mu_t}\,\phi(X)$ denotes the kernel mean embedding, then $K_{\mu_t} = G^* K_{\mu_{t-1}}$.
• Estimators based on a deflate-learn-inflate procedure obtain uniform-in-time forecasting bounds. A sketch of the embedding recursion is given below.
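A minimal sketch (added here; it uses an explicit polynomial feature map and a plain ridge estimate of $G$, not the paper's RKHS estimators or the deflate-learn-inflate procedure) of the recursion $K_{\mu_t} = G^* K_{\mu_{t-1}}$; all names are illustrative.

```python
import numpy as np

def features(x):
    # Illustrative polynomial feature map phi: R -> R^4 (an assumption,
    # not the RKHS features used in the papers).
    x = np.atleast_1d(x).astype(float)
    return np.stack([np.ones_like(x), x, x ** 2, x ** 3], axis=-1)

# Simulate the Gaussian AR(1) process used above.
rng = np.random.default_rng(1)
a, T = 0.9, 5000
traj = np.zeros(T)
for t in range(T - 1):
    traj[t + 1] = a * traj[t] + np.sqrt(1 - a ** 2) * rng.standard_normal()

Phi_X, Phi_Y = features(traj[:-1]), features(traj[1:])
gamma = 1e-6
# Ridge estimate of G: minimizes sum ||phi(y_i) - G^T phi(x_i)||^2 + gamma ||G||^2.
G = np.linalg.solve(Phi_X.T @ Phi_X + gamma * np.eye(Phi_X.shape[1]),
                    Phi_X.T @ Phi_Y)

# Forecast the mean embedding of mu_t starting from a point mass at x0:
# m_t = G^* m_{t-1}, i.e. m_t = G.T @ m_{t-1}.
x0, steps = 2.0, 10
m = features(x0)[0]
for _ in range(steps):
    m = G.T @ m

# Monte Carlo check: average phi over trajectories started at x0.
samples = np.full(20000, x0)
for _ in range(steps):
    samples = a * samples + np.sqrt(1 - a ** 2) * rng.standard_normal(samples.shape)
print("embedding forecast :", np.round(m, 3))
print("Monte Carlo average:", np.round(features(samples).mean(axis=0), 3))
```

Here the cubic feature space is invariant under the AR(1) dynamics, so the embedding forecast should closely match the Monte Carlo average.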
• We assume $A_\pi$ is compact, so it admits an SVD: $A_\pi = \sum_{i=1}^{\infty}\sigma_i\, u_i\otimes v_i^*$.
• A good $\mathcal{H}$ should: (1) approximate the operator well, $P_{\mathcal{H}}A_\pi \approx A_\pi$; (2) have low representation error, $\|(I-P_{\mathcal{H}})A_\pi|_{\mathcal{H}}\| \approx 0$; (3) have low metric distortion, $\eta(h)\approx 1$ for all $h\in\mathcal{H}$ (i.e. $C_{\mathcal{H}}\approx I$).
• If the desiderata are met with $\mathcal{H} = \operatorname{span}(u_1,\dots,u_r)$, then $\|(I-P_{\mathcal{H}})A_\pi|_{\mathcal{H}}\| \le \sigma_{r+1}\,\|C^{1/2}_{\mathcal{H}}\| \approx \sigma_{r+1}$.
• Moreover, if $A_\pi$ is normal, the representation error is zero.
• To learn the SVD we learn feature maps $\psi_w, \psi'_w:\mathcal{X}\to\mathbb{R}^r$ maximizing the CCA score
$\|P_{\mathcal{H}_w} A_\pi P_{\mathcal{H}'_w}\|^2_{HS} = \|(C^w_X)^{-1/2} C^w_{XY} (C^w_Y)^{-1/2}\|^2_{HS} .$
• We introduce the relaxed, differentiable score
$\frac{\|C^w_{XY}\|^2_{HS}}{\|C^w_X\|\,\|C^w_Y\|} - \gamma\,\|I - C^w_X\|^2_{HS} - \gamma\,\|I - C^w_Y\|^2_{HS} .$
• If the representation is rich enough, both scores have the same set of maximizers. A sketch of the relaxed score is given below.
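A minimal PyTorch sketch (added here; not the authors' implementation, and the un-subscripted norms in the denominator are read as operator norms) of the relaxed score evaluated on mini-batch features; all names are illustrative.

```python
import torch

def relaxed_score(psi_x, psi_y, gamma=1e-2):
    """Relaxed differentiable score computed from (n, r) feature matrices.

    psi_x, psi_y: outputs of the two learned feature maps on a batch of
    consecutive states (x_t, x_{t+1}); covariances are the empirical ones.
    """
    n, r = psi_x.shape
    C_X = psi_x.T @ psi_x / n    # C_X^w
    C_Y = psi_y.T @ psi_y / n    # C_Y^w
    C_XY = psi_x.T @ psi_y / n   # C_XY^w
    eye = torch.eye(r, dtype=psi_x.dtype)
    corr = torch.linalg.matrix_norm(C_XY) ** 2 / (
        torch.linalg.matrix_norm(C_X, ord=2) * torch.linalg.matrix_norm(C_Y, ord=2)
    )
    reg = torch.linalg.matrix_norm(eye - C_X) ** 2 + torch.linalg.matrix_norm(eye - C_Y) ** 2
    return corr - gamma * reg

# Usage with a toy (untrained) feature map, taking psi_w = psi'_w:
torch.manual_seed(0)
net = torch.nn.Sequential(torch.nn.Linear(1, 32), torch.nn.Tanh(), torch.nn.Linear(32, 4))
x = torch.randn(256, 1)
y = 0.9 * x + 0.1 * torch.randn(256, 1)   # one-step pairs (x_t, x_{t+1})
score = relaxed_score(net(x), net(y))
score.backward()                          # gradients w.r.t. the feature-map parameters
print(float(score))
```

Because every term is differentiable, the score can be maximized by gradient ascent over the parameters of the feature maps.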
Two Examples - Advantage of DPNets [Kostic et al., ICLR 2024] (see paper)
• We can use the learned $\psi_w$ for forecasting or interpretability.
• Figure: images predicted by our method (rows 1 and 2) vs. other methods.
• Prediction of the potential energy (then used for atomistic simulations) [Falk et al., NeurIPS 2023] [Wang et al., T-PAMI 2023].
• A theory explaining why pre-training works well for meta-learning.
Theorem: $\mathbb{E}\,\mathcal{E}(W^{\mathrm{pre}}_N, \theta^{\mathrm{pre}}_N, \mu) \le \min_{W,\theta}\,\mathcal{E}(W,\theta,\mu) + O(1/\sqrt{nT}) .$
• We studied learning of the transfer (Koopman) operator of discrete stochastic dynamical systems, leading to efficient and reliable algorithms.
• We derived spectral learning bounds and addressed representation learning.
• Future work: study continuous, non-stationary, and interactive DS, and incorporate invariances into the estimators.
Thanks!
• V. Kostic, P. Novelli, A. Maurer, C. Ciliberto, L. Rosasco, M. Pontil. Learning dynamical systems via Koopman operator regression in reproducing kernel Hilbert spaces. NeurIPS 2022.
• V. Kostic, K. Lounici, P. Novelli, M. Pontil. Koopman operator learning: sharp spectral rates and spurious eigenvalues. NeurIPS 2023.
• G. Meanti, A. Chatalic, V. Kostic, P. Novelli, M. Pontil, L. Rosasco. Estimating Koopman operators with sketching to provably learn large scale dynamical systems. NeurIPS 2023.
• V. Kostic, P. Novelli, R. Grazzi, K. Lounici, M. Pontil. Learning invariant representations of time-homogeneous stochastic dynamical systems. ICLR 2024.
• P. Inzerilli, V. Kostic, K. Lounici, P. Novelli, M. Pontil. Consistent long-term forecasting of ergodic dynamical systems. Submitted, 2024.
• G. Turri, V. Kostic, P. Novelli, M. Pontil. A randomized algorithm to solve reduced rank operator regression. Submitted, 2024.
Code: https://github.com/Machine-Learning-Dynamical-Systems/kooplearn