Upgrade to Pro — share decks privately, control downloads, hide ads and more …

非凸確率的最適化と再生核ヒルベルト空間の最適化

Taiji Suzuki
June 18, 2023
97

 非凸確率的最適化と再生核ヒルベルト空間の最適化

「機械学習における最適化理論と学習理論的側面@組合せ最適化セミナー2020」第二部

Taiji Suzuki

June 18, 2023
Tweet

Transcript

  1. ඇತؔ਺Ͱͷ SGD ໨తؔ਺ɿL(x) = Ez [ℓ(z, x)]ɽ (Լʹ༗ք, L∗ =

    infx L(x) ͱ͢Δ) SGD zt ∼ P(Z) Λ؍ଌɽℓt (x) := ℓ(zt, x) ͱ͢Δ. gt ∈ ∂x ℓt (xt−1 ). xt = xt−1 − ηt gt. Ծఆ (A1) L ͸ γ-ฏ׈ (A2) E[∥gt − E[gt ]∥2] = σ2 (֬཰తޯ഑ͷ෼ࢄ͸ σ2). ηt = min { ˜ D σ √ T , 1 γ } ͱ͢Δͱ (Ghadimi and Lan (2013)) min 1≤t≤T E[∥∇L(xt )∥2] ≤ γσ √ T ( D2 f ˜ D + ˜ D ) + γ2D2 f T , ͨͩ͠ɼDf = √ 2(L(x1)−L∗) γ ɽ ʢඍ෼͕ 0 ΁ऩଋͯ͠Ώ͘͜ͱΛอূʣ ࠨลͷ min1≤t≤T ͷ୅ΘΓʹɼˆ t ∈ {1, . . . , T} ΛҰ༷෼෍ʹैͬͯબΜͰ E[∥∇L(xˆ t )∥2] ͱͯ͠΋ྑ͍ɽ 5 / 42
  2. ඇತ SVRG min x∈Rp L(x) = min x∈Rp 1 n

    n ∑ i=1 ℓi (x) SVRG Λͦͷ··ඇತؔ਺࠷దԽʹద༻ͯ͠Α͍ɽ ʢͨͩ͠εςοϓαΠζͱϛχ όον਺͸ద੾ʹௐ੔ʣ E[∥∇L(ˆ x)∥2] ≤ ϵ ʹͳΔ·Ͱͷߋ৽ճ਺ T (Allen-Zhu and Hazan, 2016, Reddi et al., 2016) ℓi ͕γ-ฏ׈ͷ࣌ɿ T ≥ Ω ( n + n2/3 ϵ ) . ʢී௨ͷඇ֬཰తޯ഑๏ͳΒ Ω(n/ϵ)ʣ ℓi ͕ γ-ฏ׈͔ͭ L(x) − L(x∗) ≤ τ∥∇L(x)∥2 (∀x) (x∗ ͸େҬత࠷దղ) ͷ࣌ (Polyak- Lojasiewicz, PL ৚݅)ɿ T ≥ Ω ( (n + τn2/3)log(1/ϵ) ) . ʢී௨ͷඇ֬཰తޯ഑๏ͳΒ Ω(τn log(1/ϵ))ʣ 6 / 42
  3. SARAH ͱͦͷվྑ๏ SSRGD (Li, 2019): SARAH + ϊΠζ෇ՃʹΑΔҌ఺୤ग़ Simple Stochastic

    Recursive Gradient Descent (SSRGD) Iterate the following for t = 1, 2, . . . , T: 1 Ҍ఺୤ग़Ϟʔυʹೖ͓ͬͯΒͣɼ∥∇L(xt )∥ ≤ gthresh ͳΒɼ xt ← xt + ξ (ξ ∼ Unif(Br (Rd ))) ͱͯ͠ɼҌ఺୤ग़ϞʔυʹೖΔɽ 2 y0 = xt , v0 = ∇f (xt ) 3 For k = 1, . . . , m, 1 yk = yk−1 − ηvk−1 2 vk = 1 b ∑ i∈Ik (∇fi (yk ) − ∇fi (yk−1 )) + vk−1 (SARAH: variance reduction) 3 ͋Δఀࢭ৚݅Λຬ͍ͨͯͨ͠ΒҌ఺୤ग़ϞʔυΛࢭΊΔɽ 4 xt+1 = ym Output: xT SARAH: StochAstic Recursive grAdient algoritHm (Nguyen et al., 2017, Pham et al., 2020) ΦϯϥΠϯܕͷ৔߹͸ ∇L ͷܭࢉ͸αϯϓϧฏۉʹ͢Δ 1 B ∑ i∈It ∇fi (xt ). ೋ࣍࠷దੑ΋ߴ͍֬཰Ͱอূ 8 / 42
  4. SARAH ʹ͍ͭͯ SARAH: vk = 1 b ∑ i∈Ik (∇fi

    (yk ) − ∇fi (yk−1 )) + vk−1 SVRG: vk = 1 b ∑ i∈Ik (∇fi (yk ) − ∇fi (˜ x)) + ˜ v SVRG ͸಺෦ϧʔϓͷߋ৽ΛਐΊΔͱ෼ࢄ͕େ͖͘ͳΔɽ SARAH ͸಺෦ϧʔϓͷߋ৽ΛਐΊͯ΋෼ࢄ͕େ͖͘ͳΒͳ͍ or 0 ʹऩଋ͢Δ (ڧತͷ৔߹) ˠ ޯ഑͕๫ΕͣɼҰ࣌࠷దੑ৚݅Λຬͨ͢ղΛಘ΍͍͢ɽ (ತ࠷దԽͰ໨తؔ਺஋Λݟ͍ͯΔݶΓ͸͜ͷҧ͍͕ݟʹ͍͘) 9 / 42
  5. ܭࢉྔͷൺֱ ϵ-Ұ࣍࠷దੑ৚݅: E[∥∇L(x)∥2] ≤ ϵ δ-ೋ࣍࠷దੑ৚݅: λmin (∇2L(x)) ≥ −δ

    (with high probability) ΦϯϥΠϯܕ ख๏ ֬཰తޯ഑ͷܭࢉ਺ ࠷దੑ৚݅ GD O(n ϵ ) 1 ࣍ SVRG (Allen-Zhu and Hazan, 2016) O(n + n2/3 ϵ ) 1 ࣍ SARAH (Pham et al., 2020) O(n + √ n ϵ ) 1 ࣍ SSRGD (Li, 2019) O(n + √ n ϵ ) 1 ࣍ PGD (Jin et al., 2017b) O(n ϵ + n δ4 ) 2 ࣍ SSRGD (Li, 2019) O( √ n ϵ + √ n δ4 + n δ3 ) 2 ࣍ ༗ݶ࿨ܕ ख๏ ֬཰తޯ഑ͷܭࢉ਺ ࠷దੑ৚݅ SGD (Ghadimi and Lan, 2013) O(1/ϵ2) 1 ࣍ SVRG+ (Li and Li, 2018) O(1/ϵ7/4) 1 ࣍ SARAH (Pham et al., 2020) O(1/ϵ3/2) 1 ࣍ SSRGD (Li, 2019) O(1/ϵ3/2) 1 ࣍ SSRGD (Li, 2019) O( 1 ϵ3/2 + 1 ϵδ3 + 1 ϵ1/2δ4 ) 2 ࣍ 10 / 42
  6. ʢࢀߟʣStrict saddle ਂ૚ֶशͳͲ͸ఀཹ఺͕ଟ͍ɽ ໨తؔ਺͕ strict saddle property ͱ͍͏ੑ࣭Λຬ͍ͨͯ͠Ε͹ɼαυϧϙΠ ϯτΛճආ͢Δ͜ͱ͕Ͱ͖Δɽ 

     ৴པྖҬ๏ (Conn et al., 2000) ΍ࡶԻΛՃ͑ͨ֬཰తޯ഑๏ (Ge et al., 2015) ͸ strict saddle ͳ໨తؔ਺ͷہॴత࠷దղʹ౸ୡ͢Δ (Sun et al., 2015).   ˞ ղʹࡶԻΛՃ͑Δ͜ͱͰαυϧϙΠϯτ͔Βൈ͚ग़ͤΔɽ Strict saddle ೋճඍ෼Մೳͳؔ਺ f ͕ strict saddle Ͱ͋Δͱ͸ɼ∀x Ͱ࣍ͷͲΕ͔͕ຬͨ͞Ε ͍ͯΔ: ∥∇f (x)∥ ≥ ϵ. λmin (∇2f (x)) ≤ −γ. ͋Δ x∗ ͕ଘࡏͯ͠ ∥x − x∗∥ ≤ δ ͔ͭ f (x) ͕ x∗ ͷۙ๣ {x′ | ∥x∗ − x′∥ ≤ 2δ} Ͱڧತؔ਺. E.g., ςϯιϧ෼ղ maxu∈Rp ⟨ ∑ d r=1 a⊗4 r , u ⊗ u ⊗ u ⊗ u ⟩ ͸ a⊤ r ar′ = δr,r′ ͳΒ strict saddleɽ 11 / 42
  7. ઢܗ੍໿͋Γͷֶश໰୊ min x 1 n n ∑ i=1 fi (z⊤

    i x) + ψ(B⊤x) ⇔ min x,y 1 n n ∑ i=1 fi (z⊤ i x) + ψ(y) s.t. y = B⊤x.   ֦ுϥάϥϯδΞϯ L(x, y, λ) = 1 n ∑ i fi (z⊤ i x) + ψ(y) + λ⊤(y − B⊤x) + ρ 2 ∥y − B⊤x∥2   inf x,y sup λ L(x, y, λ) Ͱ࠷దղ͕ٻ·Δɽ ৐਺๏: Hestenes (1969), Powell (1969), Rockafellar (1976). ަޓํ޲৐਺๏ (ADMM): Gabay and Mercier (1976), Mota et al. (2011), He and Yuan (2012), Deng and Yin (2012), Hong and Luo (2012a) ֬཰తަޓํ޲৐਺๏: SGD-ADMM (Suzuki, 2013, Ouyang et al., 2013), RDA-ADMM (Suzuki, 2013), SDCA-ADMM (Suzuki, 2014), SVRG-ADMM (Zheng and Kwok, 2016), ASVRG-ADMM (Liu et al., 2017). 12 / 42
  8. ߏ଄తਖ਼ଇԽͷྫ Overlapped group lasso ˜ ψ(w) = C ∑ g∈G

    ∥wg ∥ It is difficult to compute the proximal mapping. Solution: Prepare ψ for which proximal mapping is easily computable. Let ψ(B⊤w) = ˜ ψ(w), and utilize the proximal mapping w.r.t. ψ. BTw Decompose into independent groups: B⊤w = ψ(y) = C ∑ g′∈G′ ∥yg′ ∥ prox(q|ψ) = ( qg′ max { 1 − C ∥qg′ ∥ , 0 }) g′∈G′ 13 / 42
  9. ͦͷଞͷྫ Graph guided regularization ˜ ψ(w) = C ∑ (i,j)∈E

    |wi − wj |.      x1 x2 x3 x4 x5 ψ(y) = C ∑ e∈E |ye|, y = B⊤w = (wi − wj )(i,j)∈E ⇒    ψ(B⊤w) = ˜ ψ(w), prox(q|ψ) = ( qe max { 1 − C |qe | , 0 }) e∈E . Soft-Thresholding function. 14 / 42
  10. ߏ଄తਖ਼ଇԽʹର͢Δަޓํ޲৐਺๏ min x {f (x) + ψ(B⊤w)} ⇔ min x,y

    {f (x) + ψ(y) s.t. y = B⊤x} L(x, y, λ) = f (x) + ψ(y) + λ⊤(y − B⊤x) + ρ 2 ∥y − B⊤x∥2 ͨͩ͠ f (x) = 1 n ∑ fi (z⊤ i x) ADMM ʹΑΔߏ଄తਖ਼ଇԽֶश x(t) = arg min x {f (x) + λ(t−1)⊤ (−B⊤x) + ρ 2 ∥y(t−1) − B⊤x∥2} y(t) = arg min y {ψ(y) + λ(t)⊤ y + ρ 2 ∥y − B⊤x(t)∥2} (= prox(B⊤x(t) − λ(t)/ρ|ψ/ρ)) λ(t) = λ(t−1) − ρ(B⊤x(t) − y(t)) y ͷߋ৽͸୯७ͳ ψ ʹΑΔۙ઀ࣸ૾. → ղੳղ. Ұൠతʹ͸ O(1/k) (He and Yuan, 2012), ڧತͳΒ͹ઢܗऩଋ (Deng and Yin, 2012, Hong and Luo, 2012b)ɽ 15 / 42
  11. SGD-ADMM minx EZ [ℓ(x, Z)] + ψ(B⊤x) ⇒ ֦ுϥάϥϯδΞϯ: EZ

    [ℓ(x, Z)] + ψ(y) + λ⊤(y − B⊤x) + ρ 2 ∥y − B⊤x∥2. ௨ৗͷ SGD: xt+1 = arg minx { ⟨gt , x⟩ + ˜ ψ(x) + 1 2ηt ∥x − xt ∥2 } (gt ∈ ∂x ℓ(xt , zt )). SGD-ADMM xt+1 =argmin x∈X { g⊤ t x − λt ⊤(B⊤x − yt ) + ρ 2 ∥B⊤x − yt∥2 + 1 2ηt ∥x − xt∥2 Gt } , yt+1 = argmin y∈Y { ψ(y) − λ⊤ t (B⊤xt+1 − y) + ρ 2 ∥B⊤xt+1 − y∥2 } λt+1 =λt − ρ(B⊤xt+1 − yt+1 ). yt+1 ͱ λt+1 ͷߋ৽͸௨ৗͷ ADMM ͱಉ͡ɽ Gt ͸೚ҙͷਖ਼ఆ஋ରশߦྻɽ 16 / 42
  12. SGD-ADMM minx EZ [ℓ(x, Z)] + ψ(B⊤x) ⇒ ֦ுϥάϥϯδΞϯ: EZ

    [ℓ(x, Z)] + ψ(y) + λ⊤(y − B⊤x) + ρ 2 ∥y − B⊤x∥2. ௨ৗͷ SGD: xt+1 = arg minx { ⟨gt , x⟩ + ˜ ψ(x) + 1 2ηt ∥x − xt ∥2 } (gt ∈ ∂x ℓ(xt , zt )). SGD-ADMM xt+1 =argmin x∈X { g⊤ t x − λt ⊤(B⊤x − yt ) + ρ 2 ∥B⊤x − yt∥2 + 1 2ηt ∥x − xt∥2 Gt } , yt+1 = argmin y∈Y { ψ(y) − λ⊤ t (B⊤xt+1 − y) + ρ 2 ∥B⊤xt+1 − y∥2 } =prox(B⊤xt+1 − λt/ρ|ψ), λt+1 =λt − ρ(B⊤xt+1 − yt+1 ). yt+1 ͱ λt+1 ͷߋ৽͸௨ৗͷ ADMM ͱಉ͡ɽ Gt ͸೚ҙͷਖ਼ఆ஋ରশߦྻɽ 16 / 42
  13. RDA-ADMM ௨ৗͷ RDA: wt+1 = arg minw { ⟨¯ gt

    , w⟩ + ˜ ψ(w) + 1 2ηt ∥w∥2 } (¯ gt = 1 t (g1 + · · · + gt )) RDA-ADMM Let ¯ xt = 1 t ∑ t τ=1 xτ , ¯ λt = 1 t ∑ t τ=1 λτ , ¯ yt = 1 t ∑ t τ=1 yτ , ¯ gt = 1 t ∑ t τ=1 gτ . xt+1 =argmin x∈X { ¯ g⊤ t x − (B¯ λt )⊤x + ρ 2t ∥B⊤x∥2 +ρ(B⊤¯ xt − ¯ yt )⊤B⊤x + 1 2ηt ∥x∥2 Gt } , yt+1 =prox(B⊤xt+1 − λt/ρ|ψ), λt+1 =λt − ρ(B⊤xt+1 − yt+1 ). yt+1 ͱ λt+1 ͷߋ৽͸௨ৗͷ ADMM ͱಉ͡ɽ 17 / 42
  14. Convergence analysis We bound the expected risk: Expected risk P(x)

    = EZ [ℓ(Z, x)] + ˜ ψ(x). Assumptions: (A1) ∃G s.t. ∀g ∈ ∂x ℓ(z, x) satisfies ∥g∥ ≤ G for all z, x. (A2) ∃L s.t. ∀g ∈ ∂ψ(y) satisfies ∥g∥ ≤ L for all y. (A3) ∃R s.t. ∀x ∈ X satisfies ∥x∥ ≤ R. 18 / 42
  15. Convergence rate: bounded gradient (A1) ∃G s.t. ∀g ∈ ∂x

    ℓ(z, x) satisfies ∥g∥ ≤ G for all z, x. (A2) ∃L s.t. ∀g ∈ ∂ψ(y) satisfies ∥g∥ ≤ L for all y. (A3) ∃R s.t. ∀x ∈ X satisfies ∥x∥ ≤ R. Theorem (Convergence rate of RDA-ADMM) Under (A1), (A2), (A3), we have Ez1:T−1 [P(¯ xT ) − P(x∗)] ≤ 1 T T ∑ t=2 ηt−1 2(t − 1) G2 + γ ηT ∥x∗∥2 + K T . Theorem (Convergence rate of SGD-ADMM) Under (A1), (A2), (A3), we have Ez1:T−1 [P(¯ xT ) − P(x∗)] ≤ 1 2T ∑ T t=2 max { γ ηt − γ ηt−1 , 0 } R2 + 1 T ∑ T t=1 ηt 2 G2 + K T . Both methods have convergence rate O ( 1 √ T ) by letting ηt = η0 √ t for RDA-ADMM and ηt = η0/ √ t for SGD-ADMM. 19 / 42
  16. ༗ݶ࿨ͷ໰୊ ਖ਼ଇԽ͋Γ܇࿅ޡࠩͷ૒ର໰୊ A = [a1, a2, . . . ,

    an ] ∈ Rp×n. min w {1 n n ∑ i=1 fi (a⊤ i w) + ψ(B⊤w) } (P: ओ) = − min x∈Rn,y∈Rd {1 n n ∑ i=1 f ∗ i (xi ) + ψ∗ ( y n ) Ax + By = 0 } (D: ૒ର) ࠷దੑ৚݅: a⊤ i w∗ ∈ ∇f ∗ i (x∗ i ), 1 n y∗ ∈ ∇ψ(u)|u=B⊤w∗ , Ax∗ + By∗ = 0. ⋆ ֤࠲ඪ xi ͸֤؍ଌ஋ ai ʹରԠ. 20 / 42
  17. SDCA-ADMM ֦ுϥάϥϯδΞϯ: L(x, y, w) := ∑ n i=1 f

    ∗ i (xi ) + nψ∗(y/n) − ⟨w, Ax + By⟩ + ρ 2 ∥Ax + By∥2. SDCA-ADMM For each t = 1, 2, . . . Choose i ∈ {1, . . . , n} uniformly at random, and update y(t) ← arg min y { L(x(t−1), y, w(t−1)) + 1 2 ∥y − y(t−1)∥2 Q } x(t) i ← arg min xi ∈R { L([xi ; x(t−1) \i ], y(t), w(t−1)) + 1 2 ∥xi − x(t−1) i ∥2 Gi,i } w(t) ← w(t−1) − ξρ{n(Ax(t) + By(t))−(n − 1)(Ax(t−1) + By(t−1))}. Q, Gi,i ͸͋Δ৚݅Λຬͨ͢ਖ਼ఆ஋ରশߦྻɽ ֤ߋ৽Ͱ i-൪໨ͷ࠲ඪ xi ͷΈߋ৽ɽ w ͷߋ৽͸ؾΛ෇͚Δඞཁ͕͋Δɽ 21 / 42
  18. ઢܗճؼ σβΠϯߦྻ X = (Xij ) ∈ Rn×p. Y =

    [y1, . . . , yn ]⊤ ∈ Rn. ਅͷϕΫτϧ β∗ ∈ Rp: Ϟσϧ : Y = Xβ∗ + ξ. ϦοδճؼʢTsykonov ਖ਼ଇԽʣ ˆ β ← arg min β∈Rp 1 n ∥Xβ − Y ∥2 2 +λn∥β∥2 2 . 24 / 42
  19. ઢܗճؼ σβΠϯߦྻ X = (Xij ) ∈ Rn×p. Y =

    [y1, . . . , yn ]⊤ ∈ Rn. ਅͷϕΫτϧ β∗ ∈ Rp: Ϟσϧ : Y = Xβ∗ + ξ. ϦοδճؼʢTsykonov ਖ਼ଇԽʣ ˆ β ← arg min β∈Rp 1 n ∥Xβ − Y ∥2 2 +λn∥β∥2 2 . ม਺ม׵: ɹ ਖ਼ଇԽ߲ͷͨΊɼˆ β ∈ Ker(X)⊥ɽͭ·Γɼˆ β ∈ Im(X⊤)ɽ ͋Δ ˆ α ∈ Rn ͕ଘࡏͯ͠ɼˆ β = X⊤ ˆ α ͱॻ͚Δɽ (౳Ձͳ໰୊) ˆ α ← arg min α∈Rn 1 n ∥XX⊤α − Y ∥2 2 + λnα⊤(XX⊤)α. ˞ (XX⊤)ij = x⊤ i xj ΑΓɼ؍ଌ஋ xi ͱ xj ͷ಺ੵ͑͞ܭࢉͰ͖Ε͹Α͍ɽ 24 / 42
  20. ϦοδճؼͷΧʔωϧԽ Ϧοδճؼʢม਺ม׵൛ʣ ˆ α ← arg min α∈Rn 1 n

    ∥(XX⊤)α − Y ∥2 2 + λnα⊤(XX⊤)α. ˞ (XX⊤)ij = x⊤ i xj ͸αϯϓϧ xi ͱ xj ͷ಺ੵɽ 25 / 42
  21. ϦοδճؼͷΧʔωϧԽ Ϧοδճؼʢม਺ม׵൛ʣ ˆ α ← arg min α∈Rn 1 n

    ∥(XX⊤)α − Y ∥2 2 + λnα⊤(XX⊤)α. ˞ (XX⊤)ij = x⊤ i xj ͸αϯϓϧ xi ͱ xj ͷ಺ੵɽ • Χʔωϧ๏ͷΞΠσΟΞ x ͷؒͷ಺ੵΛଞͷඇઢܗͳؔ਺Ͱஔ͖׵͑Δ: x⊤ i xj → k(xi , xj ). ͜ͷ k : Rp × Rp → R ΛΧʔωϧؔ਺ͱݺͿ.   Χʔωϧؔ਺ͷຬͨ͢΂͖৚݅ ରশੑ: k(x, x′) = k(x′, x). ਖ਼஋ੑ: ∑ m i=1 ∑ m j=1 αi αj k(xi , xj ) ≥ 0, (∀{xi }m i=1 , {αi }m i=1 , m).   ٯʹ͜ͷੑ࣭Λຬͨؔ͢਺ͳΒԿͰ΋Χʔωϧ๏Ͱ༻͍ͯྑ͍ɽ 25 / 42
  22. ΧʔωϧϦοδճؼ ΧʔωϧϦοδճؼ: K = (k(xi , xj ))n i,j=1 ͱͯ͠ɼ

    ˆ α ← arg min β∈Rn 1 n ∥Kα − Y ∥2 2 + λnα⊤Kα. ৽͍͠ೖྗ x ʹରͯ͠͸ɼ y = n ∑ i=1 k(x, xi )ˆ αi Ͱ༧ଌɽ 26 / 42
  23. ΧʔωϧϦοδճؼ ΧʔωϧϦοδճؼ: K = (k(xi , xj ))n i,j=1 ͱͯ͠ɼ

    ˆ α ← arg min β∈Rn 1 n ∥Kα − Y ∥2 2 + λnα⊤Kα. ৽͍͠ೖྗ x ʹରͯ͠͸ɼ y = n ∑ i=1 k(x, xi )ˆ αi Ͱ༧ଌɽ Χʔωϧؔ਺ ⇔ ࠶ੜ֩ώϧϕϧτۭؒ (RKHS) k(x, x′) Hk ͋Δ ϕ(x) : Rp → Hk ͕ଘࡏͯ͠ɼ k(x, x′) = ⟨ϕ(x), ϕ(x′)⟩Hk ɽ ΧʔωϧτϦοΫ: ⟨ ∑ n i=1 αi ϕ(xi ), ϕ(x)⟩Hk = ∑ n i=1 αi k(xi , x). ˠΧʔωϧؔ਺ͷ஋͑͞ܭࢉͰ͖Ε͹ྑ͍ɽ 26 / 42
  24. ࠶ੜ֩ώϧϕϧτۭؒ (Reproducing Kernel Hilbert Space, RKHS) ೖྗσʔλͷ෼෍ɿPX ɼରԠ͢Δ L2 ۭؒɿL2

    (PX ) = {f | EX∼PX [f (X)2] < ∞}. Χʔωϧؔ਺͸ҎԼͷΑ͏ʹ෼ղͰ͖Δ (Steinwart and Scovel, 2012): k(x, x′) = ∞ ∑ j=1 µj ej (x)ej (x′). (ej )∞ j=1 ͸ L2 (PX ) ಺ͷਖ਼ن௚ަجఈ: ∥ej ∥L2(PX ) = 1, ⟨ej , ej′ ⟩L2(PX ) = 0 (j ̸= j′). µj ≥ 0. Definition (࠶ੜ֩ώϧϕϧτۭؒ (Hk )) ⟨f , g⟩Hk := ∑ ∞ j=1 1 µj αj βj for f = ∑ ∞ j=1 αj ej , g = ∑ ∞ j=1 βj ej ∈ L2 (PX ). ∥f ∥Hk := √ ⟨f , f ⟩Hk . Hk := {f ∈ L2 (PX ) | ∥f ∥Hk < ∞} equipped with ⟨·, ·⟩Hk . ࠶ੜੑ: f ∈ Hk ʹରͯ͠ f (x) ͸಺ੵͷܗͰʮ࠶ੜʯ͞ΕΔ: f (x) = ⟨f , k(x, ·)⟩Hk . 27 / 42
  25. ࠶ੜ֩ώϧϕϧτۭؒͷੑ࣭ ϕk (x) = k(x, ·) ∈ Hk ͱॻ͚͹ɼk(x, x′)

    = ⟨ϕk (x), ϕk (x′)⟩Hk ͱॻ͚Δɽ͜ͷ ϕk Λಛ௃ࣸ૾ͱ΋ݴ͏ɽ Χʔωϧؔ਺ʹରԠ͢Δੵ෼࡞༻ૉ Tk : L2 (PX ) → L2 (PX ): Tk f := ∫ f (x)k(x, ·)dPX (x). ઌͷΧʔωϧؔ਺ͷ෼ղ͸ Tk ͷεϖΫτϧ෼ղʹରԠɽ ࠶ੜ֩ώϧϕϧτۭؒ Hk ͸ҎԼͷΑ͏ʹ΋ॻ͚Δ: Hk = T1/2 k L2 (PX ). ∥f ∥Hk = inf{∥h∥L2(PX ) | f = T1/2 k h, h ∈ L2 (PX )}. f ∈ Hk ͸ f (x) = ∑ ∞ j=1 aj √ µj ej (x) ͱॻ͚ͯɼ∥f ∥Hk = √∑ ∞ j=1 a2 j ɽ (ej )j ͸ L2 ಺ͷਖ਼ن௚ަجఈɼ( √ µj ej )j ͸ RKHS ಺ͷ׬શਖ਼ن௚ަجఈɽ ಛ௃ࣸ૾ ϕk (x) = k(x, ·) ∈ Hk Λ׬શਖ਼ن௚ަجఈʹؔ͢Δ܎਺Ͱදݱ͢Δͱ ϕk (x) = ( √ µ1 e1 (x), √ µ2 e2 (x), . . . )⊤ 28 / 42
  26. ΧʔωϧϦοδճؼͷ࠶ఆࣜԽ ࠶ੜੑ: f ∈ Hk ʹର͠ f (x) = ⟨f

    , ϕ(x)⟩Hk . ΧʔωϧϦοδճؼͷ࠶ఆࣜԽ ˆ f ← min f ∈Hk 1 n n ∑ i=1 (yi − f (xi ))2 + C∥f ∥2 Hk දݱఆཧ ∃αi ∈ R s.t. ˆ f (x) = n ∑ i=1 αi k(xi , x), ⇒ ∥ˆ f ∥Hk = √∑ n i,j=1 αi αj k(xi , xj ) = √ α⊤Kα. ͖͞΄ͲͷΧʔωϧϦοδճؼͷఆࣜԽͱҰகɽ 30 / 42
  27. Χʔωϧͷྫ Ψ΢γΞϯΧʔωϧ k(x, x′) = exp ( − ∥x −

    x′∥2 2σ2 ) ଟ߲ࣜΧʔωϧ k(x, x′) = ( 1 + x⊤x′ )p χ2-Χʔωϧ k(x, x′) = exp ( − γ2 ∑ d j=1 (xj −x′ j )2 (xj +x′ j ) ) Mat´ ern-kernel k(x, x′) = ∫ Rd eiλ⊤(x−x′) 1 (1 + ∥λ∥2)α+d/2 dλ άϥϑΧʔωϧɼ࣌ܥྻΧʔωϧɼ... 31 / 42
  28. ࠶ੜ֩ώϧϕϧτۭؒ಺ͷ֬཰త࠷దԽ ໰୊ઃఆ: yi = f o(xi ) + ξi .

    (xi , yi )n i=1 ͔Β f o Λਪఆ͍ͨ͠ɽ(f o ͸ Hk ʹ΄΅ೖ͍ͬͯΔ) ظ଴ଛࣦͷมܗ: E[(f (X) − Y )2] = E[(f (X) − f o(X) − ξ)2] = E[(f (X) − f o(X))2] + σ2 ˠ minf ∈Hk E[(f (X) − Y )2] Λղ͚͹ f o ͕ٻ·Δɽ Kx = k(x, ·) ∈ Hk ͱ͢Δͱɼf (x) = ⟨f , Kx ⟩Hk ΑΓ L(f ) = E[(f (X) − Y )2] ͷ RKHS ಺Ͱͷ Frechet ඍ෼͸ҎԼͷ௨Γ: ∇L(f ) = 2E[KX (⟨KX , f ⟩Hk − Y )] = 2(E[KX K∗ X ] =:Σ f − E[KX Y ]) = 2(Σf − E[KX Y ]). ظ଴ଛࣦͷޯ഑๏: f ∗ t = f ∗ t−1 − η2(Σf ∗ t−1 − E[KX Y ]). ܦݧଛࣦͷޯ഑๏ (E[·] ͸ඪຊฏۉ): ˆ ft = ˆ ft−1 − η2(Σˆ ft−1 − E[KX Y ]). ֬཰తޯ഑ʹΑΔߋ৽: gt = gt−1 − η2(Kxit K∗ xit gt−1 − Kxit yit ). ˞ (xit , yit )∞ t=1 ͸ (xi , yi )n i=1 ͔Β i.i.d. Ұ༷ʹऔಘɽ 32 / 42
  29. ޯ഑ͷεϜʔδϯάͱͯ͠ͷݟํ ؔ਺஋ͷߋ৽ࣜ: f ∗ t (x) = f ∗ t−1

    (x) − 2η ∫ k(x, X) (f ∗ t−1 (X) − Y ) →f ∗ t−1 (X)−f o(X) dP(X, Y ) = f ∗ t−1 (x) − 2ηTk (f ∗ t−1 − f o)(x). ੵ෼࡞༻ૉ Tk ͸ߴप೾੒෼Λ཈੍͢Δ࡞༻͕͋Δɽ RKHS ಺ͷޯ഑͸ L2 ಺ͷؔ਺ޯ഑ΛTk ʹΑͬͯฏ׈Խͨ͠΋ͷʹͳ͍ͬͯ Δɽ(࣮ࡍ͸ Tk ͷαϯϓϧ͔Βͷਪఆ஋Λ࢖͏) ߴप೾੒෼͕ग़ͯ͘ΔલʹࢭΊΕ͹աֶशΛ๷͛Δɽ ˠ Early stopping ӌᮣʹ Newton ๏ͳͲΛ࢖͏ͱةݥɽ 33 / 42
  30. Early stopping ʹΑΔਖ਼ଇԽ Early stopping ʹΑΔਖ਼ଇԽ ॳظ஋ ܇࿅ޡࠩ࠷খԽݩ ʢաֶशʣ &BSMZTUPQQJOH

    όΠΞε-όϦΞϯε෼ղ ∥f o − ˆ f ∥L2(PX ) Estimation error ≤ ∥f o − ˇ f ∥L2(PX ) Approximation error (bias) + ∥ˇ f − ˆ f ∥L2(PX ) Sample deviation (variance) ܇࿅ޡࠩ࠷খԽݩʹୡ͢ΔલʹࢭΊΔ (early stopping) ͜ͱͰਖ਼ଇԽ͕ಇ͘ɽ ແݶ࣍ݩϞσϧ (RKHS) ͸աֶश͠΍͍͢ͷͰؾΛ෇͚Δඞཁ͕͋Δɽ 34 / 42
  31. ղੳʹ༻͍Δ৚݅ ௨ৗɼҎԼͷ৚݅Λߟ͑Δɽ ʢ౷ܭཧ࿦Ͱ΋ಉ༷ͷԾఆΛ՝͢ఆ൪ͷԾఆʣ (Caponnetto and de Vito, 2007, Dieuleveut et

    al., 2016, Pillaud-Vivien et al., 2018) µi = O(i−α) for α > 1. α ͸ RKHSHk ͷෳࡶ͞Λಛ௃͚ͮΔɽ(খ͍͞ α: ෳࡶɼେ͖͍ α: ୯७) f o ∈ Tr (L2 (PX )) for r > 0. f o ͕ RKHS ͔ΒͲΕ͚ͩ “͸Έग़͍ͯΔ͔” Λಛ௃͚ͮɽ r = 1/2 ͸ f o ∈ Hk ʹରԠɽ(r < 1/2: ͸Έग़ͯΔ, r ≥ 1/2: ؚ·ΕΔ) ∥f ∥L∞(PX ) ≲ ∥f ∥1−µ L2(PX ) ∥f ∥µ Hk (∀f ∈ Hk ) for µ ∈ (0, 1]. Hk ʹؚ·Ε͍ͯΔؔ਺ͷ׈Β͔͞Λಛ௃͚ͮɽ ʢখ͍͞ µ: ׈Β͔ʣ ˞ ࠷ޙͷ৚݅ʹ͍ͭͯ: f ∈ W m([0, 1]d ) (Sobolev ۭؒ) ͔ͭ PX ͷ୆͕ [0, 1]d Ͱີ౓ؔ਺Λ࣋ͪɼͦͷີ౓͕Լ͔Β͋Δఆ਺ c > 0 Ͱ཈͑ΒΕ͍ͯΕ͹ɼ µ = d/(2m) ͰͳΓͨͭɽ 35 / 42
  32. ऩଋϨʔτ όΠΞε-όϦΞϯεͷ෼ղ: ∥f o − gt∥2 L2(PX ) ≲ ∥f

    o − f ∗ t ∥2 L2(PX ) (a): Bias + ∥f ∗ t − ˆ ft∥2 L2(PX ) (b): Variance + ∥ˆ ft − gt∥2 L2(PX ) (c): SGD deviation (a) (ηt)−2r , (b) (ηt)1/α+(ηt)µ−2r n , (c) η(ηt)1/α−1 (a) ޯ഑๏ͷղͷσʔλʹؔ͢Δظ଴஋ͱਅͷؔ਺ͱͷζϨ (Bias)ɽ (b) ޯ഑๏ͷղͷ෼ࢄ (Variance)ɽ (c) ֬཰తޯ഑Λ༻͍Δ͜ͱʹΑΔมಈ. ߋ৽਺ t Λେ͖͘͢Δͱ Bias ͸ݮΔ͕ Variance ͕૿͑Δɽ͜ΕΒΛόϥϯε͢ Δඞཁ͕͋Δ (Early stopping)ɽ Theorem (Multi-pass SGD ͷऩଋϨʔτ (Pillaud-Vivien et al., 2018)) η = 1/(4 supx k(x, x)2) ͱ͢Δɽ µα < 2rα + 1 < α ͷ࣌ɼt = Θ(nα/(2rα+1)) ͱ͢Ε͹ɼ E[L(gt )] − L(f o) = O(n−2rα/(2rα+1)). µα ≥ 2rα + 1 ͷ࣌ɼt = Θ(n 1 µ (log n) 1 µ ) ͱ͢Ε͹ɼE[L(gt )] − L(f o) = O(n−2r/µ). 36 / 42
  33. Natural gradient ͷऩଋ Natural gradient (ࣗવޯ഑๏): ˆ ft = ˆ

    ft−1 − η(Σ + λI)−1(Σˆ ft−1 − E[KX Y ]). (unlabeled data ͕୔ࢁ͋Γ Σ ͸ྑ͘ਪఆͰ͖Δઃఆ; GD ͷղੳ (Murata and Suzuki, 2020)) Theorem (Natural gradient ͷऩଋ (Amari et al., 2020)) E[∥ˆ ft − f o∥2 L2(PX ) ] ≲ B(t) + V (t), ͨͩ͠ɼB(t) = exp(−ηt) ∨ (λ/(ηt))2r , V (t) = (1 + ηt) λ−1B(t) + λ− 1 α n + (1 + tη)4 (1 ∨ λ2r−µ)λ− 1 α n . ಛʹɼλ = n− α 2rα+1 , t = Θ(log(n)) Ͱ E[∥ˆ ft − f o∥2 L2(PX ) ] = O(n− 2rα 2rα+1 log(n)4). ˞ όΠΞε͸ٸ଎ʹऩଋ͢Δ͕ɼόϦΞϯε΋଎͘૿େ͢Δɽ ˠ Preconditioning ͷͨΊߴप೾੒෼͕ૣΊʹग़ݱ͢ΔɽΑΓૣΊʹࢭΊ ͳ͍ͱաֶश͢Δɽ 37 / 42
  34. ࡞༻ૉ Bernstein ͷෆ౳ࣜ Σ = Ex [Kx K∗ x ]:

    Σf = ∫ k(·, x)f (x)dPx (x) Σ = 1 n ∑ n i=1 Kxi K∗ xi : Σf = 1 n ∑ n i=1 k(·, xi )f (xi ) Σλ := Σ + λIɼF∞ (λ) := supx K∗ x Σ−1 λ Kx ͱ͢ΔɽҎԼͷΑ͏ͳධՁ͕ඞཁ: ∥Σ−1 λ (Σ − Σ)Σ−1 λ ∥ ≲ √ F∞ (λ)β n + (1 + F∞ (λ))β n with prob. 1 − δɽͨͩ͠ɼβ = log(4Tr[ΣΣ−1 λ ] δ )ɽ ˠ ܦݧ෼෍ͱਅͷ෼෍ͷͣΕΛό΢ϯυɽ Theorem (ࣗݾڞ໾࡞༻ૉͷ Bernstein ͷෆ౳ࣜ (Minsker, 2017)) (Xi )n i=1 ͸ಠཱͳࣗݾڞ໾࡞༻ૉͷ֬཰ม਺Ͱ E[Xi ] = 0 ͔ͭɼ σ2 ≥ ∥ ∑ n i=1 E[X2 i ]∥, U ≥ ∥Xi ∥ ͱ͢Δɽr(A) = Tr[A]/∥A∥ ͱͯ͠ɼ P ( n ∑ i=1 Xi ≥ t ) ≤ 14r( ∑ n i=1 E[X2 i ]) exp ( − t2 2(σ2 + tU/3) ) . Xi = Σ−1 λ Kxi K∗ xi Σ−1 λ ͱ͢Δɽ (Tropp (2012) ΋ࢀর) 39 / 42
  35. ਖ਼ଇԽ͋Γͷ֬཰త࠷దԽ ೋ৐ଛࣦΛ֦ுͯ͠ɼҰൠͷ׈Β͔ͳತଛࣦؔ਺ ℓ Λߟ͑Δɽ ʢ൑ผ໰୊ͳͲʣ ਖ਼ଇԽ͋Γͷظ଴ଛࣦ࠷খԽ: min f ∈Hk E[ℓ(Y

    , f (X))] + λ∥f ∥2 Hk =: Lλ (f ). ͜ΕΛ SGD Ͱղ͘ɽ໨తؔ਺͕ λ-ڧತͰ͋Δ͜ͱΛར༻ɽ gt+1 = gt − ηt (ℓ′(yt, gt (xt )) + λgt ). ¯ gT+1 = T+1 ∑ t=1 2(c0+t−1) (2c0+T)(T+1) gt (ଟ߲ࣜฏۉ). Ծఆɿ(i) ℓ ͸ γ-ฏ׈ɼ∥ℓ′∥∞ ≤ M, (ii) k(x, x) ≤ 1. gλ = argming∈Hk Lλ (g). Theorem (Nitanda and Suzuki (2019)) ద੾ͳ c0 > 0 ʹରͯ͠ ηt = 2/(λ(c0 + t)) ͱ͢Ε͹ɼ E[Lλ (¯ gT+1 ) − Lλ (gλ )] ≲ M2 λ(c0 + T) + γ + λ T + 1 ∥g1 − gλ ∥2 Hk . ͞ΒʹϚϧνϯήʔϧ֬཰ूதෆ౳ࣜΑΓ High probability bound ΋ಘΒΕΔɽ ൑ผ໰୊ͳΒ strong low noise condition ͷ΋ͱ൑ผޡࠩͷࢦ਺ऩଋ΋ࣔͤΔɽ40 / 42
  36. Ϛϧνϯήʔϧ Hoeffding ͷෆ౳ࣜ Theorem (Ϛϧνϯήʔϧ Hoeffding ܕूதෆ౳ࣜ (Pinelis, 1994)) ֬཰ม਺ྻ:

    D1, . . . , DT ∈ Hk ɽE[Dt ] = 0ɼ∥Dt∥Hk ≤ Rt (a.s.) ͱ͢Δɽ ∀ϵ > 0 ʹର͠ P [ max 1≤t≤T ∥ t ∑ s=1 Ds∥Hk ≥ ϵ ] ≤ 2 exp ( − ϵ2 2 ∑ T t=1 R2 t ) . Dt = E[¯ gT+1|Z1, . . . , Zt ] − E[¯ gT+1|Z1, . . . , Zt−1 ], ͨͩ͠ Zt = (xt, yt ) ͱ͢Ε͹ɼ ∑ T t=1 Dt = ¯ gT+1 − E[¯ gT+1 ] ͱͳΓɼظ଴஋ͱ࣮ ݱ஋ͷͣΕΛ཈͑ΒΕΔɽ 41 / 42
  37. Ϛϧνϯήʔϧ Hoeffding ͷෆ౳ࣜ Theorem (Ϛϧνϯήʔϧ Hoeffding ܕूதෆ౳ࣜ (Pinelis, 1994)) ֬཰ม਺ྻ:

    D1, . . . , DT ∈ Hk ɽE[Dt ] = 0ɼ∥Dt∥Hk ≤ Rt (a.s.) ͱ͢Δɽ ∀ϵ > 0 ʹର͠ P [ max 1≤t≤T ∥ t ∑ s=1 Ds∥Hk ≥ ϵ ] ≤ 2 exp ( − ϵ2 2 ∑ T t=1 R2 t ) . Dt = E[¯ gT+1|Z1, . . . , Zt ] − E[¯ gT+1|Z1, . . . , Zt−1 ], ͨͩ͠ Zt = (xt, yt ) ͱ͢Ε͹ɼ ∑ T t=1 Dt = ¯ gT+1 − E[¯ gT+1 ] ͱͳΓɼظ଴஋ͱ࣮ ݱ஋ͷͣΕΛ཈͑ΒΕΔɽ (ิ଍) Lλ ͸ RKHS ϊϧϜʹؔͯ͠ λ-ڧತͰ͋Δ͜ͱΑΓɼ ∥¯ gT+1 − gλ ∥Hk ≤ O( 1 λ2T ) ͕ߴ͍֬཰Ͱ੒Γཱͭɽ࣮͸ ∥ · ∥∞ ≤ ∥ · ∥Hk Ͱ΋͋ΔͷͰɼ |P(Y = 1|X) − P(Y = −1|X)| ≥ δ ͳΔϚʔδϯ৚݅ (strong low noise condition) ͷ΋ͱɼ׬શͳ൑ผ͕ߴ͍֬཰ͰͰ͖ΔΑ͏ʹͳΔɽ 41 / 42
  38. Z. Allen-Zhu and E. Hazan. Variance reduction for faster non-convex

    optimization. arXiv preprint arXiv:1603.05643, 2016. S. Amari, J. Ba, R. Grosse, X. Li, A. Nitanda, T. Suzuki, D. Wu, and J. Xu. When does preconditioning help or hurt generalization?, 2020. A. Caponnetto and E. de Vito. Optimal rates for regularized least-squares algorithm. Foundations of Computational Mathematics, 7(3):331–368, 2007. A. R. Conn, N. I. Gould, and P. L. Toint. Trust region methods, volume 1. Siam, 2000. W. Deng and W. Yin. On the global and linear convergence of the generalized alternating direction method of multipliers. Technical report, Rice University CAAM TR12-14, 2012. A. Dieuleveut, F. Bach, et al. Nonparametric stochastic approximation with large step-sizes. The Annals of Statistics, 44(4):1363–1399, 2016. D. Gabay and B. Mercier. A dual algorithm for the solution of nonlinear variational problems via finite-element approximations. Computers & Mathematics with Applications, 2:17–40, 1976. R. Ge, F. Huang, C. Jin, and Y. Yuan. Escaping from saddle points—online stochastic gradient for tensor decomposition. In Proceedings of The 28th Conference on Learning Theory, pages 797–842, 2015. 42 / 42
  39. S. Ghadimi and G. Lan. Stochastic first-and zeroth-order methods for

    nonconvex stochastic programming. SIAM Journal on Optimization, 23(4):2341–2368, 2013. B. He and X. Yuan. On the O(1/n) convergence rate of the Douglas-Rachford alternating direction method. SIAM J. Numerical Analisis, 50(2):700–709, 2012. M. Hestenes. Multiplier and gradient methods. Journal of Optimization Theory & Applications, 4:303–320, 1969. M. Hong and Z.-Q. Luo. On the linear convergence of the alternating direction method of multipliers. arXiv preprint arXiv:1208.3922, 2012a. M. Hong and Z.-Q. Luo. On the linear convergence of the alternating direction method of multipliers. Technical report, 2012b. arXiv:1208.3922. C. Jin, R. Ge, P. Netrapalli, S. M. Kakade, and M. I. Jordan. How to escape saddle points efficiently. In International Conference on Machine Learning, pages 1724–1732, 2017a. C. Jin, P. Netrapalli, and M. I. Jordan. Accelerated gradient descent escapes saddle points faster than gradient descent. arXiv preprint arXiv:1711.10456, 2017b. Z. Li. Ssrgd: Simple stochastic recursive gradient descent for escaping saddle points. In Advances in Neural Information Processing Systems, pages 1523–1533, 2019. 42 / 42
  40. Z. Li and J. Li. A simple proximal stochastic gradient

    method for nonsmooth nonconvex optimization. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, editors, Advances in Neural Information Processing Systems 31, pages 5564–5574. Curran Associates, Inc., 2018. URL http://papers.nips.cc/paper/ 7800-a-simple-proximal-stochastic-gradient-method-for-nonsmooth-n pdf. Y. Liu, F. Shang, and J. Cheng. Accelerated variance reduced stochastic admm. In Thirty-First AAAI Conference on Artificial Intelligence, pages 2287–2293, 2017. S. Minsker. On some extensions of Bernstein’s inequality for self-adjoint operators. Statistics & Probability Letters, 127:111–119, 2017. J. F. Mota, J. M. Xavier, P. M. Aguiar, and M. P¨ uschel. A proof of convergence for the alternating direction method of multipliers applied to polyhedral-constrained functions. arXiv preprint arXiv:1112.2295, 2011. T. Murata and T. Suzuki. Gradient descent in rkhs with importance labeling, 2020. L. M. Nguyen, J. Liu, K. Scheinberg, and M. Tak´ aˇ c. SARAH: A novel method for machine learning problems using stochastic recursive gradient. In D. Precup and Y. W. Teh, editors, Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pages 2613–2621, International Convention Centre, Sydney, Australia, 06–11 42 / 42
  41. Aug 2017. PMLR. URL http://proceedings.mlr.press/v70/nguyen17b.html. A. Nitanda and T. Suzuki.

    Stochastic gradient descent with exponential convergence rates of expected classification errors. In K. Chaudhuri and M. Sugiyama, editors, Proceedings of Machine Learning Research, volume 89 of Proceedings of Machine Learning Research, pages 1417–1426. PMLR, 16–18 Apr 2019. URL http://proceedings.mlr.press/v89/nitanda19a.html. H. Ouyang, N. He, L. Q. Tran, and A. Gray. Stochastic alternating direction method of multipliers. In Proceedings of the 30th International Conference on Machine Learning, 2013. N. H. Pham, L. M. Nguyen, D. T. Phan, and Q. Tran-Dinh. Proxsarah: An efficient algorithmic framework for stochastic composite nonconvex optimization. Journal of Machine Learning Research, 21(110):1–48, 2020. L. Pillaud-Vivien, A. Rudi, and F. Bach. Statistical optimality of stochastic gradient descent on hard learning problems through multiple passes. In Advances in Neural Information Processing Systems, pages 8114–8124, 2018. I. Pinelis. Optimum bounds for the distributions of martingales in banach spaces. The Annals of Probability, pages 1679–1706, 1994. M. Powell. A method for nonlinear constraints in minimization problems. In R. Fletcher, editor, Optimization, pages 283–298. Academic Press, London, New York, 1969. 42 / 42
  42. S. J. Reddi, A. Hefny, S. Sra, B. P´ ocz´

    os, and A. Smola. Stochastic variance reduction for nonconvex optimization. arXiv preprint arXiv:1603.06160, 2016. R. T. Rockafellar. Augmented Lagrangians and applications of the proximal point algorithm in convex programming. Mathematics of Operations Research, 1: 97–116, 1976. I. Steinwart and C. Scovel. Mercer ʟ s theorem on general domains: on the interaction between measures, kernels, and RKHSs. Constructive Approximation, 35(3):363–417, 2012. J. Sun, Q. Qu, and J. Wright. When are nonconvex problems not scary? arXiv preprint arXiv:1510.06096, 2015. T. Suzuki. Dual averaging and proximal gradient descent for online alternating direction multiplier method. In Proceedings of the 30th International Conference on Machine Learning, pages 392–400, 2013. T. Suzuki. Stochastic dual coordinate ascent with alternating direction method of multipliers. In Proceedings of the 31th International Conference on Machine Learning, pages 736–744, 2014. J. A. Tropp. User-friendly tools for random matrices: An introduction. Technical report, 2012. S. Zheng and J. T. Kwok. Fast-and-light stochastic ADMM. In Proceedings of the Twenty-Fifth International Joint Conference on Artificial Intelligence, pages 2407–2413, 2016. 42 / 42