
Proximal Gradient Methods and Stochastic Gradient Descent

Optimization Theory in Machine Learning and Its Learning-Theoretic Aspects @ Combinatorial Optimization Seminar 2020

Taiji Suzuki

June 18, 2023



Transcript

  1. Outline. 1 Basic formulation of statistical learning. 2 Uniform bounds: basic inequalities; Rademacher complexity and the Dudley integral; local Rademacher complexity. 3 Optimality: admissibility; minimax optimality. 4 Bayesian learning theory. 5 Optimization in machine learning and the proximal gradient method. 6 Overview of stochastic optimization. 7 Online stochastic optimization: stochastic gradient descent; Nesterov's acceleration for SGD. 8 Batch stochastic optimization: stochastic variance reduced gradient methods. 9 Appendix: Convex analysis; Duality. 3 / 167
  2. Outline (repeated). 1 Basic formulation of statistical learning. 2 Uniform bounds: basic inequalities; Rademacher complexity and the Dudley integral; local Rademacher complexity. 3 Optimality: admissibility; minimax optimality. 4 Bayesian learning theory. 5 Optimization in machine learning and the proximal gradient method. 6 Overview of stochastic optimization. 7 Online stochastic optimization: stochastic gradient descent; Nesterov's acceleration for SGD. 8 Batch stochastic optimization: stochastic variance reduced gradient methods. 9 Appendix: Convex analysis; Duality. 4 / 167
  3. Problem settings in machine learning. Supervised learning: data are given as pairs of an input and its label; the task is to predict the label of a new input. Examples: regression, classification. Unsupervised learning: the data carry no labels. Examples: clustering, source separation, anomaly detection. Semi-supervised learning: only part of the data is labeled. Reinforcement learning: the learner collects data by itself through trial and error. 5 / 167
  4. Workflow of machine learning. Feature extraction: convert an object such as an image into a vector x = (x1, x2, ..., xd) by some means (domain-specific know-how). Once the data are turned into feature vectors, the rest is a statistical problem: analysis, parameter estimation, and construction of a prediction model y = f(x; θ) (θ: model parameters; supervised learning). Note: deep learning builds the feature-extraction step into learning itself by designing the network architecture. 6 / 167
  5. Formulation via loss functions. Both supervised and unsupervised learning can be formulated as minimization of a loss function. A parameter θ ∈ Θ describes the structure of the data (Θ is the hypothesis set, i.e., the model); "learning" ≈ estimating θ. Loss function: how well the parameter θ explains a data point z: ℓ(z, θ). Generalization error (expected error): the expectation of the loss → minimizing the generalization error ≈ "learning": min_{θ∈Θ} E_Z[ℓ(Z, θ)]. Training error (empirical error): substitute the observed data, min_{θ∈Θ} (1/n)∑_{i=1}^n ℓ(z_i, θ). Note: the gap between the training error and the generalization error is what makes optimization in machine learning special. 7 / 167
  6. Example of a model (supervised): regression. z = (x, y) ∈ R^{d+1}, ℓ(z, θ) = (y − θ⊤x)² (squared error). min_{θ∈R^d} (1/n)∑_{i=1}^n ℓ(z_i, θ) = min_{θ∈R^d} (1/n)∑_{i=1}^n (y_i − θ⊤x_i)² (least squares). 8 / 167
  7. Loss functions for supervised learning (regression). Loss of f = x⊤θ at a data point z = (x, y). Squared loss: ℓ(y, f) = (1/2)(y − f)². τ-quantile loss: ℓ(y, f) = (1 − τ) max{f − y, 0} + τ max{y − f, 0}, with τ ∈ (0, 1); used for quantile regression. ϵ-sensitive loss: ℓ(y, f) = max{|y − f| − ϵ, 0}, with ϵ > 0; used for support vector regression. (Figure: the squared, τ-quantile, ϵ-sensitive, and Huber losses as functions of f − y.) 9 / 167
  8. Loss functions for supervised learning (classification), y ∈ {±1}. Logistic loss: ℓ(y, f) = log((1 + exp(−yf))/2). Hinge loss: ℓ(y, f) = max{1 − yf, 0}. Exponential loss: ℓ(y, f) = exp(−yf). Smoothed hinge loss: ℓ(y, f) = 0 (yf ≥ 1), 1/2 − yf (yf < 0), (1/2)(1 − yf)² (otherwise). (Figure: these losses and the 0-1 loss as functions of the margin yf.) 10 / 167
  9. Regularization. Plain loss (negative log-likelihood) minimization: min_β ∑_{i=1}^n ℓ(y_i, β⊤x_i). Regularized loss minimization: min_β ∑_{i=1}^n ℓ(y_i, β⊤x_i) + ψ(β), where ψ(β) is the regularization term. Examples of regularization terms: ridge regularization (ℓ2-regularization): ψ(β) = λ∥β∥²₂; ℓ1-regularization: ψ(β) = λ∥β∥₁; trace norm regularization: ψ(W) = Tr[(W⊤W)^{1/2}] (W ∈ R^{N×M}: a matrix). 12 / 167
  10. Regularization (continued). Same setting as above. The regularization term suppresses the variance and prevents overfitting, at the price of an additional bias → the regularization strength λ must be chosen appropriately. 12 / 167
  11. Example of regularization: ridge regularization and overfitting. Polynomial regression (degree-15 polynomial): min_{β∈R^15} (1/n)∑_{i=1}^n {y_i − (β₁x_i + β₂x_i² + · · · + β₁₅x_i^15)}² + λ∥β∥²₂. 13 / 167
  12. Example of regularization: ℓ1-regularization (sparse estimation). β̂ = argmin_{β∈R^p} (1/n)∑_{i=1}^n (y_i − x_i⊤β)² + λ∑_{j=1}^p |β_j|. R. Tibshirani (1996). Regression shrinkage and selection via the lasso. J. Royal. Statist. Soc B., Vol. 58, No. 1, pages 267–288. 14 / 167
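As a concrete illustration of the ℓ1-regularized estimator above, here is a minimal scikit-learn sketch; the synthetic data and the value of `alpha` (which plays the role of λ) are purely illustrative and not from the slides.

```python
# Minimal Lasso sketch with scikit-learn; synthetic data for illustration only.
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n, p = 100, 50
X = rng.standard_normal((n, p))
beta_true = np.zeros(p)
beta_true[:5] = 1.0                      # sparse ground truth (d = 5 nonzeros)
y = X @ beta_true + 0.1 * rng.standard_normal(n)

model = Lasso(alpha=0.1)                 # alpha corresponds to the regularization strength lambda
model.fit(X, y)
print("number of nonzero coefficients:", np.count_nonzero(model.coef_))
```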
  13. Benefit of sparsity. y_i = x_i⊤β* + ϵ_i (i = 1, ..., n), where β* is the true vector. β̂ = argmin_{β∈R^p} (1/n)∑_{i=1}^n (y_i − x_i⊤β)² + λ∑_{j=1}^p |β_j|. Let x_i ∈ R^p (p dimensions) and d = ∥β*∥₀ (the number of true nonzero components). Theorem (convergence rate of the Lasso): under certain conditions there exists a constant C such that ∥β̂ − β*∥²₂ ≤ C d log(p)/n. Note: the ambient dimension p enters only through O(log(p)); the effective dimension d is what dominates. (Lasso) d log(p)/n ≪ p/n (least squares). 15 / 167
  14. Restricted eigenvalue condition. Let A = (1/n)X⊤X. Definition (restricted eigenvalue condition RE(k′, C)): φ_RE(k′, C) = φ_RE(k′, C, A) := inf_{J⊆{1,...,p}, v∈R^p: |J|≤k′, C∥v_J∥₁ ≥ ∥v_{J^c}∥₁} v⊤Av / ∥v_J∥²₂ satisfies φ_RE > 0. This is the smallest eigenvalue restricted to nearly sparse vectors. It suffices that it holds for k′ = 2d. It can be shown to hold with high probability for random X: Johnson–Lindenstrauss lemma (Johnson et al., 1986, Dasgupta and Gupta, 1999, Rudelson and Zhou, 2013). (Figure: restriction to a sparse support J; small correlation, near linear independence.) 16 / 167
  15. Uniform bounds. L(f) = E[ℓ(Y, f(X))], L̂(f) = (1/n)∑_{i=1}^n ℓ(y_i, f(x_i)). (Figure: the population risk L(f) and the empirical risk L̂(f) around f*.) 17 / 167
  16. Uniform bounds (continued). Some hypothesis may fit the sample well just "by chance" (overfitting). In fact, when F is complex there are examples where the empirical risk minimizer f̂ fails to converge. (Figure: L(f), L̂(f), and the empirical minimizer f̂.) 17 / 167
  17. Rademacher complexity (uniform bound). L(f̂) − L̂(f̂) ≤ sup_{f∈F} { L(f) − L̂(f) } ≤ (?). Rademacher complexity: let ϵ₁, ϵ₂, ..., ϵ_n be Rademacher variables, i.e., P(ϵ_i = 1) = P(ϵ_i = −1) = 1/2, and define R(ℓ∘F) := E_{ϵ_i, x_i}[ sup_{f∈F} (1/n)∑_{i=1}^n ϵ_i ℓ(y_i, f(x_i)) ]. Symmetrization (bound in expectation): E[ sup_{f∈F} |L(f) − L̂(f)| ] ≤ 2R(ℓ∘F). Controlling the Rademacher complexity yields a uniform bound! Typically R(ℓ∘F) ≤ O(1/√n). Example: if F = {f(x) = x⊤β | β ∈ R^d, ∥β∥ ≤ 1} and ℓ is 1-Lipschitz continuous, then R(ℓ∘F) ≤ O(√(d/n)). 18 / 167
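For the linear class F = {x ↦ x⊤β : ∥β∥ ≤ 1} in the example above, the supremum over β has a closed form (it equals ∥(1/n)∑_i ϵ_i x_i∥), so the empirical Rademacher complexity of F itself (without the loss) can be estimated by Monte Carlo. A small numpy sketch, illustrative only:

```python
# Monte Carlo estimate of the empirical Rademacher complexity of
# F = {x -> <x, beta> : ||beta|| <= 1}; here sup_beta (1/n) sum_i eps_i <x_i, beta>
# equals the Euclidean norm of (1/n) sum_i eps_i x_i (Cauchy-Schwarz).
import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 10
X = rng.standard_normal((n, d))

def empirical_rademacher(X, n_trials=2000, rng=rng):
    n = X.shape[0]
    vals = []
    for _ in range(n_trials):
        eps = rng.choice([-1.0, 1.0], size=n)        # Rademacher variables
        vals.append(np.linalg.norm(X.T @ eps / n))   # closed-form supremum over the unit ball
    return np.mean(vals)

print(empirical_rademacher(X))   # roughly of order sqrt(d / n)
```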
  18. Covering number (reference). Useful for bounding the Rademacher complexity. Covering number: a measure of the complexity/capacity of the hypothesis set F. The ϵ-covering number N(F, ϵ, d) is the minimal number of balls of radius ϵ in the norm d needed to cover F, i.e., the minimal number of elements needed to approximate F to accuracy ϵ. Theorem (Dudley integral): with ∥f∥²_n := (1/n)∑_{i=1}^n f(x_i)², R(F) ≤ (C/√n) E_{D_n}[ ∫₀^∞ √(log N(F, ϵ, ∥·∥_n)) dϵ ]. 19 / 167
  19. Local Rademacher complexity (reference). Local Rademacher complexity: R_δ(F) := R({f ∈ F | E[(f − f*)²] ≤ δ}). Assume the following: F is bounded by 1, ∥f∥_∞ ≤ 1 (∀f ∈ F); ℓ is Lipschitz continuous and strongly convex, E[ℓ(Y, f(X))] − E[ℓ(Y, f*(X))] ≥ B E[(f − f*)²] (∀f ∈ F). Theorem (fast learning rate (Bartlett et al., 2005)): with δ* = inf{δ | δ ≥ R_δ(F)}, with probability 1 − e^{−t}, L(f̂) − L(f*) ≤ C ( δ* + t/n ). δ* ≤ R(F) always holds (see the figure on the slide). This is called a fast learning rate. 20 / 167
  20. Regularization and optimization. Regularization by restricting the model; regularization by early stopping. (Figure: true function, model, estimator, bias and variance; the early-stopping path stops before the training-error minimizer, which overfits.) Bias–variance decomposition: ∥f° − f̂∥_{L2(P_X)} (estimation error) ≤ ∥f° − f̌∥_{L2(P_X)} (approximation error, bias) + ∥f̌ − f̂∥_{L2(P_X)} (sample deviation, variance). Stopping before reaching the minimizer of the training error (early stopping) acts as regularization → a standard tool in deep learning and boosting. 21 / 167
  21. Avoiding overfitting by early stopping. Hands-On Machine Learning with Scikit-Learn and TensorFlow by Aurélien Géron. Chapter 4. Training Models. https://www.oreilly.com/library/view/hands-on-machine-learning/9781491962282/ch04.html 22 / 167
  22. Outline (repeated). 1 Basic formulation of statistical learning. 2 Uniform bounds. 3 Optimality. 4 Bayesian learning theory. 5 Optimization in machine learning and the proximal gradient method. 6 Overview of stochastic optimization. 7 Online stochastic optimization: stochastic gradient descent; Nesterov's acceleration for SGD. 8 Batch stochastic optimization: stochastic variance reduced gradient methods. 9 Appendix: Convex analysis; Duality. 24 / 167
  23. Regularized learning. Training error minimization: min_{x∈R^p} (1/n)∑_{i=1}^n ℓ(z_i, x). Regularized training error minimization: min_{x∈R^p} (1/n)∑_{i=1}^n ℓ(z_i, x) + ψ(x). For now, assume that ℓ and ψ are convex functions. 25 / 167
  24. Smoothness and strong convexity. Definition. Smoothness: the gradient is Lipschitz continuous, ∥∇f(x) − ∇f(x′)∥ ≤ L∥x − x′∥ ⇒ f(x) ≤ f(y) + ⟨x − y, ∇f(y)⟩ + (L/2)∥x − y∥². Strong convexity: ∀θ ∈ (0, 1), ∀x, y ∈ dom(f), (μ/2)θ(1 − θ)∥x − y∥² + f(θx + (1 − θ)y) ≤ θf(x) + (1 − θ)f(y). (Figure: smooth but not strongly convex / smooth and strongly convex / strongly convex but not smooth.) 26 / 167
  25. Duality between smoothness and strong convexity. Smoothness and strong convexity are dual to each other. Theorem: let f : R^p → R̄ be a proper closed convex function; then f is L-smooth ⇐⇒ f* is 1/L-strongly convex. (Figure: the logistic loss, smooth but not strongly convex, and its dual function, strongly convex but not smooth (gradient → ∞).) f*(y) = sup_{x∈R^p} {⟨x, y⟩ − f(x)}. 28 / 167
  26. First-order optimization methods. Methods that use only the function value f(x) and a (sub)gradient g ∈ ∂f(x). Each update is computationally cheap, which makes them useful for high-dimensional optimization problems. Newton's method is a second-order method. 29 / 167
  27. Steepest descent. Let f(x) = ∑_{i=1}^n ℓ(z_i, x) and consider min_x f(x). (Sub)gradient method, differentiable f: x_t = x_{t−1} − η_t ∇f(x_{t−1}). 30 / 167
  28. Steepest descent (continued). (Sub)gradient method, subdifferentiable f: g_t ∈ ∂f(x_{t−1}), x_t = x_{t−1} − η_t g_t. 30 / 167
  29. Steepest descent (equivalent formulation). For subdifferentiable f: x_t = x_{t−1} − η_t g_t = argmin_x { (1/(2η_t))∥x − (x_{t−1} − η_t g_t)∥² } = argmin_x { ⟨x, g_t⟩ + (1/(2η_t))∥x − x_{t−1}∥² }, where g_t ∈ ∂f(x_{t−1}). Proximal point algorithm: x_t = argmin_x { f(x) + (1/(2η_t))∥x − x_{t−1}∥² }. General case: f(x_t) − f(x*) ≤ ∥x₀ − x*∥² / (2∑_{k=1}^t η_k). If f is strongly convex: f(x_t) − f(x*) ≤ (1/(2η)) (1/(1 + ση))^{t−1} ∥x₀ − x*∥². 30 / 167
  30. ★ Proximal gradient method. Let f(x) = ∑_{i=1}^n ℓ(z_i, x) and consider min_x f(x) + ψ(x). Proximal gradient method: x_t = argmin_x { ⟨x, g_t⟩ + ψ(x) + (1/(2η_t))∥x − x_{t−1}∥² } = argmin_x { η_t ψ(x) + (1/2)∥x − (x_{t−1} − η_t g_t)∥² }, where g_t ∈ ∂f(x_{t−1}). The update is given by the proximal mapping: prox(q | ψ̃) = argmin_x { ψ̃(x) + (1/2)∥x − q∥² }. → The proximal mapping sidesteps the unpleasant properties of the regularizer (such as non-differentiability). 32 / 167
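A minimal sketch of the iteration above, with the smooth part and the proximal mapping passed in as user-supplied functions (the names `grad_f` and `prox_psi` are placeholders, not from the slides):

```python
# Proximal gradient method: x_t = prox(x_{t-1} - eta * grad_f(x_{t-1}) | eta * psi).
import numpy as np

def proximal_gradient(grad_f, prox_psi, x0, eta, n_iter=100):
    """grad_f(x): gradient (or subgradient) of the smooth part f.
    prox_psi(q, s): proximal mapping of s * psi evaluated at q."""
    x = np.asarray(x0, dtype=float)
    for _ in range(n_iter):
        x = prox_psi(x - eta * grad_f(x), eta)   # gradient step, then prox step
    return x
```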
  31. Example of a proximal mapping 1: L1 regularization, ψ(x) = λ∥x∥₁. x_t = argmin_x { λη_t∥x∥₁ + (1/2)∥x − q_t∥² }, where q_t := x_{t−1} − η_t g_t, = argmin_x { ∑_{j=1}^p [ λη_t|x_j| + (1/2)(x_j − q_{t,j})² ] }, which separates over coordinates! Hence x_{t,j} = ST_{λη_t}(q_{t,j}) for the j-th component, where ST is the soft-thresholding function: ST_C(q) = sign(q) max{|q| − C, 0}. → Unimportant components are set exactly to 0. 33 / 167
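The soft-thresholding operator and the resulting proximal gradient update for L1 regularization look as follows; a sketch in which `A` and `b` define a quadratic loss f(x) = ½∥Ax − b∥² used only for illustration:

```python
# Soft-thresholding: ST_C(q) = sign(q) * max(|q| - C, 0), applied coordinatewise.
import numpy as np

def soft_threshold(q, c):
    return np.sign(q) * np.maximum(np.abs(q) - c, 0.0)

def lasso_prox_grad(A, b, lam, eta, n_iter=500):
    # f(x) = 0.5 * ||Ax - b||^2, psi(x) = lam * ||x||_1.
    # eta should be at most 1 / ||A||_op^2 for the step to be stable.
    x = np.zeros(A.shape[1])
    for _ in range(n_iter):
        grad = A.T @ (A @ x - b)
        x = soft_threshold(x - eta * grad, eta * lam)   # prox of eta*lam*||.||_1
    return x
```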
  32. Example of a proximal mapping 2: trace norm, ψ(X) = λ∥X∥_tr = λ∑_j σ_j(X) (the sum of the singular values). Taking the singular value decomposition X_{t−1} − η_t G_t = U diag(σ₁, ..., σ_d) V⊤, we obtain X_t = U diag(ST_{λη_t}(σ₁), ..., ST_{λη_t}(σ_d)) V⊤. 34 / 167
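The proximal mapping of the trace norm is singular-value soft-thresholding; a numpy sketch, illustrative only:

```python
# prox of c * (trace norm): soft-threshold the singular values of the input matrix.
import numpy as np

def prox_trace_norm(Q, c):
    U, s, Vt = np.linalg.svd(Q, full_matrices=False)
    s_thr = np.maximum(s - c, 0.0)       # ST_c applied to each singular value
    return (U * s_thr) @ Vt              # reassemble U diag(s_thr) V^T
```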
  33. Convergence of the proximal gradient method. Strong convexity and smoothness determine the convergence rate; x_t = prox(x_{t−1} − η_t g_t | η_t ψ). Rate depending on the properties of f: γ-smooth and μ-strongly convex: exp(−t μ/γ); γ-smooth, not strongly convex: γ/t; non-smooth, μ-strongly convex: 1/(μt); non-smooth, not strongly convex: 1/√t. The step size η_t must be chosen appropriately: smooth: η_t = 1/γ (both cases); non-smooth: η_t = 2/(μt) (strongly convex), 1/√t (not strongly convex). To obtain the optimal rate, the iterates {x_t}_t must be averaged appropriately: Polyak–Ruppert averaging, polynomial averaging. 35 / 167
  34. Convergence of the proximal gradient method (with acceleration). Same setting and step sizes as above; for a smooth loss, Nesterov's acceleration improves the rate to the optimal one: γ-smooth and μ-strongly convex: exp(−t√(μ/γ)); γ-smooth, not strongly convex: γ/t²; the non-smooth rates 1/(μt) and 1/√t are unchanged. 35 / 167
  35. Nesterov's acceleration (non-strongly convex case), min_x {f(x) + ψ(x)}. Assumption: f(x) is γ-smooth. Nesterov's accelerated method: set s₁ = 1 and η = 1/γ; for t = 1, 2, ... repeat: (1) with g_t = ∇f(y_t), update x_t = prox(y_t − η g_t | ηψ); (2) set s_{t+1} = (1 + √(1 + 4s_t²))/2; (3) update y_{t+1} = x_t + ((s_t − 1)/s_{t+1})(x_t − x_{t−1}). If f is γ-smooth, then f(x_t) − f(x*) ≤ 2γ∥x₀ − x*∥²/t². Also known as the Fast Iterative Shrinkage Thresholding Algorithm (FISTA) (Beck and Teboulle, 2009). The step size η = 1/γ can be determined by backtracking. The "momentum" methods used in deep learning are similar in spirit (Sutskever et al., 2013). 36 / 167
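A direct transcription of the three steps above into code (a sketch; `grad_f` and `prox_psi` are as in the earlier proximal gradient sketch, with η = 1/γ assumed known):

```python
# FISTA / Nesterov acceleration for min_x f(x) + psi(x), with f gamma-smooth.
import numpy as np

def fista(grad_f, prox_psi, x0, eta, n_iter=100):
    x_prev = np.asarray(x0, dtype=float)
    y = x_prev.copy()
    s = 1.0
    for _ in range(n_iter):
        x = prox_psi(y - eta * grad_f(y), eta)            # step 1: prox-gradient at y_t
        s_next = (1.0 + np.sqrt(1.0 + 4.0 * s**2)) / 2.0  # step 2: update s_{t+1}
        y = x + ((s - 1.0) / s_next) * (x - x_prev)       # step 3: momentum extrapolation
        x_prev, s = x, s_next
    return x_prev
```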
  36. Interpretation of Nesterov's acceleration. Many interpretations of acceleration have been proposed; among them, the recent one by Ahn (2020) is easy to understand. Proximal point algorithm: x_{t+1} = argmin_x { f(x) + (1/(2η_{t+1}))∥x − x_t∥² } (good convergence). → Two approximations of f around y_t, with g_t = ∇f(y_t): f(y_t) + ⟨g_t, x − y_t⟩ ≤ f(x) ≤ f(y_t) + ⟨g_t, x − y_t⟩ + (γ/2)∥x − y_t∥². Alternating optimization using the two approximations: z_{t+1} = argmin_z { f(y_t) + ⟨∇f(y_t), z − y_t⟩ + (1/(2η_{t+1}))∥z − z_t∥² }, y_{t+1} = argmin_y { f(y_t) + ⟨∇f(y_t), y − y_t⟩ + (γ/2)∥y − y_t∥² + (1/(2η_{t+1}))∥y − z_{t+1}∥² }. Equivalently, y_t is the weighted average of z_t and x_t (weights 1/η_t and γ), z_{t+1} = z_t − η_{t+1}∇f(y_t), x_{t+1} = y_t − (1/γ)∇f(y_t); the same scheme can be written with z_{t+1} = x_{t+1} + γη_t (x_{t+1} − x_t). ◦ Choosing (γη_{t+1} + 1/2)² = (γη_t + 1)² + 1/4 recovers the original update (η_t = Θ(t)). ◦ The alternating form also achieves O(1/t²). 38 / 167
  37. Nesterov's acceleration (strongly convex case): restart schemes. After running Nesterov's accelerated method for a while, re-initialize and restart. There is also a directly accelerated version, but the restart version requires only weaker conditions (one-point strong convexity), is easier to understand, and is easier to implement. Restart criteria: restart after t ≥ √(8γ/μ) updates (so that the excess risk is halved); restart when the objective value increases; restart when (y_{t+1} − x_{t−1})⊤(x_t − x_{t−1}) ≥ 0. The second and third criteria are heuristics. The resulting convergence rate is exp(−t√(μ/γ)). 41 / 167
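One of the heuristic criteria above (restart when the objective value increases) can be wrapped around the earlier FISTA sketch; a hedged sketch, with `objective` a user-supplied function evaluating P(x) = f(x) + ψ(x):

```python
# Function-value restart: run accelerated iterations, resetting the momentum
# whenever the objective goes up (heuristic criterion from the slide).
import numpy as np

def fista_with_restart(grad_f, prox_psi, objective, x0, eta, n_iter=100):
    x_prev = np.asarray(x0, dtype=float)
    y = x_prev.copy()
    s = 1.0
    for _ in range(n_iter):
        x = prox_psi(y - eta * grad_f(y), eta)
        if objective(x) > objective(x_prev):      # objective increased -> restart
            y, s = x_prev.copy(), 1.0             # reset momentum, keep previous point
            continue
        s_next = (1.0 + np.sqrt(1.0 + 4.0 * s**2)) / 2.0
        y = x + ((s - 1.0) / s_next) * (x - x_prev)
        x_prev, s = x, s_next
    return x_prev
```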
  38. (Figure: Nesterov's acceleration vs. normal gradient descent; objective gap P(w) − P(w*) versus the number of iterations for Prox-Grad, Nesterov (normal), and Nesterov (restart), on a Lasso problem with n = 8,000, p = 500.) 42 / 167
  39. Outline (repeated). 1 Basic formulation of statistical learning. 2 Uniform bounds. 3 Optimality. 4 Bayesian learning theory. 5 Optimization in machine learning and the proximal gradient method. 6 Overview of stochastic optimization. 7 Online stochastic optimization: stochastic gradient descent; Nesterov's acceleration for SGD. 8 Batch stochastic optimization: stochastic variance reduced gradient methods. 9 Appendix: Convex analysis; Duality. 43 / 167
  40. Outline (repeated). Same outline as above. 44 / 167
  41. History of stochastic optimization in machine learning. 1951 Robbins and Monro: stochastic approximation (root-finding problems). 1957 Rosenblatt: the perceptron. 1978–1983 Nemirovskii and Yudin: robust strategies and optimality for non-smooth functions. 1988–1992 Ruppert; Polyak and Juditsky: step-size and averaging strategies for smooth functions. 1998–2004 Bottou; Bottou and LeCun: large-scale machine learning via online stochastic optimization. 2009–2012 Singer and Duchi; Duchi et al.; Xiao: FOBOS, AdaGrad, RDA. 2012– Le Roux et al.: batch-type methods, linear convergence. 2013 Shalev-Shwartz and Zhang; Johnson and Zhang: SAG, SDCA, SVRG. 2016 Allen-Zhu: Katyusha, acceleration of batch-type methods. 2017– development of various non-convex optimization methods. 45 / 167
  42. What is stochastic optimization? Objective: F(x) = E_Z[ℓ(Z, x)]. Methods that minimize F when only samples of ℓ(Z, x), and not F itself, are available (stochastic programming), or methods that deliberately use randomness to optimize F. In machine learning one often uses randomness on purpose even when F can be computed explicitly. Online type: data arrive one after another, and each training example is basically used only once; min_x E_Z[ℓ(Z, x)]. Batch type: the number of data points is fixed; the training data may be reused many times, but n is assumed large and we want to avoid evaluating sums ∑_{i=1}^n as much as possible; min_x (1/n)∑_{i=1}^n ℓ(z_i, x). 46 / 167
  43. Objective of online stochastic optimization. ℓ(z, x) is the loss of the parameter x on the observation z. (Expected loss) L(x) = E_Z[ℓ(Z, x)], or (regularized expected loss) L_ψ(x) = E_Z[ℓ(Z, x)] + ψ(x). The distribution of Z depends on the situation: the true distribution (i.e., L(x) = ∫ℓ(Z, x)dP(Z)) → L(x) is the generalization error, and online optimization is itself learning! The empirical distribution of data stored in huge storage (i.e., L(x) = (1/n)∑_{i=1}^n ℓ(z_i, x)) → L (or L_ψ) is the (regularized) training error. 47 / 167
  44. Outline (repeated). 1 Basic formulation of statistical learning. 2 Uniform bounds. 3 Optimality. 4 Bayesian learning theory. 5 Optimization in machine learning and the proximal gradient method. 6 Overview of stochastic optimization. 7 Online stochastic optimization: stochastic gradient descent; Nesterov's acceleration for SGD. 8 Batch stochastic optimization: stochastic variance reduced gradient methods. 9 Appendix: Convex analysis; Duality. 48 / 167
  45. Stochastic gradient descent (SGD), without regularization. Observe z_t ∼ P(Z) and set ℓ_t(x) := ℓ(z_t, x) (this is the only point that differs from the ordinary gradient method). Compute a subgradient of the loss: g_t ∈ ∂_x ℓ_t(x_{t−1}). Update x: x_t = x_{t−1} − η_t g_t. Only one data point z_t needs to be observed per iteration → O(1) computation per iteration (the full-gradient method costs O(n)); the whole data set {z_i}_{i=1}^n is not needed. Reminder: prox(q|ψ) := argmin_x { ψ(x) + (1/2)∥x − q∥² }. 49 / 167
  46. Stochastic gradient descent (SGD), with regularization. Observe z_t ∼ P(Z) and set ℓ_t(x) := ℓ(z_t, x). Compute a subgradient: g_t ∈ ∂_x ℓ_t(x_{t−1}). Update x: x_t = prox(x_{t−1} − η_t g_t | η_t ψ). Again only one data point per iteration and O(1) computation per iteration. 49 / 167
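A minimal sketch of proximal SGD with Polyak–Ruppert averaging; the name `sample_grad` is a placeholder for a user-supplied routine that draws a fresh z_t and returns a stochastic (sub)gradient g_t ∈ ∂ℓ_{z_t}(x):

```python
# Proximal SGD with Polyak-Ruppert averaging:
#   x_t = prox(x_{t-1} - eta_t * g_t | eta_t * psi),  bar_x_T = average of x_0, ..., x_T.
import numpy as np

def prox_sgd(sample_grad, prox_psi, x0, eta0, n_iter=1000):
    x = np.asarray(x0, dtype=float)
    x_bar = x.copy()
    for t in range(1, n_iter + 1):
        eta = eta0 / np.sqrt(t)            # eta_t = eta_0 / sqrt(t)
        g = sample_grad(x)                 # stochastic (sub)gradient at x_{t-1}
        x = prox_psi(x - eta * g, eta)
        x_bar += (x - x_bar) / (t + 1)     # running Polyak-Ruppert average
    return x_bar
```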
  47. Key point. The expectation of the stochastic gradient is the true gradient: with g_t = ∇ℓ_t(x_{t−1}), E_{z_t}[g_t] = E_{z_t}[∇ℓ(Z, x_{t−1})] = ∇E_{z_t}[ℓ(Z, x_{t−1})] = ∇L(x_{t−1}) ⇒ the stochastic gradient is an unbiased estimator of the true gradient. 50 / 167
  48. Behavior of SGD. (Figure: trajectories of SGD and batch gradient descent on a two-dimensional objective.) 51 / 167
  49. Convergence analysis of SGD. Assumptions: (A1) E[∥g_t∥²] ≤ G². (A2) E[∥x_t − x*∥²] ≤ D². (Assumption (A2) can be guaranteed by restricting the domain to a set of the form {x | ∥x∥ ≤ D/2}.) Theorem: let x̄_T = (1/(T+1))∑_{t=0}^T x_t (Polyak–Ruppert averaging). With step size η_t = η₀/√t, E_{z_{1:T}}[L_ψ(x̄_T) − L_ψ(x*)] ≤ (η₀G² + D²/η₀)/√T. With η₀ = D/G the bound becomes 2GD/√T, which is minimax optimal (up to constants). G does not depend on ψ → thanks to the proximal mapping. Note: for L1 regularization, ∥∂ψ(x)∥ ≤ C√p. 52 / 167
  50. Convergence analysis of SGD (smooth objective). Assumptions: (A1′) L is γ-smooth and E[∥g_t − E[g_t]∥²] = σ². (A2) E[∥x_t − x*∥²] ≤ D². Theorem: with x̄_T = (1/(T+1))∑_{t=0}^T x_t (Polyak–Ruppert averaging), E_{z_{1:T}}[L_ψ(x̄_T) − L_ψ(x*)] ≤ (1/(2T))∑_{t=1}^T (1/η_{t+1} − 1/η_t) D² + D²/(2Tη₁) + (1/(2T))∑_{t=1}^T (γ − 1/(2η_t)) ∥x_t − x_{t+1}∥² + (σ²/T)∑_{t=1}^T η_t. If η_t = (D/σ)(1/√t) and η_t < 1/(2γ), the expected error is bounded by O(σD/√T). If σ² = 0 (no noise), taking η_t = 1/(2γ) gives an expected error of O((γ/T)D²), recovering the rate of the ordinary gradient method. 53 / 167
  51. Proof. We show the case ψ = 0. Write ∇L(x_{t−1}) = g_t + ϵ_t, i.e., ϵ_t is the gradient noise, and recall x_t = x_{t−1} − η_t g_t. By smoothness, L(x_t) ≤ L(x_{t−1}) + ∇L(x_{t−1})⊤(x_t − x_{t−1}) + (γ/2)∥x_t − x_{t−1}∥². By convexity, L(x_{t−1}) + ∇L(x_{t−1})⊤(x* − x_{t−1}) ≤ L(x*). Adding the two,
  L(x_t) − L(x*) ≤ ∇L(x_{t−1})⊤(x_t − x*) + (γ/2)∥x_t − x_{t−1}∥²
  = ⟨g_t + ϵ_t, x_t − x*⟩ + (γ/2)∥x_t − x_{t−1}∥²
  ≤ −(1/η_t)⟨x_t − x_{t−1}, x_t − x*⟩ + ⟨ϵ_t, x_t − x_{t−1} + x_{t−1} − x*⟩ + (γ/2)∥x_t − x_{t−1}∥²
  ≤ −(1/η_t)⟨x_t − x_{t−1}, x_t − x*⟩ + (1/(4η_t))∥x_t − x_{t−1}∥² + η_t∥ϵ_t∥² + (γ/2)∥x_t − x_{t−1}∥² + ⟨ϵ_t, x_{t−1} − x*⟩
  = (1/(2η_t))(∥x_{t−1} − x*∥² − ∥x_{t−1} − x_t∥² − ∥x_t − x*∥²) + (1/(4η_t))∥x_t − x_{t−1}∥² + η_t∥ϵ_t∥² + (γ/2)∥x_t − x_{t−1}∥² + ⟨ϵ_t, x_{t−1} − x*⟩
  = (1/(2η_t))(∥x_{t−1} − x*∥² − ∥x_t − x*∥²) + (1/2)(γ − 1/(2η_t))∥x_{t−1} − x_t∥² + [η_t∥ϵ_t∥² + ⟨ϵ_t, x_{t−1} − x*⟩]
  (← the expectation of [·] is ≤ η_tσ² + 0). 54 / 167
  52. Proof (continued). Take expectations on both sides and sum over t = 1, ..., T. The evaluation is essentially the same as for the ordinary gradient method, except for an extra (σ²/T)∑_{t=1}^T η_t term caused by the noise. The variance-reduced gradient methods that appear later are analyzed with essentially the same inequality. 55 / 167
  53. Convergence analysis of SGD (strongly convex case). Assumptions: (A1) E[∥g_t∥²] ≤ G². (A3) L_ψ is μ-strongly convex. Theorem: let x̄_T = (1/(T+1))∑_{t=0}^T x_t. With step size η_t = 1/(μt), E_{z_{1:T}}[L_ψ(x̄_T) − L_ψ(x*)] ≤ G² log(T)/(Tμ). Faster than in the non-strongly-convex case, but not minimax optimal; the upper bound itself is tight (Rakhlin et al., 2012). 56 / 167
  54. Polynomial averaging for strongly convex objectives. Assumptions: (A1) E[∥g_t∥²] ≤ G². (A3) L_ψ is μ-strongly convex. Use the update x_t = prox( x_{t−1} − η_t (t/(t+1)) g_t | η_t ψ ) and take the weighted average x̄_T = (2/((T+1)(T+2)))∑_{t=0}^T (t+1) x_t. Theorem: for η_t = 2/(μt), E_{z_{1:T}}[L_ψ(x̄_T) − L_ψ(x*)] ≤ 2G²/(Tμ). The log(T) factor has disappeared; this is minimax optimal. 57 / 167
  55. Generalized step sizes and weighting schemes. Let s_t (t = 1, 2, ..., T+1) be a sequence with ∑_{t=1}^{T+1} s_t = 1. Update x_t = prox( x_{t−1} − η_t (s_t/s_{t+1}) g_t | η_t ψ ) for t = 1, ..., T, and average x̄_T = ∑_{t=0}^T s_{t+1} x_t. Assumptions: (A1) E[∥g_t∥²] ≤ G², (A2) E[∥x_t − x*∥²] ≤ D², (A3) L_ψ is μ-strongly convex. Theorem: E_{z_{1:T}}[L_ψ(x̄_T) − L_ψ(x*)] ≤ ∑_{t=1}^T (s_{t+1}η_{t+1}/2) G² + ∑_{t=0}^{T−1} max{ s_{t+2}/η_{t+1} − s_{t+1}(1/η_t + μ), 0 } D²/2, where we set 1/η₀ = 0 at t = 0. 58 / 167
  56. A special case. Make the weight s_t proportional to the step size η_t (regard the step size as an importance weight): s_t = η_t / ∑_{τ=1}^{T+1} η_τ. With this choice, the previous theorem gives E_{z_{1:T}}[L_ψ(x̄_T) − L_ψ(x*)] ≤ (∑_{t=1}^T η_t² G² + D²) / (2∑_{t=1}^T η_t). If ∑_{t=1}^∞ η_t = ∞ and ∑_{t=1}^∞ η_t² < ∞, the method converges: the step sizes reach far enough while decaying at an appropriate rate. 59 / 167
  57. Computational cost vs. generalization error. For strongly convex problems the optimal rate of the generalization error is O(1/n) (n: number of data points). To achieve a generalization error of O(1/n), the training error must also be driven down to O(1/n). Ordinary gradient method vs. SGD: cost per iteration n vs. 1; iterations to reach error ϵ: log(1/ϵ) vs. 1/ϵ; cost to reach error ϵ: n log(1/ϵ) vs. 1/ϵ; cost to reach error 1/n: n log(n) vs. n (Bottou, 2010). SGD is "faster" than the ordinary gradient method by a factor of O(log(n)): "no guarantee before seeing n data points" vs. "error 1/n after seeing n data points". 61 / 167
  58. Typical behavior. (Figure: training error and generalization error versus elapsed time for SGD and the batch gradient method; L1-regularized logistic regression with n = 15,000, p = 1,000.) At first SGD decreases both the training error and the generalization error quickly; on the training error, the batch gradient method that uses all the data eventually catches up and overtakes it, while on the generalization error SGD is often superior. 62 / 167
  59. Nesterov's acceleration for SGD. Assumptions: the expected loss L(x) is γ-smooth, and the gradient variance is σ²: E_Z[∥∇_β ℓ(Z, β) − ∇L(β)∥²] ≤ σ². → Nesterov's acceleration yields faster convergence. Accelerated SGD: Hu et al. (2009); accelerated SDA: Xiao (2010), Chen et al. (2012); more general frameworks including non-convex objectives: Lan (2012), Ghadimi and Lan (2012, 2013). (Non-strongly convex) E_{z_{1:T}}[L_ψ(x^{(T)})] − L_ψ(x*) ≤ C ( σD/√T + D²γ/T² ). (Strongly convex) E_{z_{1:T}}[L_ψ(x^{(T)})] − L_ψ(x*) ≤ C ( σ²/(μT) + exp(−C√(μ/γ) T) ). 63 / 167
  60. Further acceleration of accelerated SGD by minibatching. E_{z_{1:T}}[L_ψ(x^{(T)})] − L_ψ(x*) ≤ C ( σD/√T + D²γ/T² ), where σ² is the variance of the gradient estimator: E_Z[∥∇_β ℓ(Z, β) − ∇L(β)∥²] ≤ σ². The variance can be reduced by minibatching: g = ∇ℓ(z, x^{(t−1)}) ⇒ g = (1/K)∑_{k=1}^K ∇ℓ(z_k, x^{(t−1)}), which reduces the variance from σ² to σ²/K. Computing the K gradients can be parallelized. If σ → 0, the bound above becomes O(1/T²), the same rate as the non-stochastic Nesterov method. 64 / 167
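The variance reduction by minibatching is just averaging K independent stochastic gradients; a tiny sketch (the per-example gradient routine is a placeholder):

```python
# Minibatch gradient: averaging K i.i.d. stochastic gradients divides the variance by K.
import numpy as np

def minibatch_grad(sample_grad, x, K, rng):
    # sample_grad(x, rng) returns one stochastic gradient; average K of them.
    return np.mean([sample_grad(x, rng) for _ in range(K)], axis=0)
```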
  61. (Figure: numerical experiments on synthetic data; (a) objective for L1 regularization (Lasso), (b) objective for elastic-net regularization; figure from Chen et al. (2012).) SAGE: accelerated SGD (Hu et al., 2009); AC-RDA: accelerated stochastic RDA (Xiao, 2010); AC-SA: accelerated stochastic approximation (Ghadimi and Lan, 2012); ORDA: optimal stochastic RDA (Chen et al., 2012). 65 / 167
  62. Summary of online stochastic optimization. Each update only needs to look at a single data point. It reaches a reasonable generalization error quickly. It can be combined with Nesterov's acceleration: Hu et al. (2009), Xiao (2010), Chen et al. (2012). There are also methods that adapt the metric automatically, such as AdaGrad (Duchi et al., 2011). Convergence rates of the expected error: GR/√T (non-smooth, non-strongly convex; Polyak–Ruppert averaging); G²/(μT) (non-smooth, strongly convex; polynomial averaging); σR/√T + R²L/T² (smooth, non-strongly convex; accelerated); σ²/(μT) + exp(−√(μ/L) T) (smooth, strongly convex; accelerated). G: bound on the gradient norm, R: radius of the domain, L: smoothness, μ: strong convexity, σ: gradient variance. 66 / 167
  63. Minimax optimal rate of stochastic first-order methods, min_{x∈B} L(x) = min_{x∈B} E_Z[ℓ(Z, x)]. Conditions: the expectation of ĝ_x ∈ ∂_x ℓ(Z, x) is bounded, ∥E[ĝ_x]∥ ≤ G (∀x ∈ B); the domain B contains a ball of radius R; L(x) is μ-strongly convex (μ = 0 is allowed). Theorem (minimax optimal rate (Agarwal et al., 2012, Nemirovsky and Yudin, 1983)): for any stochastic first-order method there exist a loss ℓ and a distribution P(Z) satisfying the above conditions such that E[L(x^{(T)}) − L(x*)] ≥ c min{ GR/√T, G²/(μT), GR/√p }. SGD attains this optimal rate. A first-order method is an algorithm that depends only on the loss values and gradients (ℓ(Z, x), ĝ_x) at the query points x (SGD is such a method). 67 / 167
  64. Outline (repeated). 1 Basic formulation of statistical learning. 2 Uniform bounds. 3 Optimality. 4 Bayesian learning theory. 5 Optimization in machine learning and the proximal gradient method. 6 Overview of stochastic optimization. 7 Online stochastic optimization. 8 Batch stochastic optimization: stochastic variance reduced gradient methods. 9 Appendix: Convex analysis; Duality. 68 / 167
  65. Difference between online and batch. In the batch setting the data are fixed and we simply minimize the training error: P(x) = (1/n)∑_{i=1}^n ℓ_i(x) + ψ(x). Batch-type stochastic methods: use only one data point per update (like the online type); converge linearly (unlike the online type): T > (n + γ/λ) log(1/ϵ) updates achieve error ϵ, assuming a γ-smooth loss and a λ-strongly convex regularizer. Note: the ordinary gradient method needs nγ/λ · log(1/ϵ) computation. We want the best of both the stochastic and the non-stochastic methods. 69 / 167
  66. Three representative methods. Stochastic variance reduced gradient descent, SVRG (Johnson and Zhang, 2013, Xiao and Zhang, 2014). Stochastic average gradient descent, SAG (Le Roux et al., 2012, Schmidt et al., 2013, Defazio et al., 2014). Stochastic dual coordinate ascent, SDCA (Shalev-Shwartz and Zhang, 2013). SAG and SVRG solve the primal problem; SDCA solves the dual problem. 71 / 167
  67. Problem setting. P(x) = (1/n)∑_{i=1}^n ℓ_i(x) (smooth) + ψ(x) (strongly convex). Assumptions: each loss ℓ_i is γ-smooth; the regularizer ψ is λ-strongly convex; typically λ = O(1/n) or O(1/√n). Examples: losses: smoothed hinge loss, logistic loss; regularizers: L2 regularization, elastic-net regularization ψ̃(x) + λ∥x∥². 72 / 167
  68. Outline (repeated). 1 Basic formulation of statistical learning. 2 Uniform bounds. 3 Optimality. 4 Bayesian learning theory. 5 Optimization in machine learning and the proximal gradient method. 6 Overview of stochastic optimization. 7 Online stochastic optimization. 8 Batch stochastic optimization: stochastic variance reduced gradient methods. 9 Appendix: Convex analysis; Duality. 73 / 167
  69. Methods for the primal problem. Idea: make the variance of the gradient small. How should (1/n)∑_{i=1}^n ℓ_i(x) be approximated by ⟨g, x⟩ + ψ(x)? Online-type approach: pick î ∈ {1, ..., n} uniformly at random and linearize: g = ∇ℓ_î(x) ⇒ E[g] = (1/n)∑_{i=1}^n ∇ℓ_i(x), i.e., an unbiased estimator of the gradient. What about the variance? → The variance is the problem! → In the batch setting the variance can easily be reduced. 74 / 167
  70. Stochastic variance reduced gradient (SVRG) (Johnson and Zhang, 2013, Xiao and Zhang, 2014), min_x {L(x) + ψ(x)} = min_x {(1/n)∑_{i=1}^n ℓ_i(x) + ψ(x)}. For a reference point x̂ close to the current x, form the variance-reduced gradient g = ∇ℓ_i(x) − ∇ℓ_i(x̂) + (1/n)∑_{j=1}^n ∇ℓ_j(x̂), where the last term equals ∇L(x̂). Bias: unbiased, since E[g] = (1/n)∑_{i=1}^n [∇ℓ_i(x) − ∇ℓ_i(x̂) + ∇L(x̂)] = (1/n)∑_{i=1}^n ∇ℓ_i(x) = ∇L(x). What about the variance? 75 / 167
  71. Variance reduction. g = ∇ℓ_i(x) − ∇ℓ_i(x̂) + ∇L(x̂). Variance: Var[g] = (1/n)∑_{i=1}^n ∥∇ℓ_i(x) − ∇ℓ_i(x̂) + ∇L(x̂) − ∇L(x)∥² = (1/n)∑_{i=1}^n ∥∇ℓ_i(x) − ∇ℓ_i(x̂)∥² − ∥∇L(x̂) − ∇L(x)∥² (∵ Var[X] = E[∥X∥²] − ∥E[X]∥²) ≤ (1/n)∑_{i=1}^n ∥∇ℓ_i(x) − ∇ℓ_i(x̂)∥² ≤ γ²∥x − x̂∥². If the ℓ_i are smooth and x and x̂ are close, the variance is small. 76 / 167
  72. The SVRG procedure. SVRG: for t = 1, 2, ..., repeat: (1) set x̂ = x̂^{(t−1)}, x^{[0]} = x̂, and compute the full gradient ĝ = ∇L(x̂) = (1/n)∑_{i=1}^n ∇ℓ_i(x̂); (2) for k = 1, ..., m: (a) sample i ∈ {1, ..., n} uniformly; (b) compute the variance-reduced gradient g = ∇ℓ_i(x^{[k−1]}) − ∇ℓ_i(x̂) + ĝ; (c) update x^{[k]} = prox(x^{[k−1]} − ηg | ηψ); (3) set x̂^{(t)} = (1/m)∑_{k=1}^m x^{[k]}. Number of gradient evaluations up to iteration t: t × (n + 2m). 78 / 167
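A compact sketch of the loop above (not a reference implementation; `grad_i(i, x)` returning ∇ℓ_i(x) and `prox_psi` are user-supplied placeholders):

```python
# SVRG: full gradient at a reference point, variance-reduced gradients in the inner loop.
import numpy as np

def svrg(grad_i, prox_psi, n, x0, eta, m, n_outer=20, rng=None):
    rng = rng or np.random.default_rng(0)
    x_ref = np.asarray(x0, dtype=float)
    for _ in range(n_outer):
        g_full = np.mean([grad_i(i, x_ref) for i in range(n)], axis=0)  # full gradient at x_ref
        x = x_ref.copy()
        x_sum = np.zeros_like(x)
        for _ in range(m):
            i = rng.integers(n)
            g = grad_i(i, x) - grad_i(i, x_ref) + g_full   # variance-reduced gradient
            x = prox_psi(x - eta * g, eta)
            x_sum += x
        x_ref = x_sum / m                                   # average of the inner iterates
    return x_ref
```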
  73. Convergence analysis. Assumptions: ℓ_i is γ-smooth and ψ is λ-strongly convex. Theorem: if the step size η and the number of inner iterations m satisfy η < 1/(4γ) and ρ := 1/(ηλ(1 − 4γη)m) + 4γ(m+1)η/((1 − 4γη)m) < 1, then after T outer iterations E[P(x̂^{(T)}) − P(x*)] ≤ ρ^T (P(x̂^{(0)}) − P(x*)). The assumptions require m ≥ Ω(γ/λ). Computation per inner loop: O(n + m); number of outer iterations to reach error ϵ: T = O(log(1/ϵ)) ⇒ total computation: O((n + m) log(1/ϵ)) = O((n + γ/λ) log(1/ϵ)). 79 / 167
  74. Comparison with non-stochastic methods. Compare the computation needed to reach E[P(x^{(T)}) − P(x*)] ≤ ϵ; let κ = γ/λ (condition number). SVRG: (n + κ) log(1/ϵ), i.e., Ω((n + κ) log(1/ϵ)) iterations × Ω(1) per iteration. Non-stochastic gradient method: nκ log(1/ϵ), i.e., Ω(κ log(1/ϵ)) iterations × Ω(n) per iteration. With n = 100,000 data points, regularization parameter λ = 1/1000, and smoothness γ = 1: n×κ = 10⁸ vs. n + κ ≈ 10⁵, a factor of 1000. 80 / 167
  75. Catalyst: acceleration of SVRG, SAG, SAGA (a generic scheme). Catalyst (Lin et al., 2015): for t = 1, 2, ...: (1) solve the problem with strengthened strong convexity, x^{(t)} ≃ argmin_x { P(x) + (α/2)∥x − y^{(t−1)}∥² } (up to ϵ_t precision); (2) "accelerate" the solution: y^{(t)} = x^{(t)} + β_t (x^{(t)} − x^{(t−1)}). Catalyst corresponds to an accelerated approximate proximal point method. Taking ϵ_t = C(1 − √(λ/(2(λ+α))))^t gives P(x^{(t)}) − P(x*) ≤ C′(1 − √(λ/(2(λ+α))))^t. Choosing α = max{cγ/n, λ} and running SVRG, SAG, or SAGA in the inner loop gives total computation (n + √(γn/λ)) log(1/ϵ). It is a generic scheme, but it is sensitive to the number of inner-loop updates and to the choice of α. APPA (Frostig et al., 2015) is a similar method. 82 / 167
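The Catalyst outer loop is simple once an inner solver for the strongly-convexified subproblem is available; a schematic sketch, in which the inner solver, α, and the β_t schedule are placeholders to be chosen as in Lin et al. (2015):

```python
# Catalyst outer loop: approximately minimize P(x) + (alpha/2)*||x - y||^2, then extrapolate.
import numpy as np

def catalyst(inner_solver, x0, alpha, beta_schedule, n_outer=50):
    # inner_solver(y, alpha, x_init): approximate argmin_x P(x) + (alpha/2)*||x - y||^2
    x_prev = np.asarray(x0, dtype=float)
    y = x_prev.copy()
    x = x_prev.copy()
    for t in range(1, n_outer + 1):
        x = inner_solver(y, alpha, x)                 # warm-started strongly convex subproblem
        y = x + beta_schedule(t) * (x - x_prev)       # extrapolation ("acceleration") step
        x_prev = x
    return x
```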
  76. Katyusha. Katyusha = SVRG + inner acceleration + Katyusha momentum (Allen-Zhu, 2016, 2017) (STOC 2017). Katyusha uses a "negative momentum" in the inner loop: y_k = θ{x_{k−1} + β(x_{k−1} − x_{k−2})} (acceleration) + (1 − θ) x₀ (magnet) = x_{k−1} + θβ(x_{k−1} − x_{k−2}) + (1 − θ)(x₀ − x_{k−1}) (Katyusha momentum). 84 / 167
  77. Katyusha (Allen-Zhu (2016)). For s = 0, ..., S − 1: (1) μ^{(s)} = ∇L(x̂^{(s)}). (2) For j = 0, ..., m − 1: (a) k = sm + j; (b) x^{[k+1]} = τ₁ z^{[k]} + τ₂ x̂^{(s)} + (1 − τ₁ − τ₂) y^{[k]} (acceleration step); (c) sample i ∈ {1, ..., n} uniformly and compute the variance-reduced gradient g = μ^{(s)} + ∇ℓ_i(x^{[k+1]}) − ∇ℓ_i(x̂^{(s)}); (d) z^{[k+1]} = argmin_z { (1/(2α))∥z − z^{[k]}∥² + ⟨g, z⟩ + ψ(z) }; (e) y^{[k+1]} = argmin_y { (3L/2)∥y − x^{[k+1]}∥² + ⟨g, y⟩ + ψ(y) }. (3) x̂^{(s+1)} = (∑_{j=0}^{m−1}(1+ασ)^j)^{-1} ∑_{j=0}^{m−1}(1+ασ)^j y^{[sm+j+1]}. The usual acceleration for strongly convex functions would use x^{[k+1]} = τ₁ z^{[k]} + (1 − τ₁) y^{[k]}; adding the x̂^{(s)} component achieves acceleration for SVRG as well. → It pulls the iterate back toward a previous solution, hence "negative momentum". τ₁, τ₂, α are set appropriately (next page). 85 / 167
  78. Computational complexity of Katyusha (strongly convex). When ℓ_i is γ-smooth and ψ is λ-strongly convex: with m = 2n, τ₁ = min{√(mλ/(3γ)), 1/2}, τ₂ = 1/2, α = 1/(3τ₂γ), error ϵ is reached after (n + √(nγ/λ)) log(1/ϵ) updates. Improvement: SVRG (n + γ/λ) log(1/ϵ) → Katyusha (n + √(nγ/λ)) log(1/ϵ). The accelerated proximal gradient method needs O(n√(γ/λ) log(1/ϵ)), so Katyusha is up to √n times faster; for n = 10⁶ that is 1000 times faster. 86 / 167
  79. Computational complexity of Katyusha (non-strongly convex). When ℓ_i is γ-smooth and ψ is not necessarily strongly convex: with m = 2n, τ₁ = 2/(s+4), τ₂ = 1/2, α = 1/(3τ₁γ) (τ₁ and α change with each s), error ϵ is reached after (n/√ϵ)√(P(x̂^{(0)}) − P(x*)) + (√(nγ)/√ϵ)∥x̂^{(0)} − x*∥ = Ω(n/√ϵ) updates. Combining with AdaptReg (Allen-Zhu and Hazan (2016)) improves this to n log(1/ϵ) + √(nγ/ϵ). ⇒ Compared with the accelerated proximal gradient method, O(n√(γ/ϵ)), this is √n times faster. 87 / 167
  80. Summary of Katyusha's complexity. It achieves an accelerated SVRG. Comparison of computational complexity (κ = L/μ: condition number), μ-strongly convex / non-strongly convex: GD (accelerated): O(n√κ log(1/ϵ)) / O(n√(L/ϵ)); SVRG: O((n + κ) log(1/ϵ)) / O((n + L)/ϵ); Katyusha: O((n + √(nκ)) log(1/ϵ)) / O(n log(1/ϵ) + √(nL/ϵ)). The accelerated rate in the strongly convex case is also achieved by the stochastic coordinate descent method APCG (Lin et al., 2014) and by the primal-dual method SPDC (Zhang and Lin, 2015). 88 / 167
  81. Doubly accelerated stochastic variance reduced dual averaging method (Murata and Suzuki, NIPS2017), min_x P(x) = (1/n)∑_{i=1}^n f_i(x) + ψ(x) = F(x) + ψ(x). Minibatch method: at each update pick a random minibatch I ⊂ [n] (|I| = b) and set g_t = (1/|I|)∑_{i∈I}(∇_x f_i(x_t) − ∇_x f_i(x̂)) + (1/n)∑_{i=1}^n ∇_x f_i(x̂). Minibatch SVRG + double acceleration. 90 / 167
  82. DASVRDA (continued). It improves the dependence on the minibatch size, with fewer tuning parameters than Katyusha. Computation to reach error ϵ (b = |I|: minibatch size): strongly convex: Katyusha O((n + √(bnκ)) log(1/ϵ)) vs. DASVRDA O((n + √(nκ) + b√κ) log(1/ϵ)); non-strongly convex: Katyusha O(n log(1/ϵ) + √b·√(nL/ϵ)) vs. DASVRDA O(n log(1/ϵ) + √(nL/ϵ) + b√(L/ϵ)). Up to √b times faster. 90 / 167
  83. Algorithm (Murata and Suzuki, 2017). DASVRDA(x̃₀, η, m, b, S): iterate the following for s = 1, 2, ..., S: ỹ_s = x̃_{s−1} + ((s−2)/(s+1))(x̃_{s−1} − x̃_{s−2}) + ((s−1)/(s+1))(z̃_{s−1} − x̃_{s−2}) (outer acceleration); (x̃_s, z̃_s) = One-Pass-AccSVRDA(ỹ_s, x̃_{s−1}, η, β, m, b). 92 / 167
  84. Algorithm (continued). In the strongly convex case, a restart scheme is applied on top. DASVRDAsc(x̌₀, η, m, b, S, T): iterate the following for t = 1, 2, ..., T: x̌_t = DASVRDA(x̌_{t−1}, η, m, b, S). 92 / 167
  85. One-Pass-AccSVRDA(x₀, x̃, η, m, b). Iterate the following for k = 1, 2, ..., m: sample I_k ⊂ {1, ..., n} of size b uniformly; y_k = x_{k−1} + ((k−1)/(k+1))(x_{k−1} − x_{k−2}) (inner acceleration); v_k = (1/b)∑_{i∈I_k}(∇f_i(y_k) − ∇f_i(x̃)) + ∇F(x̃) (variance reduction); v̄_k = (1 − 2/(k+1)) v̄_{k−1} + (2/(k+1)) v_k (dual averaging); z_k = prox(x₀ − (ηk(k+1)/4) v̄_k | (ηk(k+1)/4) ψ) (proximal gradient step); x_k = (1 − 2/(k+1)) x_{k−1} + (2/(k+1)) z_k (inner acceleration). Output: (x_m, z_m). Idea: combination of inner acceleration and outer acceleration (double acceleration). Outer acceleration: APG (accelerated proximal gradient) with the new momentum term ((s−1)/(s+1))(z̃_{s−1} − x̃_{s−2}). Inner acceleration: AccSVRDA = AccSDA (Xiao, 2009) + variance reduction; techniques from AccProxSVRG (Nitanda, 2014) are also employed → good minibatch efficiency, exploiting the variance reduction from minibatching for more aggressive acceleration. 93 / 167
  86. Convergence analysis. Assumptions: the objective P is μ-strongly convex; ℓ_i is γ-smooth. Theorem (strongly convex): with η = O(1/((1 + n/b²)γ)), S = O(1 + (b/n)√(γ/μ) + √(γ/(nμ))), and T = O(log(1/ϵ)), DASVRDAsc(x̌₀, η, n/b, b, S, T) (with the two-stage acceleration) achieves E[P(x̌_T) − P(x*)] ≤ ϵ with O((n + √(nγ/μ) + b√(γ/μ)) log(1/ϵ)) gradient evaluations. Katyusha O((n + √(nb)·√κ) log(1/ϵ)) vs. DASVRDA O((n + (√n + b)√κ) log(1/ϵ)) (κ := γ/μ) → min{√b, √(n/b)} times faster. With b = max{√n, n/√κ}, the number of updates is √κ log(1/ϵ). 94 / 167
  87. Convergence analysis (non-strongly convex). Assumption: ℓ_i is γ-smooth. Theorem: with η = O(1/((1 + n/b²)γ)) and S = O(1 + (b/n)√(γ/ϵ) + √(γ/(nϵ))), DASVRDA(x₀, η, n/b, b, S) achieves E[P(x̃_S) − P(x*)] ≤ ϵ with O(n log(1/ϵ) + √(nγ/ϵ) + b√(γ/ϵ)) gradient evaluations. (A warm start is needed at the beginning to achieve this rate.) Katyusha O(n log(1/ϵ) + √(nb)·√(γ/ϵ)) vs. DASVRDA O(n log(1/ϵ) + (√n + b)√(γ/ϵ)) → min{√b, √(n/b)} times faster. With b = √n, the number of updates is √(γ/ϵ). 95 / 167
  88. Optimality of the minibatch size. With an appropriate minibatch size, DASVRDA is nearly optimal both in the number of updates and in the total computation (Nitanda, Murata, and Suzuki, 2019) (see also Nesterov (2004), Woodworth and Srebro (2016), Arjevani and Shamir (2016)). Total computation (number of gradient evaluations) = minibatch size × number of updates. GD: total n√κ log(1/ϵ), updates √κ log(1/ϵ); DASVRDA (with b*): total O((n + √(nκ)) log(1/ϵ)), updates O(√κ log(1/ϵ)); optimal: total Ω(n + √(nκ) log(1/ϵ)), updates Ω(√κ log(1/ϵ)). (In the strongly convex case, the optimal minibatch size is b* = √n + n/(√κ log(1/ϵ)).) For example, when n ≤ κ, taking a minibatch size b ≫ b* = Θ(√n) is wasteful; that is, large-batch methods such as the ordinary gradient method gain nothing either way. (Figure: total computation versus minibatch size.) The non-strongly-convex case is analogous. 97 / 167
  89. Summary of batch stochastic optimization methods. Properties of the methods, for SDCA / SVRG / SAG and SAGA: primal or dual: dual / primal / primal; memory efficiency: – / – / △; acceleration (μ > 0): – / Katyusha, DASVRDA / Catalyst; other restrictions: ℓ_i(β) = f_i(x_i⊤β) (SDCA) / double loop (SVRG) / smooth regularization (SAG/SAGA). 98 / 167
  90. Summary of batch stochastic optimization methods (continued). Convergence rates (omitting log factors in γ, μ), for μ > 0 / μ = 0 / accelerated (μ > 0): proximal gradient method: (nγ/μ) log(1/ϵ) / n√(γ/ϵ) / (accelerated) n√(γ/μ) log(1/ϵ); SDCA & SVRG: (n + γ/μ) log(1/ϵ) / O(n log(1/ϵ) + √(nγ/ϵ)) / (n + √(nγ/μ)) log(1/ϵ); SAG/SAGA: (n + γ/μ) log(1/ϵ) / γn/ϵ (*) / (n + √(nγ/μ)) log(1/ϵ). (*) For μ = 0, Catalyst achieves O((n + √(nγ/ϵ)) log²(1/ϵ)), and Katyusha combined with AdaptReg achieves O(n log(1/ϵ) + √(nγ/ϵ)). Accelerated proximal gradient n√(γ/μ) log(1/ϵ) ≥ stochastic optimization (n + √(nγ/μ)) log(1/ϵ). 98 / 167
  91. Useful references. Yurii Nesterov: Introductory Lectures on Convex Optimization: A Basic Course. Kluwer Academic Publishers, 2004. ★ Guanghui Lan: First-order and Stochastic Optimization Methods for Machine Learning. Springer Series in the Data Sciences, Springer, 2020. T. Kanamori, T. Suzuki, I. Takeuchi, I. Sato: Continuous Optimization for Machine Learning (in Japanese). Kodansha, 2016. T. Suzuki: Stochastic Optimization (in Japanese). Kodansha, 2015. 99 / 167
  92. Outline (repeated). 1 Basic formulation of statistical learning. 2 Uniform bounds. 3 Optimality. 4 Bayesian learning theory. 5 Optimization in machine learning and the proximal gradient method. 6 Overview of stochastic optimization. 7 Online stochastic optimization. 8 Batch stochastic optimization. 9 Appendix: Convex analysis; Duality. 100 / 167
  93. Convex set. Definition (Convex set): A convex set is a set that contains the segment connecting any two points in the set: x₁, x₂ ∈ C ⇒ θx₁ + (1 − θ)x₂ ∈ C (θ ∈ [0, 1]). (Figure: a convex set and two non-convex sets.) 101 / 167
  94. Epigraph and domain. Let R̄ := R ∪ {∞}. Definition (Epigraph and domain): The epigraph of a function f : R^p → R̄ is given by epi(f) := {(x, μ) ∈ R^{p+1} : f(x) ≤ μ}. The domain of a function f : R^p → R̄ is given by dom(f) := {x ∈ R^p : f(x) < ∞}. (Figure: the epigraph and domain of a function.) 102 / 167
  95. Convex function. Let R̄ := R ∪ {∞}. Definition (Convex function): A function f : R^p → R̄ is a convex function if f satisfies θf(x) + (1 − θ)f(y) ≥ f(θx + (1 − θ)y) (∀x, y ∈ R^p, θ ∈ [0, 1]), where ∞ + ∞ = ∞ and ∞ ≤ ∞. (Figure: a convex and a non-convex function.) f is convex ⇔ epi(f) is a convex set. 103 / 167
  96. Proper and closed convex function. If the domain of a function f is not empty (dom(f) ≠ ∅), f is called proper. If the epigraph of a convex function f is a closed set, then f is called closed. (We are only interested in proper closed functions in this lecture.) Even if f is closed, its domain is not necessarily closed (even in 1D). "f is closed" does not imply "f is continuous." A closed convex function is continuous on any segment in its domain. A closed function is lower semicontinuous. 104 / 167
  97. Convex loss functions (regression). All commonly used loss functions are (closed) convex. The following are convex w.r.t. u for a fixed label y ∈ R. Squared loss: ℓ(y, u) = (1/2)(y − u)². τ-quantile loss: ℓ(y, u) = (1 − τ) max{u − y, 0} + τ max{y − u, 0} for some τ ∈ (0, 1); used for quantile regression. ϵ-sensitive loss: ℓ(y, u) = max{|y − u| − ϵ, 0} for some ϵ > 0; used for support vector regression. (Figure: the squared, τ-quantile, ϵ-sensitive, and Huber losses as functions of f − y.) 105 / 167
  98. Convex surrogate losses (classification), y ∈ {±1}. Logistic loss: ℓ(y, u) = log((1 + exp(−yu))/2). Hinge loss: ℓ(y, u) = max{1 − yu, 0}. Exponential loss: ℓ(y, u) = exp(−yu). Smoothed hinge loss: ℓ(y, u) = 0 (yu ≥ 1), 1/2 − yu (yu < 0), (1/2)(1 − yu)² (otherwise). (Figure: the 0-1, logistic, exponential, hinge, and smoothed hinge losses as functions of the margin yf.) 106 / 167
  99. Convex regularization functions. Ridge regularization: R(x) = ∥x∥²₂ := ∑_{j=1}^p x_j². L1 regularization: R(x) = ∥x∥₁ := ∑_{j=1}^p |x_j|. Trace norm regularization: R(X) = ∥X∥_tr = ∑_{j=1}^{min{q,r}} σ_j(X), where σ_j(X) ≥ 0 is the j-th singular value. (Figure: unit balls of the bridge, L1, ridge, and elastic-net penalties.) (1/n)∑_{i=1}^n (y_i − z_i⊤x)² + λ∥x∥₁: Lasso. (1/n)∑_{i=1}^n log(1 + exp(−y_i z_i⊤x)) + λ∥X∥_tr: low-rank matrix recovery. 107 / 167
  100. Other definitions of sets. Convex hull: conv(C) is the smallest convex set that contains a set C ⊆ R^p. Affine set: a set A is an affine set if and only if ∀x, y ∈ A, the line through x and y lies in A: λx + (1 − λ)y ∈ A, ∀λ ∈ R. Affine hull: the smallest affine set that contains a set C ⊆ R^p. Relative interior: ri(C). Let A be the affine hull of a convex set C ⊆ R^p; ri(C) is the set of interior points of C with respect to the relative topology induced by the affine hull A. (Figure: convex hull, affine hull, relative interior.) 108 / 167
  101. Continuity of a closed convex function Theorem For a (possibly

    non-convex) function f : Rp → ¯ R, the following three conditions are equivalent to each other. 1 f is lower semi-continuous. 2 For any converging sequence {xn}∞ n=1 ⊆ Rp s.t. x∞ = limn xn , lim infn f (xn ) ≥ f (x∞ ). 3 f is closed. Remark: Any convex function f is continuous in ri(dom(f )). The continuity could be broken on the boundary of the domain. 109 / 167
  102. Outline (repeated). 1 Basic formulation of statistical learning. 2 Uniform bounds. 3 Optimality. 4 Bayesian learning theory. 5 Optimization in machine learning and the proximal gradient method. 6 Overview of stochastic optimization. 7 Online stochastic optimization. 8 Batch stochastic optimization. 9 Appendix: Convex analysis; Duality. 110 / 167
  103. Subgradient. We want to deal with non-differentiable functions such as the L1 regularizer. To do so, we need to define something that plays the role of a gradient. Definition (Subdifferential, subgradient): For a proper convex function f : R^p → R̄, the subdifferential of f at x ∈ dom(f) is defined by ∂f(x) := {g ∈ R^p | ⟨x′ − x, g⟩ + f(x) ≤ f(x′) (∀x′ ∈ R^p)}. An element of the subdifferential is called a subgradient. (Figure: a function f(x) and several subgradients at a kink.) 111 / 167
  104. Properties of subgradient Subgradient does not necessarily exist (∂f (x)

    could be empty). f (x) = x log(x) (x ≥ 0) is proper convex but not subdifferentiable at x = 0. Subgradient always exists on ri(dom(f )). 112 / 167
  105. Properties of subgradient Subgradient does not necessarily exist (∂f (x)

    could be empty). f (x) = x log(x) (x ≥ 0) is proper convex but not subdifferentiable at x = 0. Subgradient always exists on ri(dom(f )). If f is differentiable at x, its gradient is the unique element of subdiff. ∂f (x) = {∇f (x)}. 112 / 167
  106. Properties of subgradient Subgradient does not necessarily exist (∂f (x)

    could be empty). f (x) = x log(x) (x ≥ 0) is proper convex but not subdifferentiable at x = 0. Subgradient always exists on ri(dom(f )). If f is differentiable at x, its gradient is the unique element of subdiff. ∂f (x) = {∇f (x)}. If ri(dom(f )) ∩ ri(dom(h)) ̸= ∅, then ∂(f + h)(x) = ∂f (x) + ∂h(x) = {g + g′ | g ∈ ∂f (x), g′ ∈ ∂h(x)} (∀x ∈ dom(f ) ∩ dom(h)). 112 / 167
  107. Properties of subgradient Subgradient does not necessarily exist (∂f (x)

    could be empty). f (x) = x log(x) (x ≥ 0) is proper convex but not subdifferentiable at x = 0. Subgradient always exists on ri(dom(f )). If f is differentiable at x, its gradient is the unique element of subdiff. ∂f (x) = {∇f (x)}. If ri(dom(f )) ∩ ri(dom(h)) ̸= ∅, then ∂(f + h)(x) = ∂f (x) + ∂h(x) = {g + g′ | g ∈ ∂f (x), g′ ∈ ∂h(x)} (∀x ∈ dom(f ) ∩ dom(h)). For all g ∈ ∂f (x) and all g′ ∈ ∂f (x′) (x, x′ ∈ dom(f )), ⟨g − g′, x − x′⟩ ≥ 0. 112 / 167
  108. Legendre transform. Defines another representation on the dual space (the space of gradients). Definition (Legendre transform): Let f be a (possibly non-convex) function f : R^p → R̄ s.t. dom(f) ≠ ∅. Its convex conjugate is given by f*(y) := sup_{x∈R^p} {⟨x, y⟩ − f(x)}. The map from f to f* is called the Legendre transform. (Figure: f(x), a supporting line with slope y, and the value f*(y) read off at the maximizer x*.) 113 / 167
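As a one-line worked example of the definition (an illustration, not from the slide): for f(x) = ½x² the supremum is attained at x = y, giving

```latex
f^*(y) = \sup_{x \in \mathbb{R}} \bigl\{ xy - \tfrac{1}{2}x^2 \bigr\}
       = y \cdot y - \tfrac{1}{2}y^2
       = \tfrac{1}{2}y^2 ,
```

which matches the first row of the conjugate table on the next slide.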
  109. Examples of conjugate pairs f(x) → f*(y). Squared loss: (1/2)x² → (1/2)y². Hinge loss: max{1 − x, 0} → y if −1 ≤ y ≤ 0, ∞ otherwise. Logistic loss: log(1 + exp(−x)) → (−y) log(−y) + (1 + y) log(1 + y) if −1 ≤ y ≤ 0, ∞ otherwise. L1 regularization: ∥x∥₁ → 0 if max_j |y_j| ≤ 1, ∞ otherwise. Lp regularization: ∑_{j=1}^d |x_j|^p → ∑_{j=1}^d ((p − 1)/p^{p/(p−1)}) |y_j|^{p/(p−1)} (p > 1). (Figure: the logistic loss and its dual; the L1 norm and its dual.) 114 / 167
  110. Properties of the Legendre transform. f* is a convex function even if f is not. f** is the closure of the convex hull of f: f** = cl(conv(f)). Corollary: the Legendre transform is a bijection from the set of proper closed convex functions onto the set of proper closed convex functions defined on the dual space: f (proper closed convex) ⇔ f* (proper closed convex). (Figure: a non-convex f(x) and cl(conv(f)).) 115 / 167
  111. Connection to subgradient Lemma y ∈ ∂f (x) ⇔ f

    (x) + f ∗(y) = ⟨x, y⟩ ⇔ x ∈ ∂f ∗(y). ∵ y ∈ ∂f (x) ⇒ x = argmax x′∈Rp {⟨x′, y⟩ − f (x′)} (take the “derivative” of ⟨x′, y⟩ − f (x′)) ⇒ f ∗(y) = ⟨x, y⟩ − f (x). Remark: By definition, we always have f (x) + f ∗(y) ≥ ⟨x, y⟩. → Young-Fenchel’s inequality. 116 / 167
  112. ⋆ Fenchel’s duality theorem Theorem (Fenchel’s duality theorem) Let f

    : Rp → ¯ R, g : Rq → ¯ R be proper closed convex, and A ∈ Rq×p. Suppose that either of condition (a) or (b) is satisfied, then it holds that inf x∈Rp {f (x) + g(Ax)} = sup y∈Rq {−f ∗(A⊤y) − g∗(−y)}.   (a) ∃x ∈ Rp s.t. x ∈ ri(dom(f )) and Ax ∈ ri(dom(g)). (b) ∃y ∈ Rq s.t. A⊤y ∈ ri(dom(f ∗)) and −y ∈ ri(dom(g∗)).   If (a) is satisfied, there exists y∗ ∈ Rq that attains sup of the RHS. If (b) is satisfied, there exists x∗ ∈ Rp that attains inf of the LHS. Under (a) and (b), x∗, y∗ are the optimal solutions of the each side iff A⊤y∗ ∈ ∂f (x∗), Ax∗ ∈ ∂g∗(−y∗). → Karush-Kuhn-Tucker condition. 117 / 167
  113. Applying Fenchel's duality theorem to RERM. RERM (Regularized Empirical Risk Minimization): let ℓ_i(z_i⊤x) = ℓ(y_i, z_i⊤x), where (z_i, y_i) is the input–output pair of the i-th observation. (Primal) inf_{x∈R^p} { ∑_{i=1}^n ℓ_i(z_i⊤x) + ψ(x) }, where the first term is f(Zx). [Fenchel's duality theorem] inf_{x∈R^p} {f(Zx) + ψ(x)} = − inf_{y∈R^n} {f*(y) + ψ*(−Z⊤y)}. (Dual) sup_{y∈R^n} { −∑_{i=1}^n ℓ*_i(y_i) − ψ*(−Z⊤y) }. This fact will be used to derive the stochastic dual coordinate ascent algorithm. 119 / 167
  114. A. Agarwal, P. L. Bartlett, P. Ravikumar, and M. J.

    Wainwright. Information-theoretic lower bounds on the oracle complexity of stochastic convex optimization. IEEE Transcations on Information Theory, 58(5): 3235–3249, 2012. K. Ahn. From proximal point method to nesterov’s acceleration, 2020. Z. Allen-Zhu. Katyusha: The first direct acceleration of stochastic gradient methods. arXiv preprint arXiv:1603.05953, 2016. Z. Allen-Zhu. Katyusha: the first direct acceleration of stochastic gradient methods. In Proceedings of the 49th Annual ACM SIGACT Symposium on Theory of Computing, pages 1200–1205. ACM, 2017. Z. Allen-Zhu and E. Hazan. Optimal black-box reductions between optimization objectives. arXiv preprint arXiv:1603.05642, 2016. Y. Arjevani and O. Shamir. Dimension-free iteration complexity of finite sum optimization problems. In Advances in Neural Information Processing Systems, pages 3540–3548, 2016. P. Bartlett, O. Bousquet, and S. Mendelson. Local Rademacher complexities. The Annals of Statistics, 33:1487–1537, 2005. A. Beck and M. Teboulle. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM journal on imaging sciences, 2(1):183–202, 2009. L. Bottou. Online algorithms and stochastic approximations. 1998. URL http://leon.bottou.org/papers/bottou-98x. revised, oct 2012. 119 / 167
  115. L. Bottou. Large-scale machine learning with stochastic gradient descent. In

    Proceedings of COMPSTAT’2010, pages 177–186. Springer, 2010. L. Bottou and Y. LeCun. Large scale online learning. In S. Thrun, L. Saul, and B. Sch¨ olkopf, editors, Advances in Neural Information Processing Systems 16. MIT Press, Cambridge, MA, 2004. URL http://leon.bottou.org/papers/bottou-lecun-2004. X. Chen, Q. Lin, and J. Pena. Optimal regularized dual averaging methods for stochastic optimization. In F. Pereira, C. Burges, L. Bottou, and K. Weinberger, editors, Advances in Neural Information Processing Systems 25, pages 395–403. Curran Associates, Inc., 2012. S. Dasgupta and A. Gupta. An elementary proof of the johnson-lindenstrauss lemma. Technical Report 99–006, U.C. Berkeley, 1999. A. Defazio, F. Bach, and S. Lacoste-Julien. SAGA: A fast incremental gradient method with support for non-strongly convex composite objectives. In Z. Ghahramani, M. Welling, C. Cortes, N. Lawrence, and K. Weinberger, editors, Advances in Neural Information Processing Systems 27, pages 1646–1654. Curran Associates, Inc., 2014. J. Duchi, E. Hazan, and Y. Singer. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12:2121–2159, 2011. 119 / 167
  116. R. Frostig, R. Ge, S. Kakade, and A. Sidford. Un-regularizing:

    approximate proximal point and faster stochastic algorithms for empirical risk minimization. In International Conference on Machine Learning, pages 2540–2548, 2015. S. Ghadimi and G. Lan. Optimal stochastic approximation algorithms for strongly convex stochastic composite optimization I: A generic algorithmic framework. SIAM Journal on Optimization, 22(4):1469–1492, 2012. S. Ghadimi and G. Lan. Optimal stochastic approximation algorithms for strongly convex stochastic composite optimization, II: shrinking procedures and optimal algorithms. SIAM Journal on Optimization, 23(4):2061–2089, 2013. C. Hu, W. Pan, and J. T. Kwok. Accelerated gradient methods for stochastic optimization and online learning. In Y. Bengio, D. Schuurmans, J. Lafferty, C. Williams, and A. Culotta, editors, Advances in Neural Information Processing Systems 22, pages 781–789. Curran Associates, Inc., 2009. R. Johnson and T. Zhang. Accelerating stochastic gradient descent using predictive variance reduction. In C. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K. Weinberger, editors, Advances in Neural Information Processing Systems 26, pages 315–323. Curran Associates, Inc., 2013. W. B. Johnson, J. Lindenstrauss, and G. Schechtman. Extensions of lipschitz maps into banach spaces. Israel Journal of Mathematics, 54(2):129–138, 1986. G. Lan. An optimal method for stochastic composite optimization. Mathematical Programming, 133(1-2):365–397, 2012. 119 / 167
  117. N. Le Roux, M. Schmidt, and F. R. Bach. A

    stochastic gradient method with an exponential convergence rate for finite training sets. In F. Pereira, C. Burges, L. Bottou, and K. Weinberger, editors, Advances in Neural Information Processing Systems 25, pages 2663–2671. Curran Associates, Inc., 2012. H. Lin, J. Mairal, and Z. Harchaoui. A universal catalyst for first-order optimization. Technical report, 2015. arXiv:1506.02186. Q. Lin, Z. Lu, and L. Xiao. An accelerated proximal coordinate gradient method. In Z. Ghahramani, M. Welling, C. Cortes, N. Lawrence, and K. Weinberger, editors, Advances in Neural Information Processing Systems 27, pages 3059–3067. Curran Associates, Inc., 2014. T. Murata and T. Suzuki. Doubly accelerated stochastic variance reduced dual averaging method for regularized empirical risk minimization. In Advances in Neural Information Processing Systems 30, pages 608–617. 2017. A. Nemirovskii and D. Yudin. On cezari ʟ s convergence of the steepest descent method for approximating saddle points of convex-concave functions. Soviet Mathematics Doklady, 19(2):576–601, 1978. A. Nemirovsky and D. Yudin. Problem complexity and method efficiency in optimization. John Wiley, New York, 1983. Y. Nesterov. Introductory lectures on convex optimization: A basic course. Kluwer Academic Publishers, 2004. 119 / 167
  118. A. Nitanda. Stochastic proximal gradient descent with acceleration techniques. In

    Z. Ghahramani, M. Welling, C. Cortes, N. Lawrence, and K. Weinberger, editors, Advances in Neural Information Processing Systems 27, pages 1574–1582. Curran Associates, Inc., 2014. A. Nitanda, T. Murata, and T. Suzuki. Sharp characterization of optimal minibatch size for stochastic finite sum convex optimization. In 2019 IEEE International Conference on Data Mining (ICDM), pages 488–497, 2019. B. T. Polyak and A. B. Juditsky. Acceleration of stochastic approximation by averaging. SIAM Journal on Control and Optimization, 30(4):838–855, 1992. A. Rakhlin, O. Shamir, and K. Sridharan. Making gradient descent optimal for strongly convex stochastic optimization. In J. Langford and J. Pineau, editors, Proceedings of the 29th International Conference on Machine Learning, pages 449–456. Omnipress, 2012. ISBN 978-1-4503-1285-1. H. Robbins and S. Monro. A stochastic approximation method. The Annals of Mathematical Statistics, 22(3):400–407, 1951. F. Rosenblatt. The perceptron: A perceiving and recognizing automaton. Technical Report Technical Report 85-460-1, Project PARA, Cornell Aeronautical Lab., 1957. M. Rudelson and S. Zhou. Reconstruction from anisotropic random measurements. IEEE Transactions of Information Theory, 39, 2013. 119 / 167
  119. D. Ruppert. Efficient estimations from a slowly convergent robbins-monro process.

    Technical report, Cornell University Operations Research and Industrial Engineering, 1988. M. Schmidt, N. Le Roux, and F. R. Bach. Minimizing finite sums with the stochastic average gradient, 2013. hal-00860051. S. Shalev-Shwartz and T. Zhang. Stochastic dual coordinate ascent methods for regularized loss minimization. Journal of Machine Learning Research, 14: 567–599, 2013. Y. Singer and J. C. Duchi. Efficient learning using forward-backward splitting. In Advances in Neural Information Processing Systems, pages 495–503, 2009. I. Sutskever, J. Martens, G. Dahl, and G. Hinton. On the importance of initialization and momentum in deep learning. In Proceedings of the 30th international conference on machine learning (ICML-13), pages 1139–1147, 2013. B. E. Woodworth and N. Srebro. Tight complexity bounds for optimizing composite objectives. In D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett, editors, Advances in Neural Information Processing Systems 29, pages 3639–3647. Curran Associates, Inc., 2016. URL http://papers.nips.cc/paper/ 6058-tight-complexity-bounds-for-optimizing-composite-objectives. pdf. 119 / 167
  120. L. Xiao. Dual averaging methods for regularized stochastic learning and

    online optimization. In Advances in Neural Information Processing Systems 23. 2009. L. Xiao. Dual averaging methods for regularized stochastic learning and online optimization. Journal of Machine Learning Research, 11:2543–2596, 2010. L. Xiao and T. Zhang. A proximal stochastic gradient method with progressive variance reduction. SIAM Journal on Optimization, 24:2057–2075, 2014. Y. Zhang and X. Lin. Stochastic primal-dual coordinate method for regularized empirical risk minimization. In F. Bach and D. Blei, editors, Proceedings of the 32nd International Conference on Machine Learning, volume 37 of Proceedings of Machine Learning Research, pages 353–361, Lille, France, 07–09 Jul 2015. PMLR. URL http://proceedings.mlr.press/v37/zhanga15.html. 119 / 167