
Trends in Deep Learning Theory at NeurIPS 2019


Deep learning has been adopted in many application areas in recent years because of its high performance. At the same time, its generalization performance and learning dynamics raise many questions that existing theory cannot explain, and many studies tackling these questions were presented at NeurIPS 2019. On the application side, reports on newly recognized problems in deep learning, typified by adversarial examples, and their countermeasures also attracted attention. This presentation outlines recent research trends in deep learning theory, focusing on these two topics.

Hiroki Naganuma

March 16, 2020



Transcript

  1. ௕প େथ (Hiroki Naganuma) 1,2 1 ౦ژ޻ۀେֶ ৘ใཧ޻ֶӃ (School of

    Computing, Tokyo Institute of Technology) 2 ౦ژ޻ۀେֶ ֶज़ࠃࡍ৘ใηϯλʔ (Global Scientific Information and Computing Center, Tokyo Institute of Technology) ਂ૚ֶशͷཧ࿦ͷಈ޲@NeurIPS2019: The trend of NeurIPS2019 around the scope of Deep Learning, Especially, Generalization, Global Convergence, Learning Theory, Adversarial Example - Survey of these topics from the bird's-eye view -
  2. ɹ ஶऀ঺հ ௕প େथ (Hiroki Naganuma) ॴଐ : • ౦ژ޻ۀେֶ

    ৘ใཧ޻ֶӃ ԣాཧԝݚڀࣨ • ଙਖ਼ٛҭӳࡒஂ ୈҰظ঑ֶੜ • ೔ຊֶज़ৼڵձ ಛผݚڀһ • PhD student in Computer Science, Mila, Université de Montréal (Fall 2020) ݚڀڵຯ : • [Ԡ༻] ਂ૚ֶशͷ෼ࢄฒྻʹΑΔߴ଎Խɺߴ଎Ͱ൚Խੑೳͷߴ͍࠷దԽख๏ͷ։ൃ • [ཧ࿦] ਂ૚χϡʔϥϧωοτϫʔΫͷֶशμΠφϛΫεͷཧղ 2
  3. ɹ ͸͡Ίʹ (1/2) ஫ҙ • ຊൃදࢿྉ͸Ͱ͖Δ͔͗ΓޡΓͷͳ͍Α͏ۈΊ͓ͯΓ·͕͢ɺ࡞੒ऀࣗ਎΋ஶऀ΄Ͳͷཧղ͕े෼Ͱͳ͍ͨΊ ޡͬͨ಺༰ΛؚΉՄೳੑ͕͋Γ·͢ • NeurIPS2019ͷ࠾୒࿦จ͸1428ຊ͋Γɺͦͷத͔Βຊൃදͷείʔϓͷ಺༰Λ300ຊఔ౓؆қαʔϕΠɺ ൃදࢿྉͰ͸ͦͷத͔Βɺ࡞੒ऀ͕౰֘෼໺ͷਐలͱͯ͠஫໨ʹ஋͢Δͱ൑அͨ͠࿦จΛ25ຊఔ౓ΛϐοΫ

    Ξοϓ͠ɺτϐοΫ͝ͱʹ঺հ͢ΔͨΊɺಡऀͷํͷڵຯͷ͋Δ࿦จΛؚ·ͳ͍Մೳੑ͕͋Γ·͢ • ଟ͘ͷ࿦จΛ࣌ؒ಺ʹ঺հ͢ΔͨΊɺ਺ֶతͳݫີ͞ΑΓ΋௚ײతͳཧղΛ༏ઌͨ͠঺հΛ͠·͢ • ൃද࣌ؒͷؔ܎Ͱࢿྉ಺༰ΛҰ෦লུ͍͖ͤͯͨͩ͞·͢(εϥΠυͷࠨଆʹԫ৭ͷଳ͕͋ΔϖʔδΛলུ) ର৅ • ਂ૚ֶशʹ͍ͭͯɺ͋Δఔ౓ͷࣄલ஌͕ࣝ͋ΓɺNeurIPSʹ͓͚Δ࠷৽ݚڀͷಈ޲Λ஌Γ͍ͨํ • ࠷৽ͷ൚ԽੑೳɾେҬతऩଋੑɾֶशཧ࿦ʹ͍ͭͯͷԠ༻ख๏ΛऔΓೖΕ͍ͨํ ※ AEsɾਂ૚ֶशͷ൚Խੑೳղੳɾֶशཧ࿦Λݚڀ͞Ε͍ͯΔํ͸͢Ͱʹ஌͍ͬͯΔ಺༰͕ଟ͍ͱࢥΘΕ·͢ 3
  4. ɹ ͸͡Ίʹ (2/2) ਂ૚ֶशʹཧ࿦͸ඞཁ͔ʁ • ͱΓ͋͑ͣAdamΛ͔͍ͭͬͯΕ͹ɺͦΕͳΓʹੑೳͷग़Δֶश͕Ͱ͖Δʁ • ֶशσʔλΛେྔʹूΊΕ͹ੑೳ͕ग़ΔͷͰ໰୊ͳ͍ʁ • ΤίγεςϜ͕ἧ͍ͬͯΔͨΊɺCuda,

    C++, Make, PrototxtΛ஌Βͣͱ΋ɺGithubʹམͪͯΔίʔυΛ࢖ͬͯ Python͚ͩͰGPUΛ࢖ͬͨSOTAͳֶश͕Մೳʁ ཧ࿦ͷඞཁੑ • HyperparamΛؚΊֶͨशͷઃఆ͕Θ͔Βͳ͍ࣄʹΑΔҋӢͳࢼߦࡨޡ͸ܭࢉίετ͕ແବʹ͔͔Δ • ϒϥοΫϘοΫεతͳڍಈͰɺʮਪఆʯͱ͍͏ΑΓ͸ʮ༧ଌʯͰ͋Γɺ৴པੑɾઆ໌Մೳੑʹ๡͍͠ 4 1. ࣮༻ԽΛ͢͢ΊΔͨΊ • ෳࡶͳಈ࡞ʹ͸ɺͦΕͳΓͷཧղ͕ඞཁ • ڝ߹͢Δ௚ײΛ੔ཧ͠ɾαϙʔτ͢ΔఆཧʹΑΓ৽͍͠ಎ࡯ͱ֓೦Λಋ͘͜ͱ͕Մೳʹ 2. ମܥతͳํ๏࿦ͱཧ࿦Λཱ֬͠ɺਂ૚ֶशΛʮ࿉ۚज़ʯͰ͸ͳ͘ʮՊֶʯʹ 3. ਖ਼͍͠࢖͍ํɾଥ౰ੑͷ֬ೝ • ͦ΋ͦ΋ԿΛ΍͍ͬͯΔͷ͔ɺղ͕ಘΒΕΔอূ͸͋Δͷ͔ɺ࠷దͳख๏ͳͷ͔Λ֬ೝ͢Δʹ͸͋Δఔ౓ͷ ཧ࿦ͷཧղ͕ඞཁ
  5. ɹ NeurIPS2019ͷ࣮ݧख๏ͷαʔϕΠ (1/3) 6 ػցֶश෼໺ͷτοϓձٞʹ͓͚Δ࣮ݧͷ৴པੑ From Xavier Bouthillier, Gaël Varoquaux,

    "Survey of machine-learning experimental methods at NeurIPS2019 and ICLR2020”, Research Report Inria Saclay Ile de France • ػցֶश෼໺ʹ͓͚ΔτοϓձٞͰ͋ΔNeurIPS2019ٴͼICLR2020ʹ࠾୒͞Εͨ࿦จͷ࣮ݧํ๏ʹ͍ͭͯMila ͔ΒαʔϕΠใࠂ͕ͳ͞Ε͍ͯΔ [Xavier+, 2020] • యܕతͳ࣮ݧύϥμΠϜ͸ɺϕϯνϚʔΫσʔληοτ্ͰϞσϧͷ൚ԽޡࠩΛଌఆ͠ɺଟ͘ͷ৔߹ɺϕʔεϥ ΠϯϞσϧͱൺֱ͢Δ͜ͱͰ͋Γɺ৴པͰ͖Δ࣮ݧ͸೉͍͠ • ػցֶशݚڀʹ͓͚Δ͜ΕΒͷ՝୊ʹର͢Δೝ͕ࣝ ߴ·͓ͬͯΓɺΑΓྑ͍ϋΠύʔύϥϝʔλͷ νϡʔχϯά [Bergstra and Bengio, 2012; Hansen, 2006; Hutter et al., 2018] Λ͸͡ΊɺϞσϧੑೳͷ ੍ޚ͞Εͨ౷ܭతݕఆ [Bouckaert and Frank, 2004]ͳͲ͕ݚڀ͞Ε͍ͯΔ • [Xavier+, 2020] ʹ͓͚ΔαʔϕΠͷճ౴཰͸ NeurIPS2019͸35.6%, ICLR2020͸48.6% • NeurIPS2019Ͱ͸86.2%͕ɺICLR2020Ͱ͸96.1%͕ Empiricalͳ࣮ݧΛؚΜͩใࠂ͕ͳ͞Ε͍ͯΔ
  6. ɹ NeurIPS2019ͷ࣮ݧख๏ͷαʔϕΠ (2/3) 7 ػցֶश෼໺ͷτοϓձٞʹ͓͚Δ࣮ݧͷ৴པੑ From Xavier Bouthillier, Gaël Varoquaux,

    "Survey of machine-learning experimental methods at NeurIPS2019 and ICLR2020”, Research Report Inria Saclay Ile de France • ػցֶश෼໺ʹ͓͚ΔτοϓձٞͰ͋ΔNeurIPS2019ٴͼICLR2020ʹ࠾୒͞Εͨ࿦จͷ࣮ݧํ๏ʹ͍ͭͯMila ͔ΒαʔϕΠใࠂ͕ͳ͞Ε͍ͯΔ [Xavier+, 2020] • Empiricalͳ࿦จʹ͍ͭͯɺHyperParameterͷ࠷దԽ͸ߦ͔ͬͨͲ͏͔ʁ(ࠨਤ) • ͲͷΑ͏ͳํ๏ͰHyperParameterͷ࠷దԽΛߦͬͨͷ͔ʁ(ӈਤ)
  7. ɹ NeurIPS2019ͷ࣮ݧख๏ͷαʔϕΠ (3/3) 8 ػցֶश෼໺ͷτοϓձٞʹ͓͚Δ࣮ݧͷ৴པੑ From Xavier Bouthillier, Gaël Varoquaux,

    "Survey of machine-learning experimental methods at NeurIPS2019 and ICLR2020”, Research Report Inria Saclay Ile de France • ػցֶश෼໺ʹ͓͚ΔτοϓձٞͰ͋ΔNeurIPS2019ٴͼICLR2020ʹ࠾୒͞Εͨ࿦จͷ࣮ݧํ๏ʹ͍ͭͯMila ͔ΒαʔϕΠใࠂ͕ͳ͞Ε͍ͯΔ [Xavier+, 2020] • ࠷దԽΛߦͬͨHyperParameterͷ਺͸ʁ(ࠨਤ) • HyperParameterͷ࠷దԽʹࡍͯ͠Կ౓TrialΛߦͬͨͷ͔ʁ(ӈਤ)
  8. ɹ NeurIPS2019ͷτϨϯυτϐοΫ (1/4) 9 ࿦จʹؔ͢ΔNeurIPS2019ͷσʔλ From https://medium.com/@NeurIPSConf/what-we-learned-from-neurips-2019-data-111ab996462c • ࿦จ౤ߘ਺ 6,743

    (4,543 reviewers) • ࠾୒࿦จ਺ 1,428 (acceptance rate of 21.6%) - ޱ಄ൃද 36݅ / 164݅ 5-෼ͷεϙοτϥΠτൃද • 75% ͷ࿦จ͕ camera-ready version ·Ͱʹ code ͕ఴ෇͞Ε͍ͯͨ
  9. ɹ NeurIPS2019ͷτϨϯυτϐοΫ (3/4) 11 NeurIPS2019ͷτϨϯυ·ͱΊ • ਂ૚ֶशͷഎޙʹ͋Δཧ࿦Λ୳ٻ͢Δଟ͘ͷ࿦จ͕ൃද͞Εͨ 1. ਂ૚ֶशͷֶशμΠφϛΫεͷཧղɾBlackBoxతͳಈ࡞ͷղ໌ •

    ϕΠζਂ૚ֶश • άϥϑχϡʔϥϧωοτϫʔΫ • ತ࠷దԽ 2. ਂ૚ֶश΁ͷ৽͍͠Ξϓϩʔν 3. ػցֶश΁ͷਆܦՊֶͷಋೖ 4. ڧԽֶश 5. ৽͍͠໰୊ͱͦͷղܾࡦ • ఢରతαϯϓϧ (AEs :Adversarial Examples) • ϝλֶश • ੜ੒Ϟσϧ
  10. ɹ NeurIPS2019ͷτϨϯυτϐοΫ (4/4) 12 NeurIPS2019ͷτϨϯυ·ͱΊ • ਂ૚ֶशͷഎޙʹ͋Δཧ࿦Λ୳ٻ͢Δଟ͘ͷ࿦จ͕ൃද͞Εͨ 1. ਂ૚ֶशͷֶशμΠφϛΫεͷཧղɾBlackBoxతͳಈ࡞ͷղ໌ •

    ϕΠζਂ૚ֶश • άϥϑχϡʔϥϧωοτϫʔΫ • ತ࠷దԽ 2. ਂ૚ֶश΁ͷ৽͍͠Ξϓϩʔν 3. ػցֶश΁ͷਆܦՊֶͷಋೖ 4. ڧԽֶश 5. ৽͍͠໰୊ͱͦͷղܾࡦ • ఢରతαϯϓϧ (AEs :Adversarial Examples) ͳͲ • ϝλֶश • ੜ੒Ϟσϧ Main-Focus in This Slide Sub-Focus in This Slide
  11. ɹ ຊࢿྉͷ֓ཁ ൚ԽੑೳɾେҬతऩଋੑʹؔΘΔཧ࿦: Over-Parameterized DNN, Winning Ticket Hypothesis ͳͲ ֶशʹؔΘΔཧ࿦:

    Dynamics of SGD Training, Regularization, Normarization, Initialization, Optimizer, LR ͳͲ Adversarial Examples, Large Batch Problem ͳͲ 13 ਂ૚ֶशͷֶशμΠφϛΫεͷཧղɾBlackBoxతͳಈ࡞ͷղ໌ ৽͍͠໰୊ͱͦͷղܾࡦ
  12. ɹ ຊࢿྉͷ֓ཁ 14 ൚ԽੑೳɾେҬతऩଋੑʹؔΘΔཧ࿦: Over-Parameterized DNN, Winning Ticket Hypothesis ͳͲ

    ֶशʹؔΘΔཧ࿦: Dynamics of SGD Training, Regularization, Normarization, Initialization, Optimizer, LR ͳͲ Adversarial Examples, Large Batch Problem ͳͲ ਂ૚ֶशͷֶशμΠφϛΫεͷཧղɾBlackBoxతͳಈ࡞ͷղ໌ ৽͍͠໰୊ͱͦͷղܾࡦ
  13. ɹ ൚ԽੑೳɾେҬతऩଋੑʹؔΘΔཧ࿦ 15 എܠ ڭࢣ͋Γֶश • ೖྗͱڭࢣ৴߸͔ΒͳΔֶशࣄྫΛूΊֶͨशσʔλΛ༻͍Δ • ೖྗʹର͢ΔೝࣝϞσϧͷग़ྗ݁Ռͱڭࢣ৴߸ͱͷࠩҟΛଛࣦؔ ਺ͰධՁ

    • ֶशσʔλશମͷଛࣦؔ਺ͷฏۉΛ࠷খʹ͢ΔύϥϝʔλΛ୳ࡧ • ະ஌ͷσʔλʹର͢Δ༧ଌޡࠩͰ͋Δ൚ԽޡࠩΛ࠷খԽ͢Δ͜ͱ ͕໨ඪ ೝࣝϞσϧ (ύϥϝʔλ : ) ೖྗ ग़ྗ ڭࢣ৴߸ ଛࣦؔ਺ ޡࠩؔ਺(໨తؔ਺) : ֶशσʔλ ڭࢣ͋Γֶश (࠷খԽ໰୊) “car”
  14. ɹ ൚ԽੑೳɾେҬతऩଋੑʹؔΘΔཧ࿦ 16 എܠ ೝࣝϞσϧ (ύϥϝʔλ : ) ೖྗ ग़ྗ

    ڭࢣ৴߸ ଛࣦؔ਺ ޡࠩؔ਺(໨తؔ਺) : ֶशσʔλ ڭࢣ͋Γֶश (࠷খԽ໰୊) “car” (1/2)DNNϞσϧͷ൚Խೳྗ ڭࢣ͋Γֶश • ೖྗͱڭࢣ৴߸͔ΒͳΔֶशࣄྫΛूΊֶͨशσʔλΛ༻͍Δ • ೖྗʹର͢ΔೝࣝϞσϧͷग़ྗ݁Ռͱڭࢣ৴߸ͱͷࠩҟΛଛࣦؔ ਺ͰධՁ • ֶशσʔλશମͷଛࣦؔ਺ͷฏۉΛ࠷খʹ͢ΔύϥϝʔλΛ୳ࡧ • ະ஌ͷσʔλʹର͢Δ༧ଌޡࠩͰ͋Δ൚ԽޡࠩΛ࠷খԽ͢Δ͜ͱ ͕໨ඪ
  15. ɹ ൚ԽੑೳɾେҬతऩଋੑʹؔΘΔཧ࿦ 17 • Bias Variance Trade-off (ݹయతͳ౷ܭֶ) ࠷దԽ͞ΕΔϞσϧ Errorͷ࿨

    Variance Bias Complexity ൚Խޡࠩ: Generalization Error x Variance Bias Model Complex Model y Variance ↑ Bias ↓(approx 0) Simple Model Observed Data Prediction x y Variance ↓ (approx 0) Bias ↑ Observed Data Prediction എܠ: DNNϞσϧͷ൚Խೳྗ ݹయతͳ౷ܭֶͰ͸ɺModelͷComplexity͕ ্͕ΔͱVariance্͕͕Δ (݁ہError͕Լ͕Βͳ͍) ൚ԽޡࠩΛ Bias / Variance ʹ෼ׂͯ͠ߟ͑Δ
  16. ɹ ൚ԽੑೳɾେҬతऩଋੑʹؔΘΔཧ࿦ 18 • ࣮ࡍʹى͖͍ͯΔ͜ͱ : ߴ͍൚ԽੑೳΛୡ੒͢ΔDNNϞσϧ΄Ͳଟ͘ͷύϥϝʔλɾֶश͕࣌ؒඞཁ എܠ: DNNϞσϧͷ൚Խೳྗ طଘͷཧ࿦(Ϟσϧͷࣗ༝౓

    ≈ ύϥϝʔλ਺)Ͱ͸ɺ ݱ࣮ͷDNNͷੑೳ(ӈਤ)͕આ໌Ͱ͖ͳ͍ From "EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks", Fig. 5(ࠨ). 1(ӈ), M. Tan+, ICML2019
  17. ɹ ൚ԽੑೳɾେҬతऩଋੑʹؔΘΔཧ࿦ 19 • Ϟσϧͷࣗ༝౓(≈ Complexity) ≈ ύϥϝʔλ਺ (ͱߟ͑ΒΕ͍ͯͨɺݱ࣮ͷDNNͷੑೳͱໃ६) ∵

    Ϟσϧͷࣗ༝౓͕େ͖͗͢Δͱաద߹͠΍͍͢ • ͔͠͠ɺਂ૚ֶशʹ͓͍ͯ͸ύϥϝʔλ਺͕ଟ͘ͱ΋ɺࣗ༝౓͕ඞͣ͠΋ ૿େ͠ͳ͍ ͜ͱ͕෼͔͖ͬͯͨ എܠ: DNNϞσϧͷ൚Խೳྗ ࣗ༝౓ ύϥϝʔλ਺ Ϊϟοϓ طଘཧ࿦ ࣮ࡍͷੑೳ ਂ૚ֶश ैདྷͷػցֶशख๏ Ϟσϧͷࣗ༝౓ ≈ ύϥϝʔλ਺ Ϟσϧͷࣗ༝౓ ≈ ύϥϝʔλ਺ ैདྷͷػցֶशख๏ ਂ૚ֶश
  18. ɹ ൚ԽੑೳɾେҬతऩଋੑʹؔΘΔཧ࿦ 20 എܠ: DNNϞσϧͷ൚Խೳྗ ͷू߹ ͷू߹ ൚Խޡࠩ(ظ଴஋) ܇࿅ޡࠩ(ܦݧฏۉ) ൚ԽΪϟοϓ

    (աֶश) θͷऔΓํʹΑͬͯɺ܇࿅ޡ͕ࠩ௿͘ɺ൚Խޡ͕ࠩ ߴ͍৔߹(աֶश)͕͋ΓಘΔ θͷऔΓํʹΑΒͳ͍ɺ൚ԽΪϟοϓͷҰ༷ͳό΢ϯυ ͕ඞཁ / Ϟσϧͷࣗ༝౓(≈ Complexity) ≈ ύϥϝʔλ ਺ ͱߟ͑ΒΕ͍ͯͨཧ༝ ൚ԽΪϟοϓ • ैདྷͷϞσϧͷ൚ԽޡࠩΛධՁ͢Δࢦඪ : Ұ༷ऩଋ (ࠨਤ)ෆਖ਼֬ͳධՁྫ (ӈਤ)ਖ਼֬ͱ͞Ε͖ͯͨධՁྫ Ұ༷ऩଋͰ͸ɺʮϞσϧͷࣗ༝౓ ≈ ύϥϝʔλ਺ʯ Λઆ໌Ͱ͖ͳ͍ͨΊɺͦͷΪϟοϓΛຒΊΔݚڀ͕ NeurIPS2019Ͱൃද͞Εͨ
  19. ɹ ൚ԽੑೳɾେҬతऩଋੑʹؔΘΔཧ࿦ 21 ೝࣝϞσϧ (ύϥϝʔλ : ) ೖྗ ग़ྗ ڭࢣ৴߸

    ଛࣦؔ਺ ޡࠩؔ਺(໨తؔ਺) : ֶशσʔλ ڭࢣ͋Γֶश (࠷খԽ໰୊) “car” (2/2)DNNϞσϧֶशͷେҬతऩଋੑ എܠ ڭࢣ͋Γֶश • ೖྗͱڭࢣ৴߸͔ΒͳΔֶशࣄྫΛूΊֶͨशσʔλΛ༻͍Δ • ೖྗʹର͢ΔೝࣝϞσϧͷग़ྗ݁Ռͱڭࢣ৴߸ͱͷࠩҟΛଛࣦؔ਺ͰධՁ • ֶशσʔλશମͷଛࣦؔ਺ͷฏۉΛ࠷খʹ͢ΔύϥϝʔλΛ୳ࡧ • ະ஌ͷσʔλʹର͢Δ༧ଌޡࠩͰ͋Δ൚ԽޡࠩΛ࠷খԽ͢Δ͜ͱ͕໨ඪ
  20. ɹ ൚ԽੑೳɾେҬతऩଋੑʹؔΘΔཧ࿦ 23 • େҬղΛอূ͢ΔͨΊͷཧ࿦ݚڀ => Over Parameterized (ա৒ύϥϝλԽ) =>

    શମతʹଛࣦΛԼ͛Δ(ଛࣦ͸0ҎԼʹͳΒͳ͍ੑ࣭Λ༻͍ͯɺ ہॴ࠷దղ͕େҬత࠷దղʹͳΔΑ͏ʹ͢Δ) ա৒ύϥϝλԽ Loss Loss Parameter Parameter େҬղ େҬղ എܠ: DNNϞσϧֶशͷେҬతऩଋੑ
  21. ɹ ൚ԽੑೳɾେҬతऩଋੑʹؔΘΔཧ࿦ 24 • େҬղΛอূ͢ΔͨΊͷཧ࿦ݚڀ => Over Parameterized (ա৒ύϥϝλԽ) =>

    ฏۉ৔ཧ࿦ ΍ Neural Tangent Kernel(NTK; NeurIPS2018)ʹΑΔΞϓϩʔν͕஫໨͞Ε͍ͯΔ => ຊൃදͰ͸ޙऀ(NTK)Λ঺հ ࠨهͷDNNΛFull-Batch(ޯ഑߱Լ๏) Ͱֶशͨ͠৔߹ͷ Loss͸Լهͷ௨Γ N͸Full BatchͰ͋ΓɺݻఆͰߟ͑ΔͨΊɺ࠷খԽΛߟ͑Δ্ͰऔΓআ͍ͯ Gradient DescentͰ࠷খԽ from "Neural Tangent Kernel: Convergence and Generalization in Neural Networks", A. Jacot+, NeurIPS 2018 1. NTK͸ॏΈมԽʹΑΔؔ਺ͷมԽʹண໨ 2. ա৒ύϥϝλԽ͞Εͨঢ়گͰ͸ॏΈ͸΄΅มԽͤͣఆ਺ͱΈͳͤΔ 3. 2Λ༻͍ͯDNNϞσϧΛઢܗۙࣅ͠ޯ഑๏ͷେҬతऩଋੑΛอূ എܠ: DNNϞσϧֶशͷେҬతऩଋੑ
  22. ॏΈมԽͷఆٛ = ɹ ൚ԽੑೳɾେҬతऩଋੑʹؔΘΔཧ࿦ 25 • େҬղΛอূ͢ΔͨΊͷཧ࿦ݚڀ => Over Parameterized

    (ա৒ύϥϝλԽ) => Neural Tangent Kernelͷߟ͑ํΛ঺հ ॏΈͷมԽw͕େ͖͍΄Ͳ มԽ͕খ͍͞ ॏΈͷมԽ(Լهͷఆٛ)͸ width m͕େ͖͘ͳΔ΄Ͳ ͱͯ΋খ͘͞ͳΔ LossͷมԽͱมԽྔͷҟͳΔwʹ͓͚Δൺֱ / from https://rajatvd.github.io/NTK/ 2. ա৒ύϥϝλԽ͞Εͨঢ়گͰ͸ॏΈ͸΄΅มԽ ͤͣఆ਺ͱΈͳͤΔ m×mͷߦྻͷҟͳΔw(m=10,100,1000)ʹ͓͚Δൺֱ / from https://rajatvd.github.io/NTK/ എܠ: DNNϞσϧֶशͷେҬతऩଋੑ Animation Started
  23. ɹ ൚ԽੑೳɾେҬతऩଋੑʹؔΘΔཧ࿦ 26 • ͳͥɺw͕େ͖͍ͱॏΈ͕΄ͱΜͲมԽ͠ͳ͍͔ʁ => Taylorల։Λ༻͍ͯઆ໌ ඇઢܗؔ਺ͷNNΛҰ࣍ۙࣅʹΑΓwͷઢܗؔ਺Ͱදݱ -> ग़ྗ

    yʹؔ͢ΔϕΫτϧදهʹ͢Δ • େҬղΛอূ͢ΔͨΊͷཧ࿦ݚڀ => Over Parameterized (ա৒ύϥϝλԽ) => Neural Tangent Kernelͷߟ͑ํΛ঺հ ग़ྗyͷϕΫτϧදهʹؔ͢ΔߦྻαΠζ͸Լهͷ௨Γ ఆ਺ ఆ਺ ݁ՌɺॏΈwʹͷΈґଘ͢Δ ઢܗϞσϧʹͳΔ ࠷খೋ৐๏ଛࣦͷ࠷খԽ͕ ୯७ͳઢܗճؼͱͯ͠هड़Մೳ ఆ ਺ from https://rajatvd.github.io/NTK/ എܠ: DNNϞσϧֶशͷେҬతऩଋੑ
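The Taylor-expansion argument on this slide can be checked numerically. Below is a minimal PyTorch sketch, not taken from the cited papers: the two-layer ReLU network, the width m, and the random inputs are illustrative assumptions. It compares the true output of the perturbed network with the first-order prediction f(w0) + J(w0)(w - w0).

```python
import torch
from torch.autograd.functional import jvp

torch.manual_seed(0)
m, d = 4096, 5                      # width and input dimension (illustrative)
x = torch.randn(8, d)               # a small batch of inputs

def f(w1, w2):
    # One-hidden-layer ReLU net with the 1/sqrt(m) output scaling used in NTK analyses.
    return torch.relu(x @ w1.t()) @ w2.t() / m ** 0.5

w1_0, w2_0 = torch.randn(m, d), torch.randn(1, m)              # initialization w0
dw1, dw2 = 1e-2 * torch.randn(m, d), 1e-2 * torch.randn(1, m)  # stand-in for (w - w0)

# First-order Taylor expansion around w0: f_lin(w0 + dw) = f(w0) + J(w0) dw,
# where J(w0) dw is a Jacobian-vector product.
y0, jvp_out = jvp(f, (w1_0, w2_0), (dw1, dw2))
y_lin = y0 + jvp_out

y_true = f(w1_0 + dw1, w2_0 + dw2)   # exact output of the perturbed network
print((y_true - y_lin).abs().max())  # gap shrinks as the width m grows
```

Increasing the width makes the linearized model an increasingly accurate description of the network around its initialization, which is the regime the NTK results rely on.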
  24. ͳͥ Feature mapʹΑͬͯઢܗϞσϧԽͰ͖Δͷ͔ => ϠίϏΞϯ( )ͷมԽྔͰ͸ͳ͘ɺϠίϏΞϯͷ૬ରతมԽྔΛධՁ͢Δ͜ͱͰઆ໌ ɹ ൚ԽੑೳɾେҬతऩଋੑʹؔΘΔཧ࿦ 27 •

    ͳͥɺw͕େ͖͍ͱॏΈ͕΄ͱΜͲมԽ͠ͳ͍͔ʁ => Taylorల։Λ༻͍ͯઆ໌ ඇઢܗؔ਺ͷNNΛҰ࣍ۙࣅʹΑΓwͷઢܗؔ਺ͰදݱͰ͖ͨͷͰɺ͜ͷઢܗۙࣅͷอূ৚͕݅஌Γ͍ͨ • େҬղΛอূ͢ΔͨΊͷཧ࿦ݚڀ => Over Parameterized (ա৒ύϥϝλԽ) => Neural Tangent Kernelͷߟ͑ํΛ঺հ ޯ഑ܭࢉʹؔͯ͠͸ɺೖྗxʹରͯ͠ઢܗԋࢉͰ͸ͳ͍ Feature map ͱͯ͠ఆٛ͞ΕΔ ॳظԽ࣌ͷϞσϧग़ྗͷޯ഑ʹΑͬͯઢܗϞσϧԽͰ͖Δʁ Ծఆʹ͓͚ΔϞσϧग़ྗ(ॳظԽ࣌ͷϞσϧग़ྗ)ͱ࣮ࡍͷग़ྗͷࠩ ॏΈʹؔ͢Δޯ഑ͷมԽྔ ॏΈۭؒʹ͓͚Δڑ཭dͷఆٛ = = ϔοηߦྻ ϠίϏΞϯͷϊϧϜ ϠίϏΞϯͷ૬ରతมԽ཰rͷఆٛ = = ϠίϏΞϯͷ૬ରతมԽྔ = d x r = ϠίϏΞϯͷ૬ରมԽྔΛκ(w0)ͱ͠ ͜ΕΛ࠷খԽ͍ͨ͠ from https://rajatvd.github.io/NTK/ എܠ: DNNϞσϧֶशͷେҬతऩଋੑ
  25. ɹ ൚ԽੑೳɾେҬతऩଋੑʹؔΘΔཧ࿦ 28 • ͳͥɺw͕େ͖͍ͱॏΈ͕΄ͱΜͲมԽ͠ͳ͍͔ʁ => ϠίϏΞϯͷ૬ରมԽ཰͕খ͍͞ => ޯ഑͕มԽ͠ͳ͍ =>

    ઢܗ • େҬղΛอূ͢ΔͨΊͷཧ࿦ݚڀ => Over Parameterized (ա৒ύϥϝλԽ) => Neural Tangent Kernelͷߟ͑ํΛ঺հ ඇઢܗؔ਺ͷNNΛҰ࣍ۙࣅʹΑΓwͷઢܗؔ਺ͰදݱͰ͖ͨͷͰɺ͜ͷઢܗۙࣅͷอূ͕ඞཁ ূ໌͸1-layerͷ৔߹Ͱ͢Β͔ͳΓෳࡶͳͷͰׂѪ͠·͕͢ɺm→∞Ͱκ(w0)→0 ͱͳΔ ※ ͜Ε͕੒Γཱͭͷ͸ɺLeCunͷॳظԽ͕ඞਢͰ͋ΓɺϨΠϠʔ͕૿͑ͯ΋ಉ͜͡ͱ͕੒Γཱͭ ϠίϏΞϯͷ૬ରతมԽྔ = d x r = ϠίϏΞϯͷ૬ରมԽྔΛκ(w0)ͱ͠ ͜ΕΛ࠷খԽ͍ͨ͠ from https://rajatvd.github.io/NTK/ ͜ΕΒͷ݁Ռ͔Βɺे෼ʹʚ(y(w₀)−ȳ)‖ ͷมԽΛ༠ൃ͢ΔͨΊ(ֶशΛߦ͏ͨΊ)ͷwͷมԽྔ͸ɺ ϠίϏΞϯ㲆wy(w)ʹ΄ͱΜͲӨڹΛ༩͑ͳ͍ఔʹඍখͰ͋Δ͜ͱ͕Θ͔Δ എܠ: DNNϞσϧֶशͷେҬతऩଋੑ
  26. ɹ ൚ԽੑೳɾେҬతऩଋੑʹؔΘΔཧ࿦ 29 • େҬղΛอূ͢ΔͨΊͷཧ࿦ݚڀ => Over Parameterized (ա৒ύϥϝλԽ) =>

    Neural Tangent Kernelͷߟ͑ํΛ঺հ • [L. Chiza+, 2019] ΛྫʹɺઢܗۙࣅͷอূΛ ϠίϏΞϯͷ૬ରతมԽྔͷ؍఺͔Β؆୯ʹ঺հ ϠίϏΞϯͷ૬ରతมԽྔ = d x r = ॏΈͷॳظԽ࣌ʹy(w_0)=0ͱͳΔΑ͏ʹ͢Δͱ ͜ͷͱ͖ɺ୯७ʹЋˠ∞ʹ͢ΔͱД_α(w_0) → 0 ॏཁͳͷ͸ɺ͜Ε͕ͲͷΑ͏ͳඇઢܗϞσϧͰ΋੒Γཱͭ͜ͱ ࠓɺyʹରͯ͠modelͷग़ྗΛЋഒͨ࣌͠Λߟ͑Δͱ 1-D Example from https://rajatvd.github.io/NTK/ എܠ: DNNϞσϧֶशͷେҬతऩଋੑ
  27. ɹ ൚ԽੑೳɾେҬతऩଋੑʹؔΘΔཧ࿦ 30 • େҬղΛอূ͢ΔͨΊͷཧ࿦ݚڀ => Over Parameterized (ա৒ύϥϝλԽ) =>

    Neural Tangent Kernelͷߟ͑ํΛ঺հ from https://rajatvd.github.io/NTK/ Gradient Descent࣌ͷTraining Dynamicsʹ͍ͭͯ ֶश཰Ж͕े෼ʹখ͍͞ͱͯ͠, ্ࣜΛඍ෼ͷ༗ݶࠩ෼ۙࣅͱݟཱͯͯԼࣜʹมܗ → ॏΈͷ࣌ؒൃలͱ͍͏ଊ͑ํ ্هΛɺϞσϧग़ྗ yͷdynamicsʹม׵͢Δͱ ͜ͷ੺͍෦෼ΛNeural Tangent Kernel(NTK)ͱ͍͏ → H(w0) શσʔλͷFeature Map ʹରԠ͢Δ಺ੵʹͳͬͯΔ ࣜมܗ • Ͱ͸ɺNeural Tangent Kernel (NTK)ͱ͸Կ͔ʁ ۙࣅ എܠ: DNNϞσϧֶशͷେҬతऩଋੑ
  28. ɹ ൚ԽੑೳɾେҬతऩଋੑʹؔΘΔཧ࿦ 31 • େҬղΛอূ͢ΔͨΊͷཧ࿦ݚڀ => Over Parameterized (ա৒ύϥϝλԽ) =>

    Neural Tangent Kernelͷߟ͑ํΛ঺հ from https://rajatvd.github.io/NTK/ ͜ͷ੺͍෦෼ΛNeural Tangent Kernel(NTK)ͱ͍͏ → H(w0) • Ͱ͸ɺNeural Tangent Kernel (NTK)ͱ͸Կ͔ʁ Ϟσϧ͕े෼ʹઢܗۙࣅʹͰ͖͍ͯΔ৔߹(κ(w0)→0)ɺModel ग़ྗͷϠίϏΞϯͷ෦෼͸΄΅ఆ਺ͳͷͰ ্هͷΑ͏ͳઢܗৗඍ෼ํఔࣜʹม׵Ͱ͖Δ ͱ͓͘ͱ ͜ΕΛղ͘ͱ Over-Parameterizedͳ৔߹, H(w_0)͸ৗʹਖ਼ఆ஋ߦྻʹͳΔͨΊɺH(w_0)ͷશͯͷ࠷খݻ༗஋͕ਖ਼Ͱ͋ΔͨΊݮਰ͢Δ ∴ scale α→∞ͷ࣌ɺઢܗۙࣅ͕อূ͞ΕͲͷΑ͏ͳඇઢܗϞσϧͰ͋ͬͯ΋ඞͣऩଋ͢Δ എܠ: DNNϞσϧֶशͷେҬతऩଋੑ
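As a companion to the H(w0) definition on this slide, here is a minimal PyTorch sketch (an illustration, not the construction of the cited papers) that builds the empirical NTK Gram matrix from the output Jacobian and runs the resulting linear dynamics in output space; the toy network and data are assumptions.

```python
import torch
from torch.autograd.functional import jacobian

torch.manual_seed(0)
n, d, m = 16, 3, 2048                       # samples, input dim, width (illustrative)
x, y_target = torch.randn(n, d), torch.randn(n, 1)
w1_0, w2_0 = torch.randn(m, d), torch.randn(1, m)

def f(w1, w2):
    return torch.relu(x @ w1.t()) @ w2.t() / m ** 0.5   # outputs on all n inputs

# Jacobian of the outputs w.r.t. all parameters, flattened per sample.
J1, J2 = jacobian(f, (w1_0, w2_0))
J = torch.cat([J1.reshape(n, -1), J2.reshape(n, -1)], dim=1)
H = J @ J.t()                               # empirical NTK Gram matrix H(w0), n x n

# If the Jacobian stays (nearly) constant, the outputs follow the linear ODE
#   dy/dt = -eta * H(w0) (y - y_target), discretized below.
eta = 1.0 / torch.linalg.eigvalsh(H).max()  # step size inside the stability region
y = f(w1_0, w2_0)
for _ in range(300):
    y = y - eta * H @ (y - y_target)
print((y - y_target).norm())                # decays because H(w0) is positive definite here
```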
  29. ɹ ൚ԽੑೳɾେҬతऩଋੑʹؔΘΔཧ࿦ 32 • Asymmetric Valleys: Beyond Sharp and Flat

    Local Minima => SGD͕ଞͷOptimizerʹൺ΂൚Խ͢Δཧ༝Λɺଛࣦؔ਺ͷlandscape͔Βٞ࿦ • One ticket to win them all: generalizing lottery ticket initializations across datasets and optimizers => ൚Խੑೳͷߴ͍ղʹߦ͖΍͍͢ॳظ஋Ͱ͋Δwining ticketͷଘࡏΛ࣮ݧతʹࣔͨ͠ • Uniform convergence may be unable to explain generalization in deep learning =>ݹయతͳҰ༷ऩଋͷߟ͑Ͱ͸൚ԽޡࠩΛධՁͰ͖ͳ͍͜ͱΛओு ࿦จϦετ(1/5) ※྘Ͱࣔͨ͠࿦จͷΈൃදͰ͸঺հ͠·͢
  30. ɹ ൚ԽੑೳɾେҬతऩଋੑʹؔΘΔཧ࿦ 33 Asymmetric Valleys: Beyond Sharp and Flat Local

    Minima Better Parameter Space SharpMinimum : f(x*)࠷খͱͳΔ࠷దղx*ͷۙ๣x*+δʹ͓ ͍ͯɺf(x*+δ)ͷ஋͕ٸ଎ʹ૿Ճ͢ΔΑ͏ ͳ΋ͷͱఆٛɺ㲆^2f(x)ͷݻ༗஋ͷҰ͕ͭ ඇৗʹେ͖͍͜ͱʹಛ௃෇͚ΒΕΔ ൚Խੑೳͷ௿͍ͱ͞ΕΔ࠷దղ FlatMinimum : xͷൺֱతେ͖ͳۙ๣Ͱf(x*+δ)͕ Ώͬ͘ΓͱมԽ͢Δ΋ͷͱͯ͠ఆ ٛɺ㲆^2f(x)ͷଟ਺ͷখ͞ͳݻ༗஋Λ ༗͢Δ͜ͱͰಛ௃෇͚ΒΕΔ ൚Խੑೳͷߴ͍ͱ͞ΕΔ ࠷దղ DNNʹ͓͚Δଛࣦؔ਺ Flat MinimumͱSharp Minimumͷ֓೦ਤ Loss • DNNͷֶश͸Ұൠతʹඇತ࠷దԽͰ͋Δ͕ɺSGDΛ༻ֶ͍ͨशʹ͓͍ͯྑ޷ͳղ͕ݟ͔ͭΔ͜ͱ͕ܦݧతʹ஌ΒΕ͍ͯΔ • SGDͷֶश͕ɺFlat Minimaͱݺ͹ΕΔղ΁ͷऩଋ͕ଅ͞ΕΔ͜ͱ͕ɺྑ޷ͳ൚Խੑೳͱ૬͍ؔͯ͠Δͱਪଌ[1] • ͔͠͠ͳ͕ΒɺFlat MinimaͱSharp Minima ͸ Reparameterization ʹΑΓม׵Մೳͱ͍͏ࢦఠ΋[2] [ݚڀഎܠ] [Contribution] • ݱ୅ͷDNNͷଛࣦؔ਺͸ɺFlat MinimaͱSharp Minima ͚ͩͰ͸ͳ͘ɺ Assymmetric Valleys͕ଘࡏ͢Δ͜ͱΛࣔͨ͠ [1], On large-batch training for deep learning: Generalization gap and sharp minima. ICLR, 2017 [2], Sharp minima can generalize for deep nets. ICML, 2017 01
  31. ɹ ൚ԽੑೳɾେҬతऩଋੑʹؔΘΔཧ࿦ 34 Asymmetric Valleys: Beyond Sharp and Flat Local

    Minima [Asymmetric Minimumͷଘࡏ] from "Asymmetric Valleys: Beyond Sharp and Flat Local Minima", Fig. 1, H. He+, NeurIPS 2019 • Asymmetric Minimum: ํ޲͕ඇରশͳ࠷খղ(ӈਤ) • ଛࣦ͸ยଆͰ͸଎͘ɺ൓ରଆͰ͸ Ώͬ͘Γͱ૿Ճ • ܦݧLossͱظ଴ଛࣦͷؒʹϥϯμϜ ͳγϑτ͕͋Δ৔߹ɺ˒Ͱࣔͨ͠ղ ͷํ͕ظ଴ଛࣦ͕খ͘͞ͳΔ => ภͬͨղ˒ͷํ͕ҰൠԽ͠΍͍͢ • ղ͕ہॴղͷதͰ΋Ͳ͜ʹҐஔ͢Δ ͔͕ඇৗʹॏཁ • ઌݧతͳҰൠԽޡ͕ࠩ࠷΋খ͍͞ղ ͕ඞͣ͠΋ܦݧଛࣦͷ࠷খԽͰ͋Δ ͱ͸ݶΒͳ͍
  32. ɹ ൚ԽੑೳɾେҬతऩଋੑʹؔΘΔཧ࿦ 35 Asymmetric Valleys: Beyond Sharp and Flat Local

    Minima [Biased SolutionͷఏҊ] from "Asymmetric Valleys: Beyond Sharp and Flat Local Minima", Fig. 1(্). 3(ࠨԼ). 4(ӈԼ), H. He+, NeurIPS 2019 ˒Ͱࣔͨ͠ղ(Biased Solution)ͷํ͕ظ଴ଛࣦ͕খ͘͞ͳΔͱఏҊ => ภͬͨղ˒ͷํ͕൚Խ͠΍͍͢(ӈਤ) ্هΛɺԼهೋͭͷԾఆʹ͓͍ͯূ໌ 1. ܦݧLossͱظ଴Lossͷؒʹγϑτ͕ଘࡏ͢Δͱ͍͏ҰൠతͳԾఆ(ࠨԼɺதԝԼਤ) 2. ํ޲ϕΫτϧu^iʹ͓͍ͯظ଴Lossͷղw^͕ඇରশͷ৔߹ɺͦͷۙ๣Ͱ΋ඇରশͱͳΔ (ӈԼਤ) ্هೋͭͷԾఆΛɺܦݧతʹධՁ
  33. ɹ ൚ԽੑೳɾେҬతऩଋੑʹؔΘΔཧ࿦ 36 Asymmetric Valleys: Beyond Sharp and Flat Local

    Minima [Biased Solutionͷ࣮૷] from "Asymmetric Valleys: Beyond Sharp and Flat Local Minima", Fig. 1, H. He+, NeurIPS 2019 ˒Ͱࣔͨ͠ղ(Biased Solution)ͷํ͕ظ଴ଛࣦ͕খ͘͞ͳΔ => ภͬͨղ˒ͷํ͕ҰൠԽ͠΍͍͢ => ฏۉԽʹΑΓBiased Solution˒΁ऩଋͰ͖Δ Bias leads to better generalization ͕ܦݧଛࣦ࠷খղɺ ͕flatͳํ޲΁ͷόΠΞε w* c0 SGD averaging generates a bias ͕ܦݧଛࣦ࠷খղฏۉɺ ͕flatͳํ޲΁ͷόΠΞε ¯ w c0
  34. ɹ ൚ԽੑೳɾେҬతऩଋੑʹؔΘΔཧ࿦ 37 Asymmetric Valleys: Beyond Sharp and Flat Local

    Minima [࣮ݧ݁Ռͱ·ͱΊ] from "Asymmetric Valleys: Beyond Sharp and Flat Local Minima", Fig. 5(ࠨ). 7(தԝ্). 8(ӈ)Table. 1(Լ), H. He+, NeurIPS 2019 ଛࣦؔ਺ͷղͷฏۉԽ͸ɺطʹSWA[Pavel Izmailov+, 2018]͕ఏҊ͞Ε͓ͯΓɺख๏ͷਖ਼౰ੑΛཧ࿦ɾ࣮ݧతʹࣔͨ͠(Լਤ)
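The averaging that the paper relates to biased solutions can be sketched as follows; this is a minimal tail-averaging loop in the spirit of SWA, with a toy model, random data, and an arbitrary averaging start step as assumptions (real SWA would also re-estimate BatchNorm statistics).

```python
import copy
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Sequential(nn.Linear(10, 64), nn.ReLU(), nn.Linear(64, 1))
avg_model = copy.deepcopy(model)            # will hold the running average of the SGD iterates
opt = torch.optim.SGD(model.parameters(), lr=0.05)
x, y = torch.randn(256, 10), torch.randn(256, 1)

n_avg = 0
for step in range(500):
    loss = nn.functional.mse_loss(model(x), y)
    opt.zero_grad(); loss.backward(); opt.step()
    if step >= 300:                         # start averaging once SGD bounces around a minimum
        n_avg += 1
        with torch.no_grad():
            for p_avg, p in zip(avg_model.parameters(), model.parameters()):
                p_avg.mul_(1 - 1 / n_avg).add_(p, alpha=1 / n_avg)   # running mean of iterates

# avg_model now holds the averaged weights, which sit off the empirical minimizer,
# shifted toward the flat side of an asymmetric valley relative to the last iterate.
```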
  35. ɹ ൚ԽੑೳɾେҬతऩଋੑʹؔΘΔཧ࿦ 38 One ticket to win them all: generalizing

    lottery ticket initializations across datasets and optimizers • The Lottery Ticket Hypothesis: DNNϞσϧѹॖʹؔ͢ΔԾઆ[1; ICLR2019 Best Paper] => ֶशࡁΈͷDNNʹ͸ɺݩͷDNNͷੑೳʹඖఢ͢Δ෦෼ωοτϫʔΫ(The Lottery Ticket)͕ଘࡏ͢Δͱ͢Δઆ => NN͕ద੾ʹॳظԽ͞Ε͍ͯΔݶΓɺখͯ͘͞εύʔεԽ͞ΕͨNNΛ܇࿅Ͱ͖Δ͜ͱΛࣔࠦ • ͔͠͠ͳ͕Βɺ෦෼ωοτϫʔΫ(The Lottery Ticket)ͷॳظԽΛൃݟ͢Δ͜ͱ͕ܭࢉྔతʹࠔ೉ͱ͍͏໰୊ => One-shot PruningΑΓIterative Pruningͷํ͕ੑೳ͕ߴ͍͕ɺܭࢉྔ͕ଟ͍ [ݚڀഎܠ] [Contribution] • ॳظԽͷ୳ࡧΛߦΘͳͯ͘ࡁΉΑ͏ɺ༷ʑͳDataSetͱOptimizerʹΘͨͬͯಉ͡෦෼ωοτϫʔΫ(The Lottery Ticket) ͷॳظԽ͕࠶ར༻Մೳ͔Ͳ͏͔Λݕূ => ը૾ܥͷλεΫʹ͓͍ͯɺେ͖ͳσʔληοτͰੜ੒ͨ͠The Lottery TicketͷॳظԽ͕ଞͷը૾σʔληοτɾ Optimizerʹ͓͍ͯ൚Խ͢Δ݁Ռͱͳͬͨ [1], The Lottery Ticket Hypothesis: Training Pruned Neural Networks. ICLR, 2019 PruningલͷNN Original Network Sub-Network 02
  36. from "One ticket to win them all: generalizing lottery ticket

    initializations across datasets and optimizers”, Fig. 3, A. Morcos+, NeurIPS 2019 • ը૾ܥͷλεΫʹ͓͍ͯɺେ͖ͳσʔληοτͰੜ੒ͨ͠The Lottery Ticket ͷॳظԽ͕ଞͷը૾σʔληοτʹ͓͍ͯ΋൚Խ͢Δ݁Ռʹ(্ه͸VGG19, ͜ΕҎ֎ʹResNet50Ͱ΋࣮ݧ) ɹ ൚ԽੑೳɾେҬతऩଋੑʹؔΘΔཧ࿦ 39 One ticket to win them all: generalizing lottery ticket initializations across datasets and optimizers [࣮ݧ݁Ռ1/2]
  37. from "One ticket to win them all: generalizing lottery ticket

    initializations across datasets and optimizers ", Fig. 5, A. Morcos+, NeurIPS 2019 ɹ ൚ԽੑೳɾେҬతऩଋੑʹؔΘΔཧ࿦ 40 One ticket to win them all: generalizing lottery ticket initializations across datasets and optimizers [࣮ݧ݁Ռ2/2, ·ͱΊ] • ը૾ܥͷλεΫʹ͓͍ͯɺେ͖ͳσʔληοτͰੜ੒ͨ͠The Lottery TicketͷॳظԽ͕ଞͷOptimizerʹ͓͍ͯ΋൚Խ • े෼ʹେ͖ͳσʔληοτʹΑͬͯੜ੒͞ΕͨThe Lottery TicketͷॳظԽ͕ɺΑΓ޿͘NNʹڞ௨͢ΔؼೲతόΠΞε ΛؚΜͰ͍Δ͜ͱΛࣔࠦ
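For reference, the basic lottery-ticket procedure these transfer experiments build on (train, prune by magnitude, rewind the surviving weights to their initial values, retrain) can be sketched as follows; the tiny linear model, the random data, and the 80% pruning ratio are illustrative assumptions, and the cross-dataset transfer studied in the paper is not reproduced here.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Linear(20, 2)
init_state = {k: v.clone() for k, v in model.state_dict().items()}  # keep w0
x, y = torch.randn(512, 20), torch.randint(0, 2, (512,))

def train(model, mask=None, steps=300):
    opt = torch.optim.SGD(model.parameters(), lr=0.1)
    for _ in range(steps):
        loss = nn.functional.cross_entropy(model(x), y)
        opt.zero_grad(); loss.backward()
        if mask is not None:
            model.weight.grad *= mask       # keep pruned weights frozen at zero
        opt.step()

train(model)                                # 1) train the dense network

# 2) prune the smallest-magnitude weights (here 80% sparsity)
w = model.weight.detach().abs()
threshold = w.flatten().kthvalue(int(0.8 * w.numel())).values
mask = (w > threshold).float()

# 3) rewind the surviving weights to their original initialization (the "ticket")
model.load_state_dict(init_state)
with torch.no_grad():
    model.weight *= mask

train(model, mask)                          # 4) retrain the sparse ticket
```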
  38. ɹ ൚ԽੑೳɾେҬతऩଋੑʹؔΘΔཧ࿦ 41 Uniform convergence may be unable to explain

    generalization in deep learning from "Uniform convergence may be unable to explain generalization in deep learning", J. Nagarajan+, NeurIPS 2019 • DNN͸ඇৗʹେ͖͍ύϥϝʔλ਺Ͱ΋ɺ൚Խ͢Δ͜ͱ͕஌ΒΕ͍ͯΔ • ཧ࿦తͳݚڀͰ͸ɺDNNͷҰൠԽΪϟοϓͷ্ݶ஋Λಋ͍͖͕ͯͨɺͦͷBound͸্ਤͷΑ͏ͳଆ໘(໰୊)͕͋Δ => ͜ΕΒͷڥք஋ͷଟ͘͸ɺҰ༷ऩଋͷߟ͑ํʹج͍͍ͮͯΔ͕ɺҰ༷ऩଋʹجͮ͘ํ޲ੑͰ͸ղܾͰ͖ͳ͍ͱࣔࠦ [֓ཁ] 03 طଘݚڀʹ͓͚Δಋग़͞ΕͨBoundͷΠϝʔδ
  39. ɹ ൚ԽੑೳɾେҬతऩଋੑʹؔΘΔཧ࿦ 42 Uniform convergence may be unable to explain

    generalization in deep learning from "Uniform convergence may be unable to explain generalization in deep learning", J. Nagarajan+, NeurIPS 2019 • VC࣍ݩͷྫʹ͓͍ͯɺඪ४తͳҰ༷ऩଋBound͸ɺDNNͰදݱՄೳͳؔ਺ͷΫϥεશମͷදݱෳࡶ౓ΛଌΔ => DNNͷύϥϝʔλ਺͸ଟ͘ɺ෼ࢠ͕େ͖͘ͳΓ͗͢ΔͨΊɺTightͳBound͕ಘΒΕͳ͍ • Ծઆ΍ू߹ۭؒΛݶఆ͠ɺΞϧΰϦζϜͱσʔλ෼෍ʹؔ࿈͢Δ΋ͷͷΈʹয఺Λ౰ͯɺΑΓTightͳBoundͷಋग़͕ ظ଴͞ΕΔ => ͔͠͠ɺͦΕΒͷ஋͸ґવͱେ͖͍͔ɺখ͍͞৔߹͸ѹॖ͞Εͨ৔߹ͳͲͷमਖ਼͞ΕͨNNʹ޲͚ͯͷ΋ͷͰ͋Δ [ݚڀഎܠ] Boundಋग़ͷ༷ʑͳํ๏ͱؔ࿈ݚڀ
  40. ɹ ൚ԽੑೳɾେҬతऩଋੑʹؔΘΔཧ࿦ 43 Uniform convergence may be unable to explain

    generalization in deep learning from "Uniform convergence may be unable to explain generalization in deep learning", Fig. 1, J. Nagarajan+, NeurIPS 2019 • ςετޡֶ͕ࠩशσʔλαΠζΛେ͖͘͢Δͱݮগ͢Δ • ൚ԽΪϟοϓBound͸NNͷॏΈϊϧϜʹґଘ͢ΔͨΊɺ͜ͷॏΈϊϧϜ͸ֶशσʔλαΠζ͕૿Ճ͢Δ΄Ͳ૿Ճ͢Δ => ֶशσʔλαΠζ͕૿Ճ͢Δͱ൚ԽΪϟοϓBound͕૿Ճ͢Δ [Contribution1]
  41. ɹ ൚ԽੑೳɾେҬతऩଋੑʹؔΘΔཧ࿦ 44 Uniform convergence may be unable to explain

    generalization in deep learning from "Uniform convergence may be unable to explain generalization in deep learning", J. Nagarajan+, NeurIPS 2019 • ͋ΒΏΔछྨͷચ࿅͞ΕͨҰ༷ऩଋ(U.C; Uniform Convergence)BoundΛؚΊɺ͋ΒΏΔҰ༷ऩଋBound͸ɺ ਂ૚ֶशͷ͋Δঢ়گʹ͓͍ͯ͸ɺҰൠԽΛઆ໌͢Δ͜ͱ͕Ͱ͖ͳ͍͜ͱΛࣔ͢ • ͦͷঢ়گʹ͓͍ͯ͸ɺ൚ԽΪϟοϓ͕খ͔ͬͨ͞ͱͯ͠΋ɺBound͕ҙຯΛͳ͞ͳ͍ • ূ໌ͷॏཁͳཁૉ => ࠷΋TightͳҰ༷ऩଋBound͕ɺ࠷ऴతʹ͸Vacuous(ແҙຯ)Ͱ͋Δ͜ͱΛࣔ͢ [Contribution2] ߟྀ͢Δඞཁͷ ͳ͍Ծઆू߹͸ શͯআ֎͢Ε͹ ࠷΋Tight Boundಋग़ʹؔͯ͠ͷ Ծઆू߹ݶఆͷ֓೦ਤ
  42. ɹ ൚ԽੑೳɾେҬతऩଋੑʹؔΘΔཧ࿦ 45 Uniform convergence may be unable to explain

    generalization in deep learning from "Uniform convergence may be unable to explain generalization in deep learning", J. Nagarajan+, NeurIPS 2019 • 1000࣍ݩͷ2ͭͷҰ༷ͳ௒ٿ෼෍Λ༩͑ΒΕͨ2஋෼ྨλεΫʹ͓͍ͯɺ෼཭͢ΔܾఆڥքΛֶश͢Δ͜ͱΛߟ͑Δ • ֶशσʔλ͕૿͑Δ΄ͲɺNN(1-hidden ReLU, 100K unit)ͷҰൠԽ͕վળ • ࠷΋TightͳҰ༷ऩଋͷࣦഊΛࣔ͢ɺ2ͭͷॏཁͳεςοϓ 1. ࢵͷ௒ٿ෼෍ঢ়ͷσʔλ఺ΛΦϨϯδʹProjection͠ɺσʔλϥϕϧΛೖΕସ͑ͨ(݁ہɺࢵͷσʔλ఺͕ݮΓɺΦϨϯ δͷσʔλ఺͕૿͑Δ)༗ޮͳσʔληοτS’Λ༻ҙ(ࠨਤ) 2. ςετޡࠩͱτϨʔχϯάޡ͕ࠩඇৗʹখ͍͞৔߹Ͱ΋ɺσʔληοτS’͕׬શʹޡ෼ྨ͞Ε͍ͯΔ͜ͱΛ࣮ূ(ӈਤ) [࣮ݧ] ڥք໘ͷֶश͸ σʔλϙΠϯτ ͝ͱͷ࿪ΈΛه Աͯ͠͠·͏͜ ͱ͕໰୊
  43. ɹ ൚ԽੑೳɾେҬతऩଋੑʹؔΘΔཧ࿦ 46 Uniform convergence may be unable to explain

    generalization in deep learning from "Uniform convergence may be unable to explain generalization in deep learning", Fig. 1(ࠨ), J. Nagarajan+, NeurIPS 2019 • ਂ૚ֶशʹ͓͚Δ൚ԽΛ׬શʹઆ໌͢ΔͨΊͷҰ༷ऩଋBoundͷ༗ޮੑʹٙ໰Λএ͑ͨ • ֶशσʔλαΠζ͕૿Ճ͢Δͱ൚ԽΪϟοϓBound͕૿Ճ͢Δ͜ͱΛࣔͨ͠(ࠨਤ) • ͢΂ͯͷҰ༷ऩଋڥք͕Vacuous(ແҙຯ)ʹͳΔઃఆͷଘࡏΛࣔͨ͠(ӈਤ) [݁࿦]
  44. ɹ ຊࢿྉͷ֓ཁ 47 ൚ԽੑೳɾେҬతऩଋੑʹؔΘΔཧ࿦: Over-Parameterized DNN, Winning Ticket Hypothesis ͳͲ

    ֶशʹؔΘΔཧ࿦: Dynamics of SGD Training, Regularization, Normarization, Initialization, Optimizer, LR ͳͲ Adversarial Examples, Large Batch Problem ͳͲ ਂ૚ֶशͷֶशμΠφϛΫεͷཧղɾBlackBoxతͳಈ࡞ͷղ໌ ৽͍͠໰୊ͱͦͷղܾࡦ ICLR2019ͰBestPaperʹͳͬͨWinning Ticket HypothesisͷτϐοΫ΍ େҬతऩଋੑΛߟ͑Δ্ͰUniform ConvergenceΛ࠶ߟ͢Δͱ͍ͬͨτϐοΫ͕ OverParameterizedͳঢ়گΛԾఆͨ͠৔߹ͷ൚Խੑೳղੳɺ ͓Αͼɺͦͷঢ়گͰ༻͍Δ͜ͱ͕Ͱ͖ΔNTKͰͷେҬతऩଋੑΛ༻͍ͯݚڀ͞Εͨ ͜ͷষͷ·ͱΊ
  45. ɹ ຊࢿྉͷ֓ཁ ൚ԽੑೳɾେҬతऩଋੑʹؔΘΔཧ࿦: Over-Parameterized DNN, Winning Ticket Hypothesis ͳͲ ֶशʹؔΘΔཧ࿦:

    Dynamics of SGD Training, Regularization, Normarization, Initialization, Optimizer, LR ͳͲ Adversarial Examples, Large Batch Problem ͳͲ 48 ਂ૚ֶशͷֶशμΠφϛΫεͷཧղɾBlackBoxతͳಈ࡞ͷղ໌ ৽͍͠໰୊ͱͦͷղܾࡦ
  46. ɹ ֶशʹؔΘΔཧ࿦ 49 ֶशʹؔΘΔཧ࿦ͷࡉ෼Խ Training Dynamics of DNN Optimizer Characteristics

    Initialization Learning RateɾMomentumɾDataAugmentation RegularizationɾNormalization Training Loop Initialization Mini-Batch SGD Training w/ Regularization
  47. ɹ ֶशʹؔΘΔཧ࿦ 50 ࿦จϦετ(2/5) • Wide Neural Networks of Any

    Depth Evolve as Linear Models Under Gradient Descent • SGD dynamics for two-layer neural networks in the teacher-student setup • Which Algorithmic Choices Matter at Which Batch Sizes? Insights From a Noisy Quadratic Model • Fast Convergence of Natural Gradient Descent for Over-Parameterized Neural Networks • Lookahead Optimizer: k steps forward, 1 step back ֶशμΠφϛΫε Optimizerಛੑ • MetaInit: Initializing learning by learning to initialize • How to Initialize your Network? Robust Initialization for WeightNorm & ResNets ॳظԽ ※྘Ͱࣔͨ͠࿦จͷΈൃදͰ͸঺հ͠·͢
  48. ɹ ֶशʹؔΘΔཧ࿦ 51 ࿦จϦετ(3/5) • Painless Stochastic Gradient: Interpolation, Line-Search,

    and Convergence Rates • Using Statistics to Automate Stochastic Optimization • Stagewise Training Accelerates Convergence of Testing Error Over SGD • Control Batch Size and Learning Rate to Generalize Well: Theoretical and Empirical Evidence • When does label smoothing help? • Understanding the Role of Momentum in Stochastic Gradient Methods ֶश཰ɾMomentumɾDataAugmentation • Understanding and Improving Layer Normalization • Time Matters in Regularizing Deep Networks: Weight Decay and Data Augmentation Affect Early Learning Dynamics, Matter Little Near Convergence • Towards Explaining the Regularization Effect of Initial Large Learning Rate in Training Neural Networks ਖ਼نԽɾਖ਼ଇԽ ※྘Ͱࣔͨ͠࿦จͷΈൃදͰ͸঺հ͠·͢
  49. ɹ ֶशʹؔΘΔཧ࿦ 52 ֶशʹؔΘΔཧ࿦ͷࡉ෼Խ Training Dynamics of DNN Optimizer Characteristics

    Initialization Learning RateɾMomentumɾDataAugmentation RegularizationɾNormalization Training Loop Initialization Mini-Batch SGD Training w/ Regularization
  50. ɹ ֶशʹؔΘΔཧ࿦ [ֶशμΠφϛΫε] 53 from "Wide Neural Networks of Any

    Depth Evolve as Linear Models Under Gradient Descent", J. Lee+, NeurIPS 2019 • DNNͷޯ഑߱Լ๏ʹΑΔֶशμΠφϛΫεͷཧղ͸ࠔ೉ => ෯ͷ޿͍NNͰ͸ɺֶशμΠφϛΫε͕͔ͳΓ୯७Խ͞ΕΔ => ແݶ෯Ͱ͸ɺॳظύϥϝʔλΛத৺ͱͨ͠NNͷҰ࣍ςΠϥʔల։͔ΒಘΒΕΔઢܗϞσϧʹΑͬͯࢧ഑͞ΕΔ • NNͷ෯͕ແݶେʹۙͮ͘ʹͭΕɺNNͷग़ྗ͕Ψ΢εաఔʹऩଋ͢Δ • ͜ΕΒͷཧ࿦తͳ݁Ռ͸ɺແݶͷ෯ͷݶքʹ͓͍ͯͷΈਖ਼֬ => ༗ݶ෯ͷNNʹ͓͍ͯ΋ɺݩͷNNͷ༧ଌ஋ͱઢܗԽ͞ΕͨNNͷ༧ଌ஋ͱͷؒͷҰகΛ֬ೝ Wide Neural Networks of Any Depth Evolve as Linear Models Under Gradient Descent [֓ཁ] 04 ਂ૚NNͷແݶ෯ͰͷΨ΢εաఔ΁ͷ઴ۙͷ֓೦ਤ
  51. ɹ ֶशʹؔΘΔཧ࿦ [ֶशμΠφϛΫε] 54 A. ύϥϝʔλۭؒͷμΠφϛΫεɿ ύϥϝʔλۭؒʹ͓͚Δ෯ͷ޿͍ωοτϫʔΫͷֶशμΠφϛΫε͸ɺ͢΂ͯͷωοτϫʔΫύϥϝʔλͷू߹͕ ΞϑΟϯܥͰ͋ΔϞσϧͷֶशμΠφϛΫεͱ౳Ձ B. ઢܗԽͷͨΊͷे෼ͳ৚݅ɿ

    ᮢ஋ͱͳΔֶश཰ ͕ଘࡏɺͦͷᮢ஋ΑΓ΋খֶ͍͞श཰Λ࣋ͭNNͷޯ഑߱ԼʹΑΔֶश͸ɺ େ͖ͳ෯ͷઢܗԽʹΑͬͯे෼ʹۙࣅ͞ΕΔ(ӈਤ) ηcritical [Contribution 1/2] Wide Neural Networks of Any Depth Evolve as Linear Models Under Gradient Descent from "Wide Neural Networks of Any Depth Evolve as Linear Models Under Gradient Descent", Fig. 4ʢӈ), J. Lee+, NeurIPS 2019
  52. C. ग़ྗ෼෍ͷμΠφϛΫεɿ ޯ഑߱Լ๏ʹΑΔֶशதͷNNͷ༧ଌ͸ɺ෯͕ແݶେʹۙͮ͘ͱGP(Gaussian Process)ʹऩଋ͢Δ͜ͱΛࣔͨ͠(ࠨਤ) => ֶशதͷ͜ͷGPͷਐԽͷ࣌ؒґଘੑͷදݱΛ໌ࣔతʹಋग़ ɹ ֶशʹؔΘΔཧ࿦ [ֶशμΠφϛΫε] 55

    from "Wide Neural Networks of Any Depth Evolve as Linear Models Under Gradient Descent", Fig. 2(ࠨ), J. Lee+, NeurIPS 2019 [Contribution 2/2] Wide Neural Networks of Any Depth Evolve as Linear Models Under Gradient Descent NNGP: Nearest Neighbor Gaussian Processes
  53. ɹ ֶशʹؔΘΔཧ࿦ [ֶशμΠφϛΫε] 56 Dynamics of stochastic gradient descent for

    two-layer neural networks in the teacher-student setup [֓ཁ] • ೋ૚ͷNN2ͭΛTeacher-Studentͷ໰୊ઃఆͱͯ͠ѻ͏ ೖྗ͸ಉ͡ɺTeacher͸ֶशࡁΈϞσϧͰɺStudent͕ Teacherͷೖྗʹؔ͢Δग़ྗΛֶश • ͦΕͧΕͷNNͷॏΈΛʮۙ͞ͷई౓ʯͰਤΓɺ OverParameterizedͳNNֶशͷμΠφϛΫεΛݚڀ Student Teacher [Contribution] • ΦϯϥΠϯֶशͷ֬཰తޯ഑߱Լ๏͕ODEʹ઴ۙ͢Δ͜ ͱΛOverParameterizedͳঢ়گͰূ໌ • ׆ੑԽؔ਺ͷҧ͍ʹΑͬͯ SGD ͕ٻΊΔղ͕ҟͳΔ͜ͱ Λࣔࠦɺ·ͨɺ൚ԽੑೳͷΞϧΰϦζϜɺϞσϧΞʔΩς Ϋνϟɺσʔληοτͷ૬ޓ࡞༻΁ͷґଘΛࣔͨ͠ from "Dynamics of stochastic gradient descent for two-layer neural networks in the teacher-student setup", Fig. 1, S. Goldt+, NeurIPS 2019 05 Teacher Student Setupͷ֓೦ਤ
  54. ɹ ֶशʹؔΘΔཧ࿦ [ֶशμΠφϛΫε] 57 Dynamics of stochastic gradient descent for

    two-layer neural networks in the teacher-student setup •Teacher-Studentܗࣜͷೋ૚ͷNNΛ༻͍ͨ໰୊ઃఆ •Student͸SGDʹΑֶͬͯशͤ͞ɺTeacherͱStudentͷதؒ૚ͷϊʔυͷݸ਺Λม࣮͑ͯݧ •M = teacherͷதؒ૚ͷϊʔυ਺, K = studentͷதؒ૚ͷϊʔυ਺ => L = K - M > 0 ͱͯ͠ L → ∞ ͰOver-Parameterization Λߟ͑Δ [࣮ݧઃఆ] ϵg (mμ) = ϵg (Rμ, Qμ, T, v*, vμ) where Qμ ik ≡ wμ i wμ k N , Rμ kn ≡ wμ k w* n N , Tnm ≡ w* n w* m N ͱஔ͘ ҰൠԽޡࠩΛԼهΑ͏ʹఆٛ { x : ೖྗ, ϕ(x, θ) : Ϟσϧͷग़ྗ θ : studentͷύϥϝʔλ܈, θ* : teacherͷύϥϝʔλ܈ ϵg (θ, θ*) ≡ x|θ,θ* 1 2 [ϕ(x, θ) − ϕ(x, θ*)]2 มܗ Teacher Student from "Dynamics of stochastic gradient descent for two-layer neural networks in the teacher-student setup", Fig. 1, S. Goldt+, NeurIPS 2019
  55. ɹ ֶशʹؔΘΔཧ࿦ [ֶशμΠφϛΫε] 58 Dynamics of stochastic gradient descent for

    two-layer neural networks in the teacher-student setup [࣮ݧ݁Ռ/1Layer໨ͷΈΛֶशͨ͠৔߹, SCM:Soft Committee Machine] ※ μ : εςοϓ਺, N : ೖྗͷ࣍ݩ Λ૿Ճͤͨ͞ࡍͷҰൠԽޡࠩͷਪҠ L => K=4(=M)ͷͱ͖࠷΋Teacherʹ͍ۙ => ͕େ͖͘ͳΕ͹Teacher͔Β཭ΕΔ L M=2,K=5ͷӅΕ૚ͷղऍਤ StudentͷMݸͷϊʔυ͕ TeacherͷKݸͷϊʔυͱରԠ =>ෆඞཁͳLͷ෦෼ͷӨڹʹ ΑΓ൚Խੑೳ͕Լ͕Δ ֶशͷܦա ͱҰൠԽޡࠩͷؔ܎ α ≡ μ N from "Dynamics of stochastic gradient descent for two-layer neural networks in the teacher-student setup", Fig. 2, S. Goldt+, NeurIPS 2019
  56. ɹ ֶशʹؔΘΔཧ࿦ [ֶशμΠφϛΫε] 59 Dynamics of stochastic gradient descent for

    two-layer neural networks in the teacher-student setup [࣮ݧ݁Ռ/1Layer, 2LayerΛ྆ํֶशͨ͠৔߹] ※Ϟσϧͷ׆ੑԽ͸Sigmoid 3ͭͷֶशํ๏Ͱͷ ͱҰൠԽޡࠩͷؔ܎ K => ྆૚Λֶशͤ͞Ε͹ɺKΛେ͖ͯ͘͠΋ Teacherʹۙͮ͘ M=2,K=5ͷӅΕ૚ͷղऍਤ StudentͷKݸͷϊʔυશ͕ͯTeacherͷ ϊʔυͷ໾ׂʹରԠ => ҰൠԽޡ͕ࠩখ͘͞ͳΔ ྆૚ͷֶश ͱҰൠԽޡࠩͷؔ܎ Z ≡ K M SCM͸ୈ1૚ͷΈͷֶशɺBoth͸྆૚ͷֶशɺ Normalized͸SCM+ਖ਼ଇԽ => NormalizedͰ͸ɺୈ1૚Λ܇࿅͠ɺୈ2૚ Λݻఆɺޡ͕ࠩݮগ ※Ϟσϧͷ׆ੑԽ͸ReLU [࣮ݧ݁Ռ/3छͷֶश๏ൺֱ] ൚ԽΛཧղ͢Δʹ͸ɺ࠷దԽͱΞʔΩςΫνϟͷ૬ޓ࡞༻Λཧղ͢Δඞཁ͕͋Δ from "Dynamics of stochastic gradient descent for two-layer neural networks in the teacher-student setup", Fig. 3(ࠨ). 4(ӈ), S. Goldt+, NeurIPS 2019
  57. ɹ ֶशʹؔΘΔཧ࿦ 60 ֶशʹؔΘΔཧ࿦ͷࡉ෼Խ Training Dynamics of DNN Optimizer Characteristics

    Initialization Learning RateɾMomentumɾDataAugmentation RegularizationɾNormalization Training Loop Initialization Mini-Batch SGD Training w/ Regularization
  58. ɹ ֶशʹؔΘΔཧ࿦ [Optimizerಛੑ] 61 [֓ཁ] • ࠷దԽख๏ͷҧ͍ʹΑΔΫϦςΟΧϧόοναΠζ*ͷมԽΛௐ΂ͨ •ௐࠪର৅͸ɺSGD, momentum SGD,

    Adam, K-FAC, ࢦ਺ฏ׈Ҡಈ ฏۉ๏ʢEMAʣ •ௐࠪํ๏͸ɺNQM (Noisy Quadratic ModelʣʹΑΔղੳͱ େن໛ͳ࣮ݧ • AdamͱK-FACͰ͸ɺଞͷख๏ͱൺֱͯ͠ΫϦςΟΧϧόοναΠ ζ͕͸Δ͔ʹେ͖͘ͳΔ • NQM͕ɺ࠷దԽख๏ʹؔͯ͠ɺΫϦςΟΧϧόοναΠζΛ༧ଌ͢ Δͷʹ༗༻ͳπʔϧͰ͋Δ͜ͱΛࣔͨ͠ *ΫϦςΟΧϧόοναΠζɿ όοναΠζ૿Ճʹ൐͏ֶश཰ͷ૿Ճ->൓෮ճ਺ͷ࡟ݮͷݶքαΠζ Which Algorithmic Choices Matter at Which Batch Sizes? Insights From a Noisy Quadratic Model from "Measuring the Effects of Data Parallelism on Neural Network Training“, Fig. 1, C. J. Shallue+, NeurIPS 2019 06 ResNet-50 ImageNet-1Kʹ͓͚Δ
 BSͱඞཁ൓෮਺ͷؔ܎
  59. ɹ ֶशʹؔΘΔཧ࿦ [Optimizerಛੑ] 66 Which Algorithmic Choices Matter at Which

    Batch Sizes? Insights From a Noisy Quadratic Model [ݚڀഎܠ / ϥʔδόονֶशͱ͸]
  60. ɹ ֶशʹؔΘΔཧ࿦ [Optimizerಛੑ] 67 Which Algorithmic Choices Matter at Which

    Batch Sizes? Insights From a Noisy Quadratic Model [ݚڀഎܠ / ϥʔδόονֶशͱ͸]
  61. ɹ ֶशʹؔΘΔཧ࿦ [Optimizerಛੑ] 68 Which Algorithmic Choices Matter at Which

    Batch Sizes? Insights From a Noisy Quadratic Model [ݚڀഎܠ / ϥʔδόονֶशͱ͸]
  62. ɹ ֶशʹؔΘΔཧ࿦ [Optimizerಛੑ] 69 Which Algorithmic Choices Matter at Which

    Batch Sizes? Insights From a Noisy Quadratic Model [ݚڀഎܠ / ϥʔδόονֶश = ಉظܕσʔλฒྻ෼ࢄਂ૚ֶश] e.g. batch size = 1 e.g. batch size = 3 ϥʔδόονֶशͱεϞʔϧόονֶशͷ֓೦ਤ
  63. ɹ ֶशʹؔΘΔཧ࿦ [Optimizerಛੑ] 70 Which Algorithmic Choices Matter at Which

    Batch Sizes? Insights From a Noisy Quadratic Model [ݚڀഎܠ / ϥʔδόονֶश = ಉظܕσʔλฒྻ෼ࢄਂ૚ֶश] Linear Scaling Rule όοναΠζΛnഒʹ͢Δͱֶश཰΋nഒ͢Δɺֶश཰ௐ੔๏ େ͖ͳόοναΠζΛ࢖༻͢Δ৔߹ʹޮ཰తͳऩଋ࣌ؒΛ࣮ ݱ͢ΔͨΊʹҰൠʹ༻͍ΒΕΔ [P. Goyal+, 2017] Linear Scaling Rule ͷ֓೦ਤ εϞʔϧόονֶशͱϥʔδόονֶशͷҧ͍ Grad(θ1,x1)
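A minimal sketch of the linear scaling rule described on this slide, with an optional warmup as in [P. Goyal+, 2017]; the base recipe and warmup length are illustrative assumptions.

```python
def linear_scaling_lr(base_lr, base_batch_size, batch_size, step, warmup_steps=0):
    """Scale the learning rate by batch_size / base_batch_size, with an
    optional linear warmup over the first warmup_steps steps."""
    target_lr = base_lr * batch_size / base_batch_size
    if step < warmup_steps:
        return target_lr * (step + 1) / warmup_steps
    return target_lr

# A recipe tuned at batch size 256 with lr 0.1, scaled up to batch size 8192:
print(linear_scaling_lr(0.1, 256, 8192, step=0, warmup_steps=500))     # still warming up
print(linear_scaling_lr(0.1, 256, 8192, step=1000, warmup_steps=500))  # 3.2
```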
  64. ɹ ֶशʹؔΘΔཧ࿦ [Optimizerಛੑ] 71 Which Algorithmic Choices Matter at Which

    Batch Sizes? Insights From a Noisy Quadratic Model [ݚڀഎܠ / ϥʔδόονֶशͷ໰୊఺]
  65. ɹ Theory of training [Optimizer characteristics] 72 The NQM analysis assumes the loss is a convex quadratic with gradient noise; the quadratic form is assumed to be diagonal with the optimum at the origin, so the cost function and the stochastic

    gradient are as follows (the paper states and justifies several further assumptions, some explicit and some implicit). [NQM (Noisy Quadratic Model)] Which Algorithmic Choices Matter at Which Batch Sizes? Insights From a Noisy Quadratic Model. $L(\theta) = \frac{1}{2}\theta^{\top} H \theta = \sum_{i=1}^{d} \frac{1}{2} h_i \theta_i^{2}$ with per-dimension loss $l(\theta_i) \triangleq \frac{1}{2} h_i \theta_i^{2}$, and $g_B(\theta) = H\theta + \epsilon_B$, $\mathrm{Cov}(\epsilon_B) = C/B$ ($\theta$: parameters, $H$: Hessian, $\epsilon_B$: gradient noise, $C$: covariance matrix, $B$: batch size). Running SGD on dimension $i$, the expected loss after $t$ steps is $\mathbb{E}[l(\theta_i(t))] = (1-\alpha h_i)^{2t}\,\mathbb{E}[l(\theta_i(0))] + \bigl(1-(1-\alpha h_i)^{2t}\bigr)\,\frac{\alpha c_i}{2B(2-\alpha h_i)}$, where the first factor sets the convergence rate and the second term is the steady-state risk. Each dimension converges exponentially to its steady-state risk, so a higher learning rate improves the convergence rate but raises the steady-state risk; the two are in a trade-off.
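The SGD recursion above can be simulated directly. Below is a minimal NumPy sketch of the noisy quadratic model under two learning rates; the spectrum h, the noise covariance c, the batch size, and the step counts are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
d, B, T = 100, 32, 2000
h = np.logspace(-2, 0, d)               # curvature (diagonal Hessian eigenvalues)
c = h.copy()                            # diagonal of the gradient-noise covariance C
theta0 = rng.normal(size=d)

for alpha in (0.1, 1.0):                # small vs large learning rate
    theta = theta0.copy()
    for _ in range(T):
        grad = h * theta + rng.normal(size=d) * np.sqrt(c / B)   # g_B = H*theta + eps_B
        theta = theta - alpha * grad
    loss = 0.5 * np.sum(h * theta ** 2)
    steady = np.sum(alpha * c / (2 * B * (2 - alpha * h)))       # total steady-state risk
    print(f"alpha={alpha}: loss after {T} steps = {loss:.4f}, steady-state risk = {steady:.4f}")
# The larger learning rate converges faster per step but settles at a higher risk floor.
```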
  66. ɹ ֶशʹؔΘΔཧ࿦ [Optimizerಛੑ] 73 Momentum SGDͰͷ࣍ݩ ʹ͓͚Δظ଴ଛࣦ͸ҎԼͷΑ͏ʹͳΔ i ͕ͨͬͯ͠ҎԼͷ͜ͱ͕Θ͔Δ [Momentum

    SGD] Which Algorithmic Choices Matter at Which Batch Sizes? Insights From a Noisy Quadratic Model β : ׳ੑύϥϝʔλ, r1 , r2 : x2 − (1 − αhi + β)x + β = 0ͷղ (r1 ≥ r2 ) [l(θi (t))] ≤ ( (rt+1 1 − rt+1 2 ) − β(rt 1 − rt 2 ) r1 − r2 )2 [l(θi (0))] + α 1 − β ci 2B(2 − αhi 1 + β ) • εϞʔϧόονֶशʢֶश཰͕খ͍͞ʣ৔߹͸Convergence RateͱSteady State Risk ͕௨ৗͷSGDͱ΄΅มΘΒͳ͍ • ௨ৗͷSGDΑΓֶश཰ͷ্ݶ͕ߴ͍ =>ϥʔδόονֶशͰͷߴ଎Խ͕ݟࠐΊɺΫϦςΟΧϧόοναΠζͷ૿Ճ΋ظ଴Ͱ͖Δ Momentum SGDͰͷߋ৽نଇ͸Լهͷͱ͓Γɺ
  67. ɹ ֶशʹؔΘΔཧ࿦ [Optimizerಛੑ] 74 Adam΍K-FACͳͲ͸લॲཧ෇͖SGDͱΈͳ͢͜ͱ͕Ͱ͖Δ PreconditionerΛ ͱͯ͠ɺͦͷߋ৽ࣜ͸Լهͷͱ͓Γ P = Hp

    (0 ≤ p ≤ 1) Steady State RiskΛݮগͤ͞ΔҰͭͷํ๏͸ɺόοναΠζͷ૿Ճ => όοναΠζ͕େ͖͍΄ͲΑΓڧྗͳલॲཧ෇͖ख๏Λ࢖༻͢Δ͜ͱͷར఺Λαϙʔτ [લॲཧ෇͖SGD / Preconditioned Gradient Descent Methods (such as Adam, K-FAC) ] Which Algorithmic Choices Matter at Which Batch Sizes? Insights From a Noisy Quadratic Model [l(θi (t))] ≈ 1 − αh1−p i )2t [l(θi (0))] + αci h−p i 2B(2 − αh1−p i ) ࣍ݩ ʹ͓͚Δظ଴ϦεΫ͸ҎԼ௨Γɺͨͩ͠ill-conditioned loss surfaceΛԾఆ͢ΔͱɺSteady State Risk͸ӈه i pʹؔͯ͠୯ௐ૿Ճͷؔ਺
  68. ɹ ֶशʹؔΘΔཧ࿦ [Optimizerಛੑ] 75 EMAͷߋ৽ࣜٴͼ࣍ݩ ʹ͓͚Δظ଴ϦεΫ͸ҎԼͷΑ͏ʹͳΔ i ฏۉԽ܎਺ Λద੾ʹઃఆͯ͠ ͱ͢Δ͜ͱͰConvergence

    RateΛ௿Լͤ͞Δ͜ͱͳ͘ Steady State RiskΛ௿ݮՄೳ => ֶशͷ൓෮਺ͷ࡟ݮ͕ՄೳͱͳΔ γ r1 > r2 [ࢦ਺ฏ׈Ҡಈฏۉ๏ (EMA)] Which Algorithmic Choices Matter at Which Batch Sizes? Insights From a Noisy Quadratic Model γ : ฏۉԽ܎਺ (0 ≤ γ < 1), r1 = 1 − αhi , r2 = γ [l(θi (t))] ≈ ( (rt+1 1 − rt+1 2 ) − γ(1 − αhi )(rt 1 − rt 2 ) r1 − r2 )2 [l(θi (0))] + αci 2B(2 − αhi ) (1 − γ)(1 + (1 − αhi )γ) (1 + γ)(1 − (1 − αhi )γ) θ(t + 1) ← θ(t) − αgB (θ(t)) ˜ θ(t + 1) ← γ˜ θ(t) + (1 − γ) θ(t + 1)
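A minimal NumPy sketch of the EMA update on this slide, run on the same noisy quadratic setup; gamma, alpha, and the problem sizes are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
d, B, T, alpha, gamma = 100, 32, 5000, 0.3, 0.99
h = np.logspace(-2, 0, d)
c = h.copy()
theta = rng.normal(size=d)
theta_ema = theta.copy()

for _ in range(T):
    grad = h * theta + rng.normal(size=d) * np.sqrt(c / B)
    theta = theta - alpha * grad
    theta_ema = gamma * theta_ema + (1 - gamma) * theta   # theta_tilde <- gamma*theta_tilde + (1-gamma)*theta

print(0.5 * np.sum(h * theta ** 2), 0.5 * np.sum(h * theta_ema ** 2))
# The averaged iterate typically reaches a lower steady-state loss without slowing convergence.
```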
  69. ɹ ֶशʹؔΘΔཧ࿦ [Optimizerಛੑ] 76 NQMʹΑΔ༧ଌ͕࣮ࡍʹอ࣋͞Ε͍ͯΔ͔Λɺ5ͭͷNNΞʔΩςΫνϟΛ༻͍ͯௐࠪ [࣮ݧઃఆ] ࢖༻ͨ͠Ϟσϧ͸ҎԼͷ௨Γ ΦϓςΟϚΠβʔ͸ɺSGDɺMomentum SGDɺAdam (Momentum͋Γɺͳ͠ʣɺK-FACʢMomentum͋Γɺͳ͠ʣ

    ໨ඪਫ਼౓ʹ౸ୡ͢Δ·Ͱͷεςοϓ਺Λଌఆ Which Algorithmic Choices Matter at Which Batch Sizes? Insights From a Noisy Quadratic Model ·ͨɺEMAͱSGDͷൺֱ΋ߦͬͨ from "Measuring the Effects of Data Parallelism on Neural Network Training“, Table. 1, C. J. Shallue+, NeurIPS 2019
  70. ɹ ֶशʹؔΘΔཧ࿦ [Optimizerಛੑ] 77 [࣮ݧ݁Ռ] Which Algorithmic Choices Matter at

    Which Batch Sizes? Insights From a Noisy Quadratic Model ֤Optimizerʹ͓͚ΔόοναΠζͱऩଋʹඞཁͳεςοϓ਺ͷؔ܎ EMAͷޮՌʹ͍ͭͯ from "Measuring the Effects of Data Parallelism on Neural Network Training“, Fig. 5(ࠨ). 6(ӈ), C. J. Shallue+, NeurIPS 2019
  71. ɹ ֶशʹؔΘΔཧ࿦ [Optimizerಛੑ] 78 Fast Convergence of Natural Gradient Descent

    for Over- Parameterized Neural Networks • ඇઢܗNNͷֶशʹ͓͍ͯɺॳΊͯࣗવޯ഑߱Լ๏(NGD)ͷେҬతͳऩଋอূΛͨ͠ • ಉֶ͡श཰Ͱ͋Ε͹ɺࣗવޯ഑๏ͷํ͕ޯ഑߱Լ๏ΑΓ΋O(λmin(G))ഒ଎͘ऩଋ͢Δ ※G͸σʔλʹґଘ͢ΔάϥϜߦྻ => ࣗવޯ഑๏ͷ৔߹ɺΑΓେ͖ͳεςοϓαΠζΛ༻͍Δ͜ͱ͕ՄೳͱͳΓɺ݁Ռͱͯ͠ऩଋ཰͕͞Βʹ޲্ • ࣗવޯ഑๏ͷۙࣅख๏K-FAC [Martens and Grosse, 2015] ʹ͓͍ͯ΋Linear RateͰେҬ࠷খ஋ʹऩଋ͢Δ => ͔͠͠ɺ͜ͷ݁Ռ͸GD΍ݫີNGDͱൺֱͯ͠ɺߋͳΔOver-ParameterizedԽΛඞཁ • NGDͷ൚ԽੑೳΛղੳ͠ɺऩଋϨʔτͷ޲্͸ҰൠԽͷѱԽΛ٘ਜ਼ʹ͠ͳ͍͜ͱΛࣔͨ͠ from "Fast Convergence of Natural Gradient Descent for Over-Parameterized Neural Networks”, Fig. 1, G. Zhang+, NeurIPS 2019 [֓ཁ] 07
  72. ɹ ֶशʹؔΘΔཧ࿦ [Optimizerಛੑ] 79 Fast Convergence of Natural Gradient Descent

    for Over- Parameterized Neural Networks From Matt Johnson, Daniel Duckworth, K- FAC and Natural Gradients, 2017 [ݚڀഎܠ] ࣗવޯ഑๏ • ৘ใزԿʹجͮ͘[S. Amari,1998]ʹΑΓఏҊ͞Εͨ࠷దԽख๏ • ϑΟογϟʔ৘ใߦྻΛϦʔϚϯܭྔ(= ۂ཰ߦྻ)ͱͯ͠༻͍Δ ਂ૚ֶशʹ͓͚Δࣗવޯ഑๏ͷϝϦοτ ۂ཰ͷ৘ใΛ༻͍ͯਖ਼͍͠ํ޲΁ͷߋ৽͕ظ଴͞ΕΔ => SGD(ͷվྑख๏)ͱൺ΂ɺগͳ͍൓෮਺Ͱऩଋ͢Δ ↓ϑΟογϟʔ৘ใߦྻ ࣗવޯ഑๏(NGD) ޯ഑๏ SGD NGD ֬཰తޯ഑߱Լ๏(SGD) ໨తؔ਺ͷޯ഑ˠ ೋ৐ޡࠩͷଛࣦؔ਺ F͸ɺ΄ͱΜͲͷ৔߹ ҰൠԽΨ΢εɾχϡʔτϯ ߦྻͱ౳Ձ(H=IͱԾఆ) ※ J͸ϠίϏߦྻ Λߟ͑Δ NGDͱSGDͷTrajectoryͷΠϝʔδਤ
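A minimal PyTorch sketch of a single damped natural-gradient step for a squared-error model, using the Gauss-Newton form of the Fisher matrix mentioned on this slide (H = I); the tiny tanh model, the data, the damping, and the full step length are illustrative assumptions rather than the setup of the cited papers.

```python
import torch
from torch.autograd.functional import jacobian

torch.manual_seed(0)
n, d = 32, 5
x, y = torch.randn(n, d), torch.randn(n, 1)
w = 0.3 * torch.randn(d, 1)

def outputs(w):
    return torch.tanh(x @ w)                 # model predictions u(w)

J = jacobian(outputs, w).reshape(n, -1)      # n x d Jacobian of the outputs
residual = outputs(w) - y
grad = J.t() @ residual / n                  # gradient of the (half) mean squared error
F = J.t() @ J / n + 1e-3 * torch.eye(d)      # damped Fisher (Gauss-Newton) matrix
step = torch.linalg.solve(F, grad)           # natural-gradient direction F^{-1} g

w_ngd = w - step
print(((outputs(w) - y) ** 2).mean(), ((outputs(w_ngd) - y) ** 2).mean())
# The single preconditioned step typically reduces the error far more than a plain
# gradient step of comparable size, which is the curvature-aware behavior sketched above.
```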
  73. ɹ ֶशʹؔΘΔཧ࿦ [Optimizerಛੑ] 80 Fast Convergence of Natural Gradient Descent

    for Over- Parameterized Neural Networks [Key Idea : ࣗવޯ഑ͷग़ྗۭؒͰͷޯ഑΁ͷۙࣅ] ա৒ύϥϝʔλԽ͞ΕͨNNͰ͸ɺFisher৘ใߦྻ( )͸ਖ਼ଇͰ͸ͳ͍ͨΊɺٖࣅٯߦྻ Λ࢖༻ n × m, n ≪ n F† ͕ਖ਼ఆஔͰ͋ΔͱԾఆ͢Δͱɺࣗવޯ഑͸ग़ྗۭؒͷޯ഑ͱͳΔ G(k) = J(k)J(k)⊤ ※ ͸NNͷ༧ଌग़ྗ u(k) ࣗવޯ഑๏ʹΑΔύϥϝʔλߋ৽ ͕ఆ਺ͱԾఆ͢Δͱɺࣗવޯ഑๏͸ग़ྗۭؒͷޯ഑߱Լ๏ͱΈͳͤΔ J(k)
  74. ɹ ֶशʹؔΘΔཧ࿦ [Optimizerಛੑ] 81 Fast Convergence of Natural Gradient Descent

    for Over- Parameterized Neural Networks [ࣗવޯ഑๏ͷऩଋղੳ] ্هͷ2ͭͷڧ͍৚݅ԼͰԼهͷ͜ͱ͕ݴ͑Δ • ઢܗ଎౓ͰେҬऩଋ • ෯͕ແݶେʹͳΔʹͭΕֶͯश཰ η→1 • ΫϩεΤϯτϩϐʔଛࣦͷΑ͏ͳଞͷଛࣦؔ਺ʹҰൠԽՄೳ • ऩଋϨʔτ͸ֶशσʔλʹඇґଘ from "Fast Convergence of Natural Gradient Descent for Over-Parameterized Neural Networks", G. Zhang+, NeurIPS 2019
  75. ɹ ֶशʹؔΘΔཧ࿦ [Optimizerಛੑ] 82 Fast Convergence of Natural Gradient Descent

    for Over- Parameterized Neural Networks [Over-Parameterizedͳ৔߹ͷࣗવޯ഑๏ͷऩଋղੳ] 2-LayerͷReLUͷNNΛߟ͑Δ from "Fast Convergence of Natural Gradient Descent for Over- Parameterized Neural Networks", G. Zhang+, NeurIPS 2019 ※ ͸ग़ྗ૚ͷr൪໨ͷॏΈ ar े෼େ͖ͳNNͷ෯ΛऔΕ͹ɺࣗવޯ഑๏Ͱ͸ֶश཰Λ1ʹͯ͠ɺΦʔμʔ1stepͰֶश͕Մೳͳ ͜ͱΛࣔͨ͠ => ޯ഑߱Լ๏ΑΓ΋ྑ͍ऩଋੑೳ
  76. ϑΟογϟʔ৘ใߦྻͷ ϒϩοΫର֯ۙࣅ K-FACͷۙࣅख๏ ΫϩωοΧʔҼࢠ෼ղΛ ༻͍ͨظ଴஋ۙࣅ ϑΟογϟʔ৘ใߦྻͷ ٯٖࣅߦྻ ɹ ֶशʹؔΘΔཧ࿦ [Optimizerಛੑ]

    83 Fast Convergence of Natural Gradient Descent for Over- Parameterized Neural Networks [ࣗવޯ഑๏ͷۙࣅख๏K-FACͷऩଋղੳ / K-FACͷ঺հ] from "Optimizing Neural Networks with Kronecker- factored Approximate Curvature ", Fig. 5, . 6, J. Martens+, ICML 2015 ਂ૚ֶशʹ͓͚Δࣗવޯ഑๏ͷ໰୊఺ • ๲େͳύϥϝʔλ(N)ʹର͠ɺڊେͳϑΟογϟʔ৘ใߦྻ (N x N)ͷٯߦྻܭࢉ͕ඞཁ e.g. ResNet-50Ͱ͸໿ N=3.5×106 (໿12PBͷϝϞϦফඅ) ϑΟογϟʔ৘ใߦྻ(ٴͼٯߦྻ)Λۙࣅ͢Δख๏ ΫϩωοΧʔҼࢠ෼ղΛ༻͍ͨۙࣅΛߦ͏K-FAC(Kronecker- Factored Approximate Curvature) [J. Martens et al. 2015] => K-FACͰ͸2ஈ֊ͷۙࣅΛߦ͏͜ͱͰϝϞϦফඅྔͱܭࢉྔ Λ཈͍͑ͯΔ ໨తؔ਺ͷޯ഑ : ϑΟογϟʔ৘ใߦྻ ࣗવޯ഑๏(NGD)
  77. ɹ ֶशʹؔΘΔཧ࿦ [Optimizerಛੑ] 84 Fast Convergence of Natural Gradient Descent

    for Over- Parameterized Neural Networks [ࣗવޯ഑๏ͷۙࣅख๏K-FACͷऩଋղੳ] from "Fast Convergence of Natural Gradient Descent for Over- Parameterized Neural Networks", G. Zhang+, NeurIPS 2019 Over-ParameterizedͳNNͷֶशʹ͓͍ͯɺK-FACͰ΋େҬతऩଋΛอূ ޯ഑߱Լ๏Ͱ͸ऩଋRate͸άϥϜߦྻGͷ৚݅਺ʹΑܾͬͯఆ͞ΕΔ͕ɺ ࣗવޯ഑๏͸ ͷ৚݅਺ʹґଘ X⊤X
  78. ɹ ֶशʹؔΘΔཧ࿦ [Optimizerಛੑ] 85 Fast Convergence of Natural Gradient Descent

    for Over- Parameterized Neural Networks [ࣗવޯ഑๏ͷ൚Խੑೳղੳ] ࣗવޯ഑๏͸ޯ഑߱Լ๏ͱ΄΅ಉ͡ଛࣦͷBoundͰɺແݶ෯Ͱ͸ɺ྆ऀ͸ಉ͡ղʹऩଋ͢Δ͜ͱΛࣔͨ͠ NGD΍ଞͷલఏ৚݅෇͖ޯ഑߱Լ๏͸ɺ൚Խੑೳͷ఺Ͱޯ഑߱Լ๏ΑΓ΋ੑೳ͕ྼΔͱਪଌ͞Ε͍ͯΔ => ͜ͷ࿦จͰɺࣗવޯ഑๏͕ޯ഑߱Լ๏ͱಉ༷ʹ൚Խ͢Δ͜ͱΛূ໌ from "Fast Convergence of Natural Gradient Descent for Over-Parameterized Neural Networks", G. Zhang+, NeurIPS 2019
  79. • The G. Hinton group at the Vector Institute in Toronto proposes a new optimization method, Lookahead

    • SGD͸γϯϓϧ͕ͩੑೳ͕ߴ͘ɺͦͷೋछྨͷ೿ੜख๏΋޿͘࢖ΘΕ͍ͯΔ 1. దԠతֶश཰ܕ : Adam, AdaGradͳͲ 2. Ճ଎๏ܕ : Nesterov Momentum, Polyakheavy-ballͳͲ ※ ͜ΕΒͷੑೳΛҾ͖ग़͢ʹ͸ϋΠύʔύϥϝʔλͷߴ౓ͳνϡʔχϯά͕ඞཁ • ্هΛղܾ͢ΔɺLookahead ͷڧΈ͸Լهͷೋͭ a) ֶशͷ҆ఆԽɾߴ଎Խ / b) ϋΠύʔύϥϝʔλʹؔͯ͠ϩόετ [֓ཁ] ɹ ֶशʹؔΘΔཧ࿦ [Optimizerಛੑ] 86 Lookahead Optimizer: k steps forward, 1 step back from "Lookahead Optimizer: k steps forward, 1 step back", Fig. 5(ࠨ) Table. 1(ӈ), M. Zhang+, NeurIPS 2019 08
  80. • Polyak Averaging : ತ࠷దԽͷߴ଎Խख๏ͱͯ͠ఏҊ 1. ॏΈͷࢉज़ฏۉΛͱΔ͜ͱͰɺΑΓߴ଎ͳ ತ࠷దԽͷऩଋ 2. NNֶशʹ͓͚ΔॏΈฏۉԽ͕࠷ۙ஫໨͞Ε͍ͯΔ

    • SWA; Stochastic Weight Averaging => NNͷॏΈΛҟͳΔֶश࣌఺Ͱͷαϯϓϧ͔ΒฏۉԽͯ͠ΞϯαϯϒϧΛ࡞੒͠(্ਤ)ɺߴ͍ੑೳΛൃش • Regularized Nonlinear Acceleration (RNA) 1. ޯ഑͕θϩʹͳΔ఺Λݟ͚ͭΑ͏ͱ͢ΔΞϧΰϦζϜ 2. ௚ۙͷkճͷ൓෮ʹج͍ͮͯҰ࣍ܥΛղ͕͘ɺϝϞϦফඅͱܭࢉྔ͕kഒʹͳΔ [ݚڀഎܠ] ɹ ֶशʹؔΘΔཧ࿦ [Optimizerಛੑ] 87 from "Averaging Weights Leads to Wider Optima and Better Generalization", Fig. 1, P. Izmailov+, UAI 2018 Lookahead Optimizer: k steps forward, 1 step back SGDͱSWAͷղͷೋ࣍ݩPlotਤ
  81. • Weight Λ Slow Weight ͱ Fast Weight ʹ෼͚ෳ੡͢Δ •

    Fast Weight Λ kճߋ৽ޙ(Fast Weightͷߋ৽͸SGDͰ΋AdamͰ΋ྑ͍)ɺSlow Weight Λߋ৽͢Δ => ߋ৽ઌΛߟྀͨ͠ߋ৽ଇΛಋೖ • Slow Weight Λ Evaluation ʹ͸༻͍Δ [Lookahead Optimizer] ɹ ֶशʹؔΘΔཧ࿦ [Optimizerಛੑ] 88 Lookahead Optimizer: k steps forward, 1 step back from "Lookahead Optimizer: k steps forward, 1 step back", M. Zhang+, NeurIPS 2019 Lookagead Optimizerͷ୳ࡧํ๏֓೦ਤ
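The update rule on this slide (k fast steps, then one slow step toward the fast weights) can be written in a few lines. This is a minimal PyTorch sketch with an SGD inner optimizer; k, the slow step size, the model, and the data are illustrative assumptions.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Linear(10, 1)
inner_opt = torch.optim.SGD(model.parameters(), lr=0.1)    # updates the fast weights
slow = [p.detach().clone() for p in model.parameters()]    # slow weights
k, alpha_slow = 5, 0.5
x, y = torch.randn(128, 10), torch.randn(128, 1)

for step in range(1, 501):
    loss = nn.functional.mse_loss(model(x), y)
    inner_opt.zero_grad(); loss.backward(); inner_opt.step()     # fast-weight update
    if step % k == 0:
        with torch.no_grad():
            for p_slow, p_fast in zip(slow, model.parameters()):
                p_slow += alpha_slow * (p_fast - p_slow)   # slow weights step toward fast weights
                p_fast.copy_(p_slow)                       # fast weights restart from the slow weights

# Evaluation uses the slow weights (already copied back into the model here).
```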
  82. • Fast Weight ͷߋ৽͸ɺબ୒͢Δ࠷దԽΞϧΰϦζϜʹґ ଘ͢Δ(Լਤ: ֶशͷաఔͰAccuracyͷมԽ͕େ͖͍) • Slow Weight ͸಺෦ͷϧʔϓ(Fast

    Weight ͷkճͷߋ৽)ͷ ॏΈͷࢦ਺ҠಈฏۉʢEMAʣͱͯ͠ղऍͰ͖Δ => ෼ࢄ௿ݮͷޮՌ͕͋Δ [Lookahead Optimizerͷੑ࣭] ɹ ֶशʹؔΘΔཧ࿦ [Optimizerಛੑ] 89 Lookahead Optimizer: k steps forward, 1 step back from "Lookahead Optimizer: k steps forward, 1 step back", Fig. 1(ࠨ). 10(ӈ), M. Zhang+, NeurIPS 2019
  83. ଛࣦͷظ଴஋͸ԼهͷΑ͏ʹද͞ΕΔ NQM; Noisy Quadratic Model [J. Martens+, NeurIPS2019] ͷߟ͑Λ΋ͱʹɺ෼ࢄʹண໨ Lookahead

    ͱ SGD ͷऩଋڍಈΛൺֱ => , Λର֯ߦྻͱ͠ɺͦͷର֯ཁૉΛͦΕͧΕ ͱͯ͠ɺϞσϧΛԼهͷΑ͏ʹఆٛ A Σ ai , σ2 i [ऩଋղੳ] ɹ ֶशʹؔΘΔཧ࿦ [Optimizerಛੑ] 90 Lookahead Optimizer: k steps forward, 1 step back ※ ͱ͢Δ c ∼ (0,Σ) ֶश཰Λ ͱ͠ɺϊΠζͷଟ͍2࣍ϞσϧͰ͸ɺLookaheadͷ಺෦ͷOptimizerΛSGD ͱ͢ΔͱɺLookaheadɺSGD྆ऀͱ΋ظ଴஋͕ 0 ʹऩଋ͠ɺ෼ࢄ͸࣍ͷݻఆ఺ʹऩଋ 0 < γ < 2 L (L = maxi ai) α∈(0,1)ͷ࣌ɺ ͷӈลͷୈҰ߲͸ඞͣ1ҎԼʹͳΔ => Lookahead ͷํ͕෼ࢄ͕খ͍͞ V* LA
  84. [࣮ݧ݁Ռ] ɹ ֶशʹؔΘΔཧ࿦ [Optimizerಛੑ] 91 Lookahead Optimizer: k steps forward,

    1 step back from "Lookahead Optimizer: k steps forward, 1 step back", Fig. 5(ࠨ্). 8(ӈ্)(ࠨԼ). 9(ӈԼ)M. Zhang+, NeurIPS 2019 Lookahead͸ଞͷOptimizerΑΓߴ͍ੑೳ Lookahead͸ଞͷOptimizerΑΓ௿͍ ϋΠύʔύϥϝʔλґଘੑ
  85. [࣮ݧ݁Ռ] ɹ ֶशʹؔΘΔཧ࿦ [Optimizerಛੑ] 92 Lookahead Optimizer: k steps forward,

    1 step back from "Lookahead Optimizer: k steps forward, 1 step back", Fig. 6(ࠨ্). 7(Լ)(ӈ), M. Zhang+, NeurIPS 2019 WMT14 English-to-German task with Transformer Model Lookahead͸ଞͷOptimizerΑΓߴ͍ੑೳ Λࣔ͢͜ͱΛ࣮ݧͰ΋֬ೝ
  86. ɹ ֶशʹؔΘΔཧ࿦ 93 ֶशʹؔΘΔཧ࿦ͷࡉ෼Խ Training Dynamics of DNN Optimizer Characteristics

    Initialization Learning RateɾMomentumɾDataAugmentation RegularizationɾNormalization Training Loop Initialization Mini-Batch SGD Training w/ Regularization
  87. ɹ ֶशʹؔΘΔཧ࿦ [ॳظԽ] 94 MetaInit: Initializing learning by learning to

    initialize • ޮՌతʹֶश͢ΔͨΊͷॳظԽΛֶश͢Δํ๏ΛఏҊ(ॳظԽͷࣗಈԽͷୈҰาͱͯ͠Ґஔ෇͚) => (1)ॏΈͷNormɺ(2)ۂ཰ͷޯ഑ʹର͢ΔӨڹΛௐ੔͢Δ͜ͱ͕Մೳ • (1)ͳͥ Normʹண໨͢Δͷ͔ʁ => ʮશ݁߹૚ͰͷॏΈ΍όΠΞεͷNorm͸ޯ഑͕ൃࢄ/ফࣦ͢Δ͔Ͳ͏͔Λ੍ޚ͢ΔʯԾઆ*1ͷ΋ͱɺ Normͷௐ੔Λ͢Δண૝ • (2)ͳͥɺۂ཰ͷޯ഑ʹର͢ΔӨڹʹண໨͢Δͷ͔ʁ => ʮܭࢉίετͷ໘Ͱɺೋ࣍࠷దԽΛ༻͍ͳ͍৔߹ɺۂ཰ʹΑΔޯ഑ʹରͯ͠ͷӨڹ͕গͳ͍ྖҬͰֶशΛ ࢝ΊΔ͜ͱ͕ྑ͍ॳظԽͷͨΊʹॏཁͱͳΔʯͱ͍͏ԾઆΛཱ͓ͯͯΓɺ࣮ݧʹͯݕূɺԾઆͷࠜڌ͸*2 *1: Xavier Glorot and Yoshua Bengio. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the thirteenth international conference on artificial intelligence and statistics, pages 249–256, 2010. *2: ҎԼͷ͔̏ͭΒԾઆ͕ੜ·ΕΔ - David Balduzzi, Marcus Frean, Lennox Leary, JP Lewis, Kurt Wan-Duo Ma, and Brian McWilliams. The shattered gradients problem: If resnets are the answer, then what is the question? In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pages 342–350. JMLR. org, 2017. - Jeffrey Pennington, Samuel Schoenholz, and Surya Ganguli. Resurrecting the sigmoid in deep learning through dynamical isometry: theory and practice. In Advances in neural information processing systems, pages 4785–4795, 2017. - George Philipp and Jaime G Carbonell. The nonlinearity coefficient-predicting generalization in deep neural networks. arXiv preprint arXiv:1806.00179, 2018. [֓ཁ] 09
  88. Gradient Quotient(ҎԼGQ)ͱݺ͹ΕΔޯ഑มԽྔͷࢦඪΛఏҊ => ׬શͳϔγΞϯΛܭࢉ͢Δͷʹଟେͳඅ༻Λ͔͚Δ͜ͱͳ͘, θ෇ۙͷۂ཰ͷӨڹΛଌఆ͢ΔྔΛߏங͍ͨ͠ͱ͍͏ಈػ GQ͸ύʔύϥϝʔλθͷ”࣭”Λද͢ࢦඪͱͯ͠ղऍͰ͖ɺޯ഑͕Ұఆͷ࣌GQ=0 ॳظԽʹ͓͍ͯɺGQΛԼ͛Δํ޲ʹޯ഑߱Լ๏ͰMetaStepsͱͯ͠ ࣄલʹֶशͤ͞Δ͜ͱͰɺॳظԽΛվળ͢Δ͜ͱ͕Մೳ => (2)ॳظͷ਺εςοϓʹ͓͍ͯͷۂ཰ͷޯ഑΁ͷӨڹɺ(1)ϊϧϜΛௐ੔

    GQʹؔ͢Δ࣮ݧͰ͸ɺGQࣜͰ͸L1ਖ਼ଇԽΛ͍ͯ͠Δ͕ɺ L2ਖ਼ଇԽͷ৔߹͸Lossͷऩଋ͕஗͍(ӈ্ਤ) ·ͨɺMetaInitΛ༻͍ͨ৔߹͸ଞͷॳظԽํ๏ΑΓ΋ૣ͘ऩଋ͢Δ(ӈԼਤ) ɹ ֶशʹؔΘΔཧ࿦ [ॳظԽ] 95 MetaInit: Initializing learning by learning to initialize GQ(L, θ) = 1 N ∥ H(θ)g(θ) g(θ) ∥1 = 1 N ∥ Σi λi ci (eT j vi ) Σi ci (eT j vi ) ∥ from "MetaInit: Initializing learning by learning to initialize", Fig. 1(ࠨ). 4(ӈ্). 5(ӈԼ), Y. Dauphin+, NeurIPS 2019 [ఏҊख๏] λi: ݻ༗஋, vi: ݻ༗ϕΫτϧ, ej: ඪ४جఈϕΫτϧ ci: ԼهఆٛͱͳΔΑ͏ͳεΧϥ
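The gradient quotient written above can be evaluated with a Hessian-vector product instead of forming the full Hessian. The sketch below computes GQ(L, θ) = (1/N)‖H(θ)g(θ) ⊘ g(θ)‖₁ as stated on the slide; the toy model, the data, and the ε used to stabilize the division are illustrative assumptions (MetaInit itself then lowers this quantity by adjusting only the norms of the initial weights).

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Sequential(nn.Linear(20, 64), nn.ReLU(), nn.Linear(64, 1))
x, y = torch.randn(256, 20), torch.randn(256, 1)

def gradient_quotient(model, eps=1e-5):
    params = [p for p in model.parameters() if p.requires_grad]
    loss = nn.functional.mse_loss(model(x), y)
    g = torch.autograd.grad(loss, params, create_graph=True)     # g(theta)
    gnorm_sq = sum((gi ** 2).sum() for gi in g)
    hg = torch.autograd.grad(gnorm_sq / 2, params)               # H(theta) g(theta), via double backward
    num = torch.cat([h.flatten() for h in hg])
    den = torch.cat([gi.flatten() for gi in g]).detach()
    return (num / (den.abs() + eps)).abs().mean()                # elementwise |Hg / g|, averaged

print(gradient_quotient(model))
```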
  89. ɹ ֶशʹؔΘΔཧ࿦ [ॳظԽ] 96 from "MetaInit: Initializing learning by learning

    to initialize", Fig. 3(্)Table. 1(Լ), Y. Dauphin+, NeurIPS 2019 [࣮ݧ݁Ռͱ݁࿦] 1. MetaInitͷNorm΁ͷޮՌଌఆ࣮ݧ(্ਤ): ੑೳ͕௿͍ͱ෼͔͍ͬͯΔॳظԽ*3Λઃఆͨ͠৔߹(੺ઢ)͕ MetaInitʹΑΔֶश(ࢵઢ)ʹΑΓɺྑ͍ॳظԽ(੨ઢ)ʹۙͮ͘ *3: Gaussian(0, σ 2 ) is a bad initialization that has nonetheless been used in influential papers like [Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pages 1097–1105, 2012] 2. ྑ͍ॳظԽ͕ະ஌ͷ৔߹ͰMetaInitͷޮՌΛ࣮ࣔ͢ݧ(Լද) ୯ʹDeltaOrthogonalॳظԽΛͨ͠৔߹ͱɺ DeltaOrthogonalॳظԽʹMetaInitΛࢪͨ͠৔߹ͷΤϥʔ཰Λൺֱ ͞ΒʹɺSkip-ConnectionΛϞσϧؚ͕Ή৔߹ʹҰൠʹ༻͍ΒΕΔBNͱ Skip-ConnectionΛϞσϧؚ͕Ή·ͳ͍৔߹ʹ༗ޮͱ͞ΕΔॳظԽख๏Ͱ ͋ΔLSUVͱ΋ൺֱ MetaInitΛ࢖༻͢ΔͱɺMetaInitແ͠Ͱ͸ֶशෆՄೳͰ͋ͬͨॳظԽ ख๏Ͱ΋ֶशՄೳʹͳΓɺBN΍LSUVʹฒͿ΄Ͳͷ݁Ռʹ MetaInit: Initializing learning by learning to initialize
  90. ɹ ֶशʹؔΘΔཧ࿦ [ॳظԽ] 97 MetaInit: Initializing learning by learning to

    initialize [ࠓޙͷ՝୊] 1. MetaInitΛ͏·͘ద༻͢ΔͨΊʹ͸ϋΠύʔύϥϝʔλͷௐ੔͕ॏཁ 2. ϝλֶशʹ͓͍ͯͷ໨తؔ਺ޯ഑߱Լ๏͕ɺେ͖ͳϞσϧͰ͸ࣦഊ͢ΔՄೳੑ͕͋Δ => GQͷఆ্ٛɺྑ޷ͳGradientɾHessianੵΛਪఆ͢ΔͨΊʹ͸ͱͯ΋େ͖ͳόοναΠζ͕ඞཁ 3. ॳظԽΛվળ͚ͨͩ͠Ͱ͸ɺ਺஋తʹෆ҆ఆͳ໰୊͸ղܾͰ͖͍ͯͳ͍ => e.g Low Precision, w/o Normalization 4. MetaInit͸ॏΈͷॳظ஋ͷNormΛܾΊΔ͚ͩͰɺ෼෍ΛબͿͷ͸ґવͱͯ͠ਓखͰ͋Δ => Normͷௐ੔΋େ͖ͳӨڹΛ༩͑Δ
  91. ɹ ֶशʹؔΘΔཧ࿦ [ॳظԽ] 98 How to Initialize your Network? Robust

    Initialization for WeightNorm & ResNets • WeightNorm(ҎԼWN)΍ResNetΛ༻͍Δ৔߹ͷॳظԽख๏ΛఏҊ ∵ BatchNorm(ҎԼBN)Λલఏͱͨ͠ॳظԽख๏͸͋Δ͕ɺWN΍ResNetͷಛ௃͕ߟྀ͞Ε͓ͯΒͣɺଞͷઃఆͰ༗ޮ ͳॳظԽख๏Ͱ୅ସ͞Ε͍ͯΔ => Key Point ͸ɺϨΠϠʔ΁ͷೖྗͱग़ྗͷϊϧϜ͕҆ఆ͢ΔΑ͏ʹॳظԽ͢Δ͜ͱ • ఏҊख๏ʹΑΓɺ૚ͷ਺ɺֶश཰ͳͲͷϋΠύʔύϥϝʔλʹର͢ΔϩόετੑΛ࣮ݱ (2,500ճҎ্ͷ࣮ݧͰ࣮ূ) • ύϥϝʔλۭؒͷ௿͍ۂ཰ྖҬͰॳظԽ͢Δ͜ͱͰɺΑΓྑ͍൚Խੑೳͱ૬͕ؔ͋Δ͜ͱ͕஌ΒΕ͍ͯΔେ͖ͳֶश ཰Λར༻͢Δ͜ͱ͕Մೳʹ [֓ཁ] from "How to Initialize your Network? Robust Initialization for WeightNorm & ResNets", Table. 2, D. Arpit+, NeurIPS 2019 10
  92. ReLU͔Β͘Δ ɹ ֶशʹؔΘΔཧ࿦ [ॳظԽ] 99 How to Initialize your Network?

    Robust Initialization for WeightNorm & ResNets WN Λ༻͍ͨ Feedforward Network w/ ReLUʹରͯ͠ɺ [ఏҊख๏] ॏΈߦྻΛ௚ަԽ(௚ަม׵ͷੑ࣭Λར༻͢ΔͨΊ)͠ɺόΠΞε͕θϩʹͳΔΑ͏ʹॳظԽ͢Δ͜ͱΛఏҊ ֤૚ͷग़ྗͷϊϧϜ͕͋Δఔ౓ҡ࣋͞ΕΔΑ͏ɺForward Pass, Backward PassͰҟͳΔॳظԽ͕ಋग़(ࠨԼද) => ࣮ݧతʹForwardͷಋग़ͷΈ༻͍Δ͜ͱ͕ޮՌతͰ͋Δͱ݁࿦ from "How to Initialize your Network? Robust Initialization for WeightNorm & ResNets", Fig. 1, D. Arpit+, NeurIPS 2019 Scale FactorͰwͷnormΛ੍ޚ wͷํ޲Λ੍ޚ ※nl-1 ͱ nl ͸ͦΕͧΕ l-1,l ൪໨ͷ૚ͷೖྗ਺ͱग़ྗ਺
  93. ɹ ֶशʹؔΘΔཧ࿦ [ॳظԽ] 100 How to Initialize your Network? Robust

    Initialization for WeightNorm & ResNets WN Λ༻͍ͨ Residual Network w/ ReLUʹରͯ͠ɺ [ఏҊख๏] ͨͩ͠ɺB͸Residual Block୯ҐΛࣔ͢(ӈਤ) ॏΈߦྻΛ௚ަԽ(௚ަม׵ͷੑ࣭Λར༻͢ΔͨΊ)͠ɺ ֤૚ͷग़ྗͷϊϧϜ͕͋Δఔ౓ҡ࣋͞ΕΔΑ͏ͳॳظԽΛಋग़(Լද) from "How to Initialize your Network? Robust Initialization for WeightNorm & ResNets", Fig. 5(্). 1(Լ), D. Arpit+, NeurIPS 2019 ※fan-in, fan-out͸֤૚ͷೖग़ྗ਺ ResNetʹ͓͚Δ
 Skip-ConnectionͷωοτϫʔΫਤ
  94. ɹ Theory of training [Initialization] 101 How to Initialize your Network? Robust Initialization for WeightNorm & ResNets [Experimental results] With the proposed method, training is more robust to the hyperparameter configuration and to the number of layers in the network (bottom figure);

    with existing initializations, WeightNorm lagged behind BatchNorm in generalization performance, but with the proposed initialization it reaches a comparable level even at high learning rates (right figure). (Figure text from the paper's panels: "First layer in a Res Block / Second layer in a Res Block / Shortcut connections"; "Robustness analysis w.r.t. depth, hyperparameters and seed: for feedforward networks, the proposed initialization succeeds at training deeper feedforward networks and is more robust to hyperparameter configurations"; "Forward pass / Backward pass"; "Comparison with Batch Normalization: state-of-the-art architectures trained on CIFAR-10 and CIFAR-100; when combined with learning-rate warmup, the proposed initialization scheme enables the usage of larger learning rates, and larger learning rates help reduce the generalization gap with respect to networks with Batch Normalization"; "Initialization method and generalization gap: a number of papers have shown that SGD with large learning rates facilitates finding wider local minima, which correlate with better generalization; computing the log spectral norm of the Hessian at initialization shows that the local curvature is smallest for the proposed scheme, which explains why larger learning rates can be used.") from "How to Initialize your Network? Robust Initialization for WeightNorm & ResNets", Fig. 2 (left), Fig. 3 (bottom right), Table 1 (top right), D. Arpit+, NeurIPS 2019 [Conclusion] When training a WeightNorm network, an initialization suited to it is required rather than an arbitrary one.
  95. ɹ ֶशʹؔΘΔཧ࿦ 102 ֶशʹؔΘΔཧ࿦ͷࡉ෼Խ Training Dynamics of DNN Optimizer Characteristics

    Initialization Learning RateɾMomentumɾDataAugmentation RegularizationɾNormalization Training Loop Initialization Mini-Batch SGD Training w/ Regularization
  96. ɹ ֶशʹؔΘΔཧ࿦ [ֶश཰ɾMomentumɾDA] 103 Control Batch Size and Learning Rate

    to Generalize Well: Theoretical and Empirical Evidence from "Control Batch Size and Learning Rate to Generalize Well: Theoretical and Empirical Evidence”, FIg. 1, F. He+, NeurIPS 2019 [֓ཁ] • SGDʹΑΔDNNֶशͷ൚Խޡ͕ࠩɺόοναΠζ/ֶश཰ͱਖ਼ͷ૬ؔΛ࣋ͭ͜ͱΛࣔͨ͠ • ൚ԽޡࠩͱόοναΠζʹਖ਼ͷ૬ؔ • ൚Խޡࠩͱֶश཰ʹෛͷ૬ؔ • PAC-Bayes൚Խό΢ϯυΛ༻͍ͯཧ࿦తূ໌(ϋΠύʔύϥϝʔλΛߟྀͨ͠఺͕৽ن) • 1,600ݸͷϞσϧΛ༻͍࣮ͯݧ͠ɺ૬͕ؔ౷ܭతʹ༗ҙͰ͋Δ͜ͱΛࣔͨ͠ 11 όοναΠζ/ֶश཰ͷ૬ؔʹؔ͢Δ࣮ূ࣮ݧͷ݁Ռ
  97. ɹ ֶशʹؔΘΔཧ࿦ [ֶश཰ɾMomentumɾDA] 104 Control Batch Size and Learning Rate

    to Generalize Well: Theoretical and Empirical Evidence SGDͷPAC-Bayes൚Խό΢ϯυ͸ҎԼͷΑ͏ʹͳΔ (ಋग़͸ෳࡶͳͷͰলུ) ҎԼͷෆ౳͕ࣜ੒Γཱͭ ೚ҙͷਖ਼ͷ࣮਺ δ ∈ (0,1) ʹ͓͍ͯগͳ͘ͱ΋ 1 − δ ͷ֬཰Ͱ ্هͷԾఆ͕੒Γཱͭͱ͖ɺ൚Խό΢ϯυ͕όοναΠζ/ֶश཰ͱਖ਼ͷ૬ؔΛ࣋ͭ͜ͱΛূ໌Ͱ͖Δ (࣍ϖʔδͰ؆қతʹ঺հ) R(Q) : ະ஌ͷσʔλʹର͢Δଛࣦ, ̂ R(Q) : ܇࿅ଛࣦ, η : ֶश཰, C : ڞ෼ࢄߦྻ, |S| : όοναΠζ A : ہॴత࠷దղ෇ۙͷଛࣦؔ਺ͷϔοηߦྻ, d : ύϥϝʔλͷ࣍ݩ਺, N : σʔλ਺ R(Q) ≦ ̂ R(Q) + η 2|S| tr(CA−1) + d log(2|S| η ) − log(det(CA−1)) − d + 2 log(1 δ ) + 2 log(N) + 4 4N − 2 Ծఆ : d > η 2|S| tr(CA−1) [ཧ࿦తূ໌] OverParameterized΍ࡢࠓͷDNNΛߟ͑Ε͹ɺे෼ڐ༰Ͱ͖Δաఔ
  98. ɹ ֶशʹؔΘΔཧ࿦ [ֶश཰ɾMomentumɾDA] 105 Control Batch Size and Learning Rate

    to Generalize Well: Theoretical and Empirical Evidence ൚Խό΢ϯυͱόοναΠζ/ֶश཰ͱͷਖ਼ͷ૬ؔ͸ҎԼͷΑ͏ʹͯࣔ͠͞ΕΔ [ཧ࿦తূ໌] ϧʔτ߲ͷ෼฼ Λ্هͷΑ͏ʹఆٛ͢ΔͱɺҎԼͷΑ͏ʹࣜมܗՄೳ I ൺ཰ ͷ ʹؔ͢Δಋؔ਺Λܭࢉͯ͠૬ؔؔ܎Λௐ΂Δ |S| η I ͱ͢Δͱ k = |S| η ͕ͨͬͯ͠൚Խό΢ϯυ͸όοναΠζ/ֶश཰ ͱਖ਼ͷ૬ؔؔ܎Λ࣋ͭ |S| η OverParameterized Ͱd→∞ͷԾఆ ͭ·ΓɺI ΛฏํࠜʹؚΉ൚Խ ό΢ϯυ͸ɺkʹؔͯ͠୯ௐ૿Ճ from "Control Batch Size and Learning Rate to Generalize Well: Theoretical and Empirical Evidence", F. He+, NeurIPS 2019
  99. ɹ ֶशʹؔΘΔཧ࿦ [ֶश཰ɾMomentumɾDA] 106 Control Batch Size and Learning Rate

    to Generalize Well: Theoretical and Empirical Evidence SGDͰֶश͞ΕͨDNNͷ൚Խೳྗʹର͢ΔόοναΠζ/ֶश཰ͷൺ཰ʹؔ͢ΔӨڹΛௐ΂ͨ [࣮ݧ֓ཁ] [࣮ݧํ๏] 4छྨͷ໰୊ઃఆ={CIFAR-10ͱCIFAR-100্ͰResNet-110ͱVGG-19Ͱͷֶश} 20छྨͷόοναΠζ={16,32,48,64,80,96,112, 128, 144, 160, 176, 192, 208, 224, 240, 256, 272, 288, 304, 320} 20छྨͷֶश཰={0.01, 0.02, 0.03, 0.04, 0.05, 0.06, 0.07, 0.08, 0.09, 0.10, 0.11, 0.12, 0.13, 0.14, 0.15, 0.16, 0.17, 0.18, 0.19, 0.20} ͜ΕΒͷશͯΛ૊Έ߹Θͤͯ1600 = (20 x 20 x 4)Ϟσϧͷ࣮ݧΛߦͬͨ ※ ֶशதͷόοναΠζͱֶश཰ΛҰఆͱ͠ɺΤϙοΫ਺͸200ɺmomentumͳͲͷSGDҎ֎ͷֶशςΫχοΫ ͸શͯ࢖༻͍ͯ͠ͳ͍
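The shape of this sweep can be reproduced in miniature. The sketch below runs the same grid-and-Spearman-correlation protocol on a tiny synthetic logistic-regression task purely for illustration; the task, model, grid, and step budget are assumptions, whereas the paper trains ResNet-110/VGG-19 on CIFAR-10/100 and reports a negative Spearman correlation between test accuracy and the batch-size/learning-rate ratio.

```python
from itertools import product
import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 20))
w_true = rng.normal(size=20)
y = (X @ w_true + 0.5 * rng.normal(size=2000) > 0).astype(float)
X_tr, y_tr, X_te, y_te = X[:1000], y[:1000], X[1000:], y[1000:]

def train_and_eval(batch_size, lr, steps=300):
    w = np.zeros(20)
    for _ in range(steps):
        idx = rng.integers(0, len(X_tr), batch_size)
        p = 1 / (1 + np.exp(-X_tr[idx] @ w))
        w -= lr * X_tr[idx].T @ (p - y_tr[idx]) / batch_size   # SGD step
    return ((X_te @ w > 0) == y_te).mean()                     # test accuracy

results = [(bs / lr, train_and_eval(bs, lr))
           for bs, lr in product(range(16, 321, 64), [0.01, 0.05, 0.1, 0.2])]
rho, p = spearmanr(*zip(*results))           # Spearman rank correlation (SCC) and p-value
print(rho, p)
```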
  100. ɹ ֶशʹؔΘΔཧ࿦ [ֶश཰ɾMomentumɾDA] 107 Control Batch Size and Learning Rate

    to Generalize Well: Theoretical and Empirical Evidence from "Control Batch Size and Learning Rate to Generalize Well: Theoretical and Empirical Evidence", Table. 1(ࠨ) Fig. 2(ӈ), F. He+, NeurIPS 2019 ֶश཰Λݻఆ͠CIFAR-10, CIFAR-100ΛResNet-110, VGG-19ʹ͓͍ͯ20ͷόοναΠζ(16ʙ320)Ͱֶश͠૬ؔΛௐ΂ͨ [࣮ݧ] SCC : εϐΞϚϯͷॱҐ૬ؔ܎਺ (1ʹ͍ۙ΄Ͳਖ਼ͷ૬ؔɺ-1ʹ͍ۙ΄Ͳෛͷ૬ؔ) p : ૬͕ؔͳ͍֬཰ ਫ਼౓ͱόοναΠζͷෛͷ૬ؔΛද͢άϥϑ ॎ࣠ɿਫ਼౓, ԣ࣠ɿόοναΠζ LRݻఆͷ৔߹όοναΠζ͸ େ͖͘ͳΔ΄Ͳਫ਼౓͕Լ͕Δ
  101. ɹ ֶशʹؔΘΔཧ࿦ [ֶश཰ɾMomentumɾDA] 108 Control Batch Size and Learning Rate

    to Generalize Well: Theoretical and Empirical Evidence όοναΠζΛݻఆ͠CIFAR-10ͱCIFAR-100্ͰResNet-110ͱVGG-19Λ20ͷֶश཰(0.01ʙ0.2)Ͱֶश͠૬ؔΛௐ΂ͨ [࣮ݧ] SCC : εϐΞϚϯͷॱҐ૬ؔ܎਺ (1ʹ͍ۙ΄Ͳਖ਼ͷ૬ؔɺ-1ʹ͍ۙ΄Ͳෛͷ૬ؔ) p : ૬͕ؔͳ͍֬཰ ਫ਼౓ͱֶश཰ͷਖ਼ͷ૬ؔΛද͢άϥϑ ॎ࣠ɿਫ਼౓, ԣ࣠ɿֶश཰ from "Control Batch Size and Learning Rate to Generalize Well: Theoretical and Empirical Evidence", Table. 2(ࠨ) Fig. 2(ӈ), F. He+, NeurIPS 2019 BSݻఆͷ৔߹LR͸ େ͖͘ͳΔ΄Ͳਫ਼౓্͕͕Δ
  102. ɹ ֶशʹؔΘΔཧ࿦ [ֶश཰ɾMomentumɾDA] 109 Control Batch Size and Learning Rate

    to Generalize Well: Theoretical and Empirical Evidence લड़ͷ2ͭͷ࣮ݧ݁ՌΛݩʹਫ਼౓ͱόοναΠζ/ֶश཰ͷൺ཰ͷؔ܎Λϓϩοτͨ͠ [࣮ݧ] ਫ਼౓ͱɺόοναΠζ/ֶश཰ͷൺ཰ͷεϐΞϚϯͷॱҐ૬ؔ܎਺ٴͼp஋Λௐ΂ͨ from "Control Batch Size and Learning Rate to Generalize Well: Theoretical and Empirical Evidence", Fig. 1(ࠨ)Table. 3(ӈ), F. He+, NeurIPS 2019 όοναΠζ/ֶश཰ͷ૬ؔʹؔ͢Δ࣮ূ࣮ݧͷ݁Ռ
  103. ɹ ֶशʹؔΘΔཧ࿦ [ֶश཰ɾMomentumɾDA] 110 Painless Stochastic Gradient: Interpolation, Line-Search, and

    Convergence Rates from "Painless Stochastic Gradient: Interpolation, Line-Search, and Convergence Rates”, S. Vaswani+, NeurIPS 2019 - SGD͸؆୯ͳ࣮૷Ͱ࢖͍΍͘͢ྑ͍൚Խͷಛ௃Λ͕࣋ͭɺ ऩଋ͕஗͘ɺLRΛtuning͢Δඞཁ͕͋Δ - SGDͷ࣮༻తͳੑೳ͸LRʹେ͖͘ґଘ(Painful) => ຊݚڀͰ͸ɺLine SearchͷTechniqueΛ༻͍ͯLRΛ ࣗಈతʹઃఆ [֓ཁ] ໨తؔ਺ SGDͷߋ৽ଇ 12 ҟͳΔֶश཰ʹΑΔSGDͷऩଋͷ༷૬ͷ֓೦ਤ
  104. ɹ ֶशʹؔΘΔཧ࿦ [ֶश཰ɾMomentumɾDA] 111 Painless Stochastic Gradient: Interpolation, Line-Search, and

    Convergence Rates from "Painless Stochastic Gradient: Interpolation, Line-Search, and Convergence Rates”, Fig. 1, S. Vaswani+, NeurIPS 2019 [Interpolation SettingͷԾఆ] యܕతͳNN΍දݱྗ๛͔ͳ Kernel Mapping ΛؚΉେن໛ͳOver-ParametrizedͳϞσϧͰຬͨ͢ ҰఆͷεςοϓαΠζͷSGDʹ͓͍ͯߴ଎ʹऩଋ͢Δ݁Ռʹ [SLS; Stochastic Line Search] खಈͰLRΛௐ੔ͤͣɺࣗಈͰLRΛ୳ࡧ Stochastic Armijo Condition LR͕͜ͷ৚݅Λຬͨ͢·Ͱ୳ࡧ ※ ͸σʔλϙΠϯτiʹؔ͢Δଛࣦؔ਺ fi (w*)
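A minimal PyTorch sketch of SGD with a backtracking line search that enforces the stochastic Armijo condition above on each mini-batch; the linear model, the data, c, the backtracking factor, and the maximum step size are illustrative assumptions rather than the exact SLS algorithm of the paper.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Linear(20, 1)
x, y = torch.randn(512, 20), torch.randn(512, 1)
c, beta, eta_max = 0.1, 0.7, 1.0

def batch_loss(params, xb, yb):
    w, b = params
    return nn.functional.mse_loss(xb @ w.t() + b, yb)

params = list(model.parameters())
for step in range(200):
    idx = torch.randint(0, 512, (32,))
    xb, yb = x[idx], y[idx]
    loss = batch_loss(params, xb, yb)
    grads = torch.autograd.grad(loss, params)
    gnorm_sq = sum((g ** 2).sum() for g in grads)

    eta = eta_max
    while True:                      # backtrack until the stochastic Armijo condition holds
        trial = [p - eta * g for p, g in zip(params, grads)]
        if batch_loss(trial, xb, yb) <= loss - c * eta * gnorm_sq or eta < 1e-6:
            break
        eta *= beta
    with torch.no_grad():
        for p, g in zip(params, grads):
            p -= eta * g             # take the accepted step
```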
  105. ɹ ֶशʹؔΘΔཧ࿦ [ֶश཰ɾMomentumɾDA] 112 Painless Stochastic Gradient: Interpolation, Line-Search, and

    Convergence Rates from "Painless Stochastic Gradient: Interpolation, Line- Search, and Convergence Rates”, Fig. 4(ӈ), . 5(ࠨ), S. Vaswani+, NeurIPS 2019 [࣮ݧ݁Ռ] SLS͸ɺΑΓ଎͘ΑΓྑ͍࠷దԽΛୡ੒(ӈਤ) ͨͩ͠ɺ1൓෮ͷܭࢉ࣌ؒ͸SGDΑΓ௕͔͔͘Δ(Լਤ)
  106. ɹ ֶशʹؔΘΔཧ࿦ [ֶश཰ɾMomentumɾDA] 113 Using Statistics to Automate Stochastic Optimization

    from "Using Statistics to Automate Stochastic Optimization ”,Fig. 1, H. Lang+, NeurIPS 2019 - SGDͷLRͷௐ੔͕ɺػցֶशͷ࣮༻తͳੑೳΛಘΔͨΊͷେ͖ͳো֐ʹ => [໰୊] खಈͰTuningͨ͠৔߹͕ɺConstant LR΍Polynomial DecayΑΓੑೳ͕ߴ͍(Լਤ; SGM͸SGD w/ Momentum) - ൓෮͝ͱʹLRΛม͑ΔͷͰ͸ͳ͘ɺ࠷΋ҰൠతͳखಈTuningΛࣗಈԽ͢ΔΞϓϩʔνΛఏҊ => "Progress Stops” ·ͰConstant LRΛ࢖༻͠ɺͦͷޙLRΛ௿Լͤ͞Δ [ݚڀഎܠ] 13
  107. ɹ ֶशʹؔΘΔཧ࿦ [ֶश཰ɾMomentumɾDA] 114 Using Statistics to Automate Stochastic Optimization

    from "Using Statistics to Automate Stochastic Optimization”,Algorithm 2,3(ࠨ), Fig. 2(ӈ), H. Lang+, NeurIPS 2019 - SGDͷμΠφϛΫε͕ఆৗ෼෍ʹ౸ୡͨ͠ͱ͖Λܾఆ͚ͮΔ໌ࣔతͳ౷ܭతݕఆΛઃܭ => ֶशதʹ࣮ߦ͠ɺLRΛҰఆͷ৐ࢉ܎਺Ͱݮগͤ͞Δ(ࠨԼਤ) => ౷ܭతదԠ֬཰ۙࣅ๏(SASA; Statistical Adaptive Stochastic Approximation)͸ࣗಈతʹྑ͍LR Schedule Λݟ͚ͭɺͦͷύϥϝʔλͷσϑΥϧτઃఆΛ࢖༻ͯ͠खಈTuning͞Εͨ৔߹ͷੑೳͱҰக͢Δ͜ͱΛ࣮ূ(ӈԼਤ) [Contribution]
  108. ɹ ֶशʹؔΘΔཧ࿦ [ֶश཰ɾMomentumɾDA] 115 Stagewise Training Accelerates Convergence of Testing

    Error Over SGD from "Stagewise Training Accelerates Convergence of Testing Error Over SGD”, Fig. 1(ӈ)Algorithm 1(্) Z. Yuan+, NeurIPS 2019 - Stagewise Training Strategy͸ൺֱతେ͖ͳLR͔Βελʔτ͠ɺֶशաఔͰLRΛزԿֶతʹݮਰͤ͞ΔΞϧΰϦζϜͰɺ ޿͘NNͷֶशʹ༻͍ΒΕ͍ͯΔ - LR͕ଟ߲ࣜʹݮਰ͢ΔSGDΑΓ΋ɺStagewise SGDͷํ͕ɺ܇࿅͓Αͼςετޡ͕ࠩ଎͘ऩଋ͢Δ͜ͱ͕஌ΒΕ͍ͯΔ => ͜ͷݱ৅ʹ͍ͭͯͷઆ໌͕طଘݚڀͰ΄ͱΜͲߦΘΕ͍ͯͳ͍ͨΊɺཧ࿦ఏڙ [Reference] => Polyak-Łojasiewicz৚݅ԼͰܦݧతϦεΫΛ࠷খԽ͢ΔͨΊͷStagewise SDGΛݕ౼(START; Stagewise Regularized Training AlgorithmɺࠨԼਤ) => ܇࿅ޡࠩͱςετޡࠩͷ྆ํʹ͓͍ͯɺSGD w/ Poly DecayΑΓ΋Stagewise SDGͷऩଋ͕଎͍͜ͱΛ֬ೝ(ӈԼਤ) 14
  109. ɹ ֶशʹؔΘΔཧ࿦ [ֶश཰ɾMomentumɾDA] 116 When does label smoothing help? [ݚڀഎܠ]

    - Label Smoothing(LS)͸ɺNN͕աֶश͢ΔͷΛ๷͗ɺը૾෼ྨɺݴޠ຋༁ɺԻ੠ೝࣝΛؚΉଟ͘ͷλεΫͰ༻͍ΒΕΔ - ͔͠͠ͳ͕ΒɺLabel Smoothing ͸ɺ·ͩे෼ʹཧղ͞Ε͍ͯͳ͍ => Penultimate Layer Representations(࠷ޙ͔Βೋ൪໨ͷLayerΛೋ࣍ݩฏ໘ʹදݱ͢Δख๏)Λ༻͍ͯղੳ Cross-Entropy Modified targets with label smoothing from "When does label smoothing help?", table. 1(ӈ),R. Müller+, NeurIPS 2019 15 Label Smoothingͷ֓೦ਤ
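For reference, the modified targets above amount to mixing the one-hot label with the uniform distribution before taking the cross-entropy. A minimal PyTorch sketch (the smoothing coefficient and the toy logits are illustrative):

```python
import torch
import torch.nn.functional as F

def smoothed_cross_entropy(logits, targets, eps=0.1):
    n_classes = logits.size(-1)
    log_probs = F.log_softmax(logits, dim=-1)
    one_hot = F.one_hot(targets, n_classes).float()
    smooth = (1 - eps) * one_hot + eps / n_classes        # y_ls = (1 - eps) * y + eps / K
    return -(smooth * log_probs).sum(dim=-1).mean()

logits = torch.randn(4, 10)
targets = torch.randint(0, 10, (4,))
print(smoothed_cross_entropy(logits, targets))            # eps=0 recovers standard cross-entropy
print(F.cross_entropy(logits, targets))                   # for comparison
```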
  110. ɹ ֶशʹؔΘΔཧ࿦ [ֶश཰ɾMomentumɾDA] 117 When does label smoothing help? [Penultimate

    Layer Representations] from "When does label smoothing help?", Slide(্),Fig. 1(ӈ),R. Müller+, NeurIPS 2019 - 3ͭͷΫϥε(k1,k2,k3)ͱͦΕʹରԠ͢Δtemplates ( )ΛબͿ - 3ͭͷtemplatesΛ઀ଓͨ͠ฏ໘ʹactivationsΛ౤Ө(ӈਤ) wK1 , wK2 , wK3 - Label SmoothingΛߦ͏ͱɺactivations͸ਖ਼͍͠Ϋϥε ͷtemplatesʹۙ͘ɺ࢒Γͷ͢΂ͯͷΫϥεͷtemplates ʹ౳͘͠཭ΕΔ͜ͱΛ֬ೝ [Logits (approximate distance to template)]
  111. ɹ ֶशʹؔΘΔཧ࿦ [ֶश཰ɾMomentumɾDA] 118 When does label smoothing help? [Implicit

    model calibration] from "When does label smoothing help?", Table. 3(ӈԼ),Fig. 2(ӈ্),R. Müller+, NeurIPS 2019 - Label Smoothing͕NNͷ༧ଌͷ ৴པ౓ΛΑΓਖ਼֬ʹද͢͜ͱͰɺ Ϟσϧͷcalibration͕޲্͢Δ͔ ݕূ(ӈਤ) - Expected calibration error (ECE) [Guo+, ICML2017]Λ༻͍ͯධՁ - Temperature Scaling(TS)ͱ͍͏؆ ୯ͳޙॲཧΛߦ͏͜ͱͰɺECEΛ ௿ݮ͠ɺωοτϫʔΫΛ calibrationͰ͖Δ͜ͱΛ࣮ূ
  112. ɹ ֶशʹؔΘΔཧ࿦ [ֶश཰ɾMomentumɾDA] 119 When does label smoothing help? [Knowledge

    Distillation] from "When does label smoothing help?", Fig. 5(ࠨ),Fig. 6(ӈ),R. Müller+, NeurIPS 2019 - ڭࢣ͕Label SmoothingͰ܇࿅͞ Ε͍ͯΔ৔߹ɺTeacherͷ൚Խੑ ೳΛ޲্ͤ͞Δ͕ɺStudent΁͸ ѱӨڹΛٴ΅͢(ӈਤ) - Label Smoothing Λ༻͍ͨ Resnet Teacher(1) - Label Smoothingͳ͠ͷTeacher ͸ɺAlexNetͷStudent΁ͷTSΛ ༻͍ͨৠཹ͕༗ޮ(3) - Label SmoothingΛ࢖༻ͨ͠ Teacher͸ৠཹ͕ѱ͘(4)ɺLabel SmoothingΛ࢖༻ͯ͠܇࿅͞Εͨ StudentʹྼΔ(2) - Label SmoothingΛ༻͍Δͱಛ௃ྔۭ͕ؒมԽ͠ɺlogitؒͷ૬ޓ৘ใྔ͕ མͪΔͨΊɺಉ͡Α͏ͳσʔλͰֶ͔͠शͰ͖ͣৠཹͷ݁Ռ͕ྼԽ͢Δ
  113. ɹ ֶशʹؔΘΔཧ࿦ [ֶश཰ɾMomentumɾDA] 120 Understanding the Role of Momentum in

    Stochastic Gradient Methods [ݚڀ֓ཁ] - heavy-ball momentum, Nesterov's accelerated gradient (NAG), and quasi-hyperbolic momentum (QHM),ͳͲͷ MomentumΛ༻͍ͨख๏͕ɺػցֶशͷ͋ΒΏΔλεΫͰ༻͍ΒΕɺMomentumͷߴ͍ੑೳ͕஌ΒΕ͍ͯΔ - Momentum͸ɺԼهͷ఺ʹ͓͍ͯ໌֬ͳཧղ͕ಘΒΕ͍ͯͳ͍ͨΊQHMͷҰൠతͳఆࣜԽΛ༻͍ͯ౷Ұతʹղੳ 1. Asymptotic convergence / ઴ۙతऩଋ ໨తؔ਺ͷऩଋ৚݅ 2. Stability region and local convergence rate / ҆ఆྖҬͱہॴऩଋ཰ Stability Region ͰͷConvergence RateΛಋग़ɺऩଋͷ଎͞ 3. Stationary analysis / ఆৗղੳ ྑ͍ղʹऩଋͰ͖Δ͔ [Problem Setup] ඍ෼Մೳͳؔ਺ͷ࠷খԽ QHM algorithm [Ma and Yarats, 2019] ※ ͸ϥϯμϜϊΠζ β=0 : SGD β=1 : no Gradient ॏཁͳ3ͭͷύϥϝʔλЋЌЗ(LR, Momentum, Grad Coefficient) ν=0 : SGD ν=1 : SHB(Stochastic Heavy Ball) β=1 ͔ͭ ν=1 : NAG(Nesterov's Acceralated Gradient) 16
  114. ɹ ֶशʹؔΘΔཧ࿦ [ֶश཰ɾMomentumɾDA] 121 Understanding the Role of Momentum in

    Stochastic Gradient Methods [1. Asymptotic convergence / ઴ۙతऩଋ(1/2)] from "Understanding the Role of Momentum in Stochastic Gradient Methods”, I. Gitman+, NeurIPS 2019 β → 0 ͷ৔߹͸SGDͱಉ༷ʹऩଋΛূ໌Մೳ
  115. ɹ ֶशʹؔΘΔཧ࿦ [ֶश཰ɾMomentumɾDA] 122 Understanding the Role of Momentum in

    Stochastic Gradient Methods [1. Asymptotic convergence / ઴ۙతऩଋ(2/2)] from "Understanding the Role of Momentum in Stochastic Gradient Methods”, I. Gitman+, NeurIPS 2019 ν x β → 1 ͕े෼ʹ஗͍৔߹͸઴ۙతऩଋΛࣔͤΔ
  116. ɹ ֶशʹؔΘΔཧ࿦ [ֶश཰ɾMomentumɾDA] 123 Understanding the Role of Momentum in

    Stochastic Gradient Methods [2. Stability region and local convergence rate / ҆ఆྖҬͱہॴऩଋ཰] from "Understanding the Role of Momentum in Stochastic Gradient Methods”,Fig. 1(্), I. Gitman+, NeurIPS 2019 νʹରͯ͠࠷దͳβΛٻΊΔɺβ͸νʹରͯ͠୯ௐݮগ αβνΛఆ਺ͱͯ͠ɺͦΕΒͷऩଋ৚݅Λಋग़ ࠷దͳαβνͷ૊Έ߹Θͤʹ͍ͭͯ βʹରͯ͠࠷దͳνΛٻΊΔɺ β͸νʹରͯ͠୯७ͳڍಈ͕ΈΒΕͳ͍ ໨తؔ਺Λೋ࣍ۙࣅ R͕খ͍͞΄Ͳऩଋ͕ૣ͍
  117. ɹ ֶशʹؔΘΔཧ࿦ [ֶश཰ɾMomentumɾDA] 124 Understanding the Role of Momentum in

    Stochastic Gradient Methods [3. Stationary analysis / ఆৗղੳ] from "Understanding the Role of Momentum in Stochastic Gradient Methods”, I. Gitman+, NeurIPS 2019 αҎ֎ͷґଘΛݟΔͨΊೋ࣍·Ͱ֦ுͯ͠TraceΛݟΔ ໨తؔ਺Λೋ࣍ۙࣅ ਖ਼نԽ͞ΕͨΞϧΰϦζϜͰͷఆৗ෼෍ͷCovߦྻ খ͍͞Ћ͕ྑ͍͜ͱʹՃ͑ͯɺೋ࣍·Ͱߟྀͨ͠৔߹ɺ େ͖ͳЌ͕ྑ͍LossΛୡ੒ɺЗͷڍಈΛଊ͑Δ͜ͱ͸͸೉͍͠ [α→0,β→1 ; ৗʹୡ੒ՄೳͳLoss·ͰԼ͕Δ] ఀཹ఺Ͱͷ࠷దղͷ෼ࢄ͸ЋΛؚΉҰ߲͕࣍ࢧ഑తͳͨΊ(্ਤ a,b)ɺֶश཰͕࠷΋Өڹ͕େ͖͍ɺЌЗͷӨڹ͕খ͍͞(্ਤa,c,d) [α→0,β→1 ; Loss্͕ྻΑΓ૿Ճ]
  118. ɹ ֶशʹؔΘΔཧ࿦ 125 ֶशʹؔΘΔཧ࿦ͷࡉ෼Խ Training Dynamics of DNN Optimizer Characteristics

    Initialization Learning RateɾMomentumɾDataAugmentation RegularizationɾNormalization Training Loop Initialization Mini-Batch SGD Training w/ Regularization
  119. ɹ ֶशʹؔΘΔཧ࿦ [ਖ਼نԽɾਖ਼ଇԽ] 126 Understanding and Improving Layer Normalization from

    "Understanding and Improving Layer Normalization”, Slide(ࠨ)Table. 1(த্)Fig. 1(தԼ)Table. 2(ӈ্)Table. 4(ӈԼ),J. Xu+, NeurIPS 2019 - Layer Normalization(LN)͸தؒ૚ͷ෼෍Λਖ਼نԽ͢Δख๏ => ޯ഑ΛΑΓ׈Β͔ʹ͠ɺֶशΛߴ଎ʹɺ൚Խੑೳ΋վળ͢Δ (TransformerͳͲ΋LNΛDefaultͷSetting strong)ͱͯ͠Δ - ͔͠͠ͳ͕ΒɺLayer Normalizationͷੑೳ͕Ͳ͜ʹىҼ͢Δͷ͔໌֬Ͱ͸ͳ͍ͨΊཧղΛਂΊΔ͜ͱΛ໨ඪ - LN͕Forward / BackwardͰͷޮՌΛ੾Γ෼͚ΔͨΊLNͷύϥϝʔλ࢖Θͳ͍LN-Simple(DetachNorm͸ఆ਺Λ࢖͏)Λݕূ => LN-Simpleͷํ͕ੑೳ͕ߴ͍͜ͱ͔ΒɺLNͰ͸BackwardͰޯ഑Λਖ਼نԽ͢Δ͜ͱͰաֶशͷϦεΫΛߴΊ͍ͯΔ(தԝਤ) - ্ه͔ΒɺBias,GainΛ৽͍͠ม׵ؔ਺ʹஔ͖׵͑ͨAdaNormΛఏҊɺLNΑΓߴ͍ੑೳΛୡ੒͢Δ͜ͱΛ֬ೝ(ӈԼද) 17 Layer Normalization ͷ֓೦ਤ
  120. ɹ ֶशʹؔΘΔཧ࿦ [ਖ਼نԽɾਖ਼ଇԽ] 127 Time Matters in Regularizing Deep Networks:

    Weight Decay and Data Augmentation Affect Early Learning Dynamics, Matter Little Near Convergence from "Time Matters in Regularizing Deep Networks: Weight Decay and Data Augmentation Affect Early Learning Dynamics, Matter Little Near Convergence”, A. Golatkar+, Fig. 3, NeurIPS 2019 - ਖ਼ଇԽ͸DNNϞσϧ͕࠷ऴతʹऩଋ͢ΔہॴతͳLoss LandscapeΛมԽͤ͞ɺҰൠԽΛվળ͢Δ΋ͷͱཧղ͞Ε͍ͯΔ - DNNͷֶशʹ͓͚Δਖ਼ଇԽͷλΠϛϯάͷॏཁੑ(ಛʹinitial transient period)ʹ͍ͭͯओு ∵ ਖ਼ଇԽΛֶशॳظ͔Βߦ͏৔߹͸ɺ్தͰਖ਼ଇԽΛ੾ͬͯ΋ޮՌ͕͋Δ͕ɺinitial transient period ޙʹਖ਼ଇԽΛద༻ ࢝͠ΊΔͱɺ൚ԽΪϟοϓ͸ਖ਼ଇԽͳ͠ͷֶशͱಉఔ౓ʹͳΔ - Weight Decay, Data Augmentation, MixupͳͲͷਖ਼ଇԽख๏ΛҟͳΔλεΫɾϞσϧͰݕূ(Լਤ) - DNNͷਖ਼ଇԽʹ͸ɺ࠷ऴతͳੑೳΛܾఆ͚ͮΔ”Critical Period”͕͋Δ͜ͱ͕ࣔࠦ (Top) (Center) (Bottom) 18
  121. ɹ ֶशʹؔΘΔཧ࿦ [ਖ਼نԽɾਖ਼ଇԽ] 128 Towards Explaining the Regularization Effect of

    Initial Large Learning Rate in Training Neural Networks from "Towards Explaining the Regularization Effect of Initial Large Learning Rate in Training Neural Networks", Y. Li+, Fig. 1, NeurIPS 2019 - LRΛ͋Δ్தͷEpoch͔ΒAnnealingͤ͞ΔLR SchedulingΛߟ͑Δ - AnnealingΛ։࢝͢ΔλΠϛϯά·Ͱ͸Train/TestͷPerformance͕ߴ͍ - Large LR͸Small LRΑΓ΋࠷ऴతͳTest Performance͕ߴ͍(Լਤ) [ݚڀഎܠ] ←͜ͷ࿦จͰ͸͜ͷݱ৅ͷཧ༝Λ୳Δ 19
  122. ɹ ֶशʹؔΘΔཧ࿦ [ਖ਼نԽɾਖ਼ଇԽ] 129 Towards Explaining the Regularization Effect of

    Initial Large Learning Rate in Training Neural Networks Large LR͸Small LRΑΓ΋࠷ऴతͳTest Performance͕ߴ͍ => ҟͳΔλΠϓͷύλʔϯΛֶश͢Δॱং͕ॏཁ ͱԾઆΛཱͯͨ [ݚڀԾઆ] - Small LR ͸ Hard-to-Fit ͳ ”Watermask”Λૉૣ͘هԱ͠ɺଞͷύλʔϯΛແࢹͯ͠ҰൠԽΛ્֐͢Δ - Large LRʴAnnealing ͸࠷ॳʹ Easy-to-Fit ͳύλʔϯΛֶश͢Δ => Hard-to-Fit ͳύλʔϯ͸ Annealing ޙʹֶ͔͠श͠ͳ͍ => ࠷ऴతʹશͯͷύλʔϯΛֶश͠ɺ൚Խ͢Δ - Large LR ͔Βͷ SGD Noise ͕ਖ਼ଇԽͷݪҼ - ඇತͰ͋Δ͜ͱ͕ॏཁɿತͷ໰୊Ͱ͸ɺSmallLR / Large LR Schedule ͕ಉ͡ղΛٻΊΔ ͜ͷݱ৅Λઆ໌͢ΔͨΊɺLarge LR + Annealing Λߦͬͨ2૚ωοτϫʔΫ͕ɺ ࠷ॳ͔ΒSmall LRͰֶशͨ͠ಉ͡NNΑΓ΋൚Խੑೳ͕ߴ͍͜ͱΛূ໌Ͱ͖ΔΑ͏ͳઃఆΛߟҊ [Contribution]
  123. ɹ ֶशʹؔΘΔཧ࿦ [ਖ਼نԽɾਖ਼ଇԽ] 130 Towards Explaining the Regularization Effect of

    Initial Large Learning Rate in Training Neural Networks from "Towards Explaining the Regularization Effect of Initial Large Learning Rate in Training Neural Networks", Y. Li+, NeurIPS 2019 [࣮ݧઃఆ] ֶशσʔλ͸Լهͷ2ύλʔϯ ֶशσʔλΛ1,2,3 ʹ3෼ׂ (ͦΕͧΕֶशσʔλͷ20%, 20%, 60%) hard-to-generalize ͔ͭ easy-to-fit => ΄΅ઢܗ෼཭Մೳͳύλʔϯ easy-to-generalize ͔ͭ hard-to-fit => ΫϥελʔԽ͞Ε͍ͯΔ͕ઢܗ෼཭͸ෆՄೳ [ύλʔϯA] [ύλʔϯB] [σʔλάϧʔϓ1] [σʔλάϧʔϓ2] [σʔλάϧʔϓ3] ύλʔϯAͷΈ ύλʔϯBͷΈ ύλʔϯA, B྆ํ • Small LR͸ύλʔϯBΛ͙͢ʹهԱ͠ɺύλʔϯAΛແࢹ͢Δ • Large LR͸࠷ॳʹύλʔϯAΛֶशɺSGD Noise͸ Annealingޙ·ͰύλʔϯBͷֶशΛ๦͛Δ
  124. ɹ ֶशʹؔΘΔཧ࿦ [ਖ਼نԽɾਖ਼ଇԽ] 131 Towards Explaining the Regularization Effect of

    Initial Large Learning Rate in Training Neural Networks from "Towards Explaining the Regularization Effect of Initial Large Learning Rate in Training Neural Networks”, Fig. 4, Y. Li+, NeurIPS 2019 [࣮ݧʹΑΔݕূ] CIFAR10 ʹɺexample’sͷΫϥεΛࣔ͢ύονΛ௥Ճ => ઢܗ෼཭Ͱ͖ͳ͍ύον(Լਤ) • Small LR͸ύονΛهԱ͠ɺ࢒Γͷը૾Λແࢹ͢Δ • Large Initial LR͸ύονΛແࢹ͠ɺAnnealingޙʹֶ͔͠श͠ͳ͍
  125. Overview of This Material — Theory of generalization and global convergence: over-parameterized DNNs, the winning (lottery) ticket hypothesis, etc. / Theory of training:

    dynamics of SGD training, regularization, normalization, initialization, optimizers, learning rates, etc. / Adversarial Examples, the large-batch problem, etc. 132 Understanding the training dynamics of deep learning and demystifying its black-box behavior / New problems and their solutions
    Summary of this chapter
    • The teacher-student training dynamics used in Noisy Student (a SoTA image-recognition model) and the asymptotics toward Gaussian processes were also studied in the over-parameterized regime; in addition, label smoothing was shown to help train the teacher but to hurt the student after distillation
    • On optimizers, whereas last year the focus was on the effectiveness of plain SGD, at NeurIPS 2019 the effectiveness of acceleration, preconditioning, and EMA was also discussed through metrics such as the noisy quadratic model (NQM) and the critical batch size; work focusing on robustness to hyperparameters also drew attention
    • On initialization, meta-learning-style problem settings and effective initializations for specific settings such as ResNet and Weight Normalization were proposed from the viewpoints of input/output norm control and curvature; the choice of initialization distribution, however, still needs further study
    • The LR plays a key role in SGD training but is costly to tune, so work was presented on automating its tuning, on proving the validity of high-performing LR schedules, on the relation between batch size and LR (both in theory and through very large experiments), and on understanding momentum and its related parameters
    • On regularization, one study showed that the timing at which a regularizer is applied matters, another argued that the LR schedule itself acts as a regularizer (a new direction), and another showed that the effect of normalization differs between the forward and backward passes and used this to improve it
  126. Overview of This Material 133 Theory of generalization and global convergence: over-parameterized DNNs, the winning (lottery) ticket hypothesis, etc.

    Theory of training: dynamics of SGD training, regularization, normalization, initialization, optimizers, learning rates, etc. / Adversarial Examples, the large-batch problem, etc. / Understanding the training dynamics of deep learning and demystifying its black-box behavior / New problems and their solutions
  127. New Problems and Their Solutions [AEs] 134 Paper list (4/5)

    • Adversarial training for free!
    • Detecting Overfitting via Adversarial Examples
    • Convergence of Adversarial Training in Overparametrized Neural Networks
    • Adversarial Robustness through Local Linearization
    Adversarial Examples — Note
    • This chapter does not explain Adversarial Examples from scratch, and the slides alone do not constitute the full talk
    • For readers who want to learn about AEs from the basics, the following material is recommended (reposted with the author's permission):
    Yoshihiro Fukuhara, CVPR2019 comprehensive survey session, Trends in the Adversarial Examples field, 2019 https://www.slideshare.net/cvpaperchallenge/adversarial-examples-173590674
  128. New Problems and Their Solutions [AEs] 135 Adversarial training for free!

    from "Adversarial training for free!", Table 4 (lower left), Fig. 3 (right), A. Shafahi+, NeurIPS 2019
    - Adversarial training is the standard defense against AEs, but on large-scale problems such as ImageNet its computational cost is high => in particular, generating strong AEs is the bottleneck
    - The gradient information already computed when updating the model parameters is reused, cutting the cost of generating strong AEs
    - On CIFAR-10 and CIFAR-100 it reaches robustness comparable to PGD (Projected Gradient Descent) adversarial training at roughly the cost of standard training (lower-left table; the lower-right figure is a 3-D plot of the loss surface)
    - 7-30x faster than other adversarial training methods (a minimal sketch of the gradient-reuse loop follows below) [20]
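    A minimal sketch, assuming plain PyTorch, of the gradient-reuse ("minibatch replay") idea described above: each minibatch is replayed m times, and the input gradient produced by the ordinary backward pass is reused to update the perturbation at essentially no extra cost. The replay count m, the step size, and the clipping range are assumptions rather than the paper's exact hyperparameters.

```python
import torch
import torch.nn.functional as F

def free_adversarial_epoch(model, loader, optimizer, eps=8 / 255, m=4):
    delta = None                                   # perturbation persists across minibatches
    for x, y in loader:
        if delta is None or delta.shape != x.shape:
            delta = torch.zeros_like(x)
        for _ in range(m):                         # minibatch replay
            adv = (x + delta).clamp(0, 1).requires_grad_(True)
            loss = F.cross_entropy(model(adv), y)
            optimizer.zero_grad()
            loss.backward()                        # yields both parameter and input gradients
            optimizer.step()                       # update model parameters as usual
            grad = adv.grad.detach()
            delta = (delta + eps * grad.sign()).clamp(-eps, eps)   # reuse the input gradient
```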
  129. New Problems and Their Solutions [AEs] 136 Detecting Overfitting via Adversarial Examples

    from "Detecting Overfitting via Adversarial Examples", Fig. 7, R. Werpachowski+, NeurIPS 2019
    - Because test sets are reused over and over, the test error reported for typical deep-learning benchmarks is of questionable reliability
    - We would like to check whether a trained model has overfit to the test set => a fresh test set from the same distribution is not available, and using another dataset raises distribution-shift issues => the paper proposes a method that performs this check using only the original dataset
    - It uses a new unbiased error estimate based on AEs generated from the test data together with importance weighting => when this estimate differs sufficiently from the original test error, overfitting can be detected (figure on the right)
    - However, overfitting with respect to the training data can be detected this way, while overfitting with respect to the test data cannot [21]
  130. New Problems and Their Solutions [AEs] 137 Convergence of Adversarial Training in Overparametrized Neural Networks

    from "Convergence of Adversarial Training in Overparametrized Neural Networks", R. Gao+, NeurIPS 2019
    - Adversarial training is a heuristic for robust optimization that alternates minimization and maximization, training the network to be robust to a pre-defined class of perturbations
    - This work uses the NTK (neural tangent kernel) to prove convergence of adversarial training
    - The surrogate loss against the attack algorithm converges to within epsilon of the optimal robust loss => the optimal robust loss is itself shown to be close to zero, so adversarial training does yield a robust classifier
    - It is proven that robust interpolation requires larger model capacity, i.e., adversarial training needs wider networks [22]
  131. New Problems and Their Solutions [AEs] 138 Adversarial Robustness through Local Linearization

    from "Adversarial Robustness through Local Linearization", Table 2 (left), Fig. 4 (right), C. Qin+, NeurIPS 2019
    - As model size and input dimension grow, the computational cost of adversarial training increases sharply
    - To keep the training cost down one can train against weak AEs => this gives robustness against weak attacks but fails against strong ones => the failure is caused by a phenomenon called gradient obfuscation
    - The paper introduces a new regularizer (LLR: Local Linearity Regularization) that encourages the loss to behave linearly in a neighborhood of the training data => this penalizes gradient obfuscation while improving robustness (lower-left table, lower-right figure)
    - Training with LLR is faster than conventional adversarial training and is shown to avoid gradient obfuscation (a simplified sketch of the penalty follows below) [23]
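    A simplified sketch, assuming plain PyTorch, of the local-linearity penalty described above. The paper maximizes the violation g(delta, x) over the epsilon-ball with a few projected-gradient steps and adds a second, mu-weighted term; this sketch instead samples a single random delta and keeps only the lambda-weighted violation, so it is an illustration of the idea rather than the paper's loss.

```python
import torch
import torch.nn.functional as F

def llr_loss(model, x, y, eps=8 / 255, lam=4.0):
    x = x.clone().requires_grad_(True)
    loss = F.cross_entropy(model(x), y)
    # Input gradient, kept in the graph so the penalty is differentiable w.r.t. the model.
    grad_x = torch.autograd.grad(loss, x, create_graph=True)[0]

    delta = (torch.rand_like(x) * 2 - 1) * eps            # one random point in the eps-ball
    loss_pert = F.cross_entropy(model(x + delta), y)
    linear_approx = loss + (delta * grad_x).sum()
    violation = (loss_pert - linear_approx).abs()          # g(delta, x): local-linearity violation

    return loss + lam * violation
```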
  132. New Problems and Their Solutions [Others] 139 Paper list (5/5)

    • GPipe: Efficient Training of Giant Neural Networks using Pipeline Parallelism
    • AutoPrune: Automatic Network Pruning by Regularizing Auxiliary Parameters
    Others
  133. New Problems and Their Solutions [Others] 140 AutoPrune: Automatic Network Pruning by Regularizing Auxiliary Parameters

    [Overview] Pruning
    • When a deep-learning model has to be deployed on a resource-constrained device => the model's redundancy has to be reduced => network pruning
    • The pruning problem can be formulated roughly as minimizing the training loss under a budget on $\|W\|_0$, the number of non-zero weights, e.g. $\min_W \sum_i L\big(f(x_i; W), y_i\big)\ \text{s.t.}\ \|W\|_0 \le k$
    • Regularizing the weights directly makes training sensitive to the hyperparameters and can also lead to instability
    • Conventional pruning methods that regularize the weights directly therefore depend heavily on hyperparameter choices => AutoPrune instead introduces auxiliary parameters and prunes by optimizing them => as a result the method is less sensitive to hyperparameters and requires less prior knowledge
    [24: conceptual diagram of pruning]
  134. New Problems and Their Solutions [Others] 141 AutoPrune: Automatic Network Pruning by Regularizing Auxiliary Parameters

    from "AutoPrune: Automatic Network Pruning by Regularizing Auxiliary Parameters", X. Xiao+, NeurIPS 2019
    [Proposed method]
    Introduce learnable auxiliary parameters $M$ that estimate the indicator function $h$:
    $$h_{i,j} = \begin{cases} 0, & \text{if } w_{i,j} \text{ is pruned};\\ 1, & \text{otherwise}, \end{cases} \qquad h_{i,j} = h(m_{i,j}).$$
    Here $R$ denotes a regularization term and $W \odot h(M)$ the weight matrix after pruning. By splitting the training set into $X_{\mathrm{train}}$ and $X_{\mathrm{val}}$, the single-loss minimization is reformulated as the alternating minimization
    $$\min_W L_1 = \min_W \sum_{i=1}^{N} L\big(f(x_i, W \odot h(M)), y_i\big) + \lambda R(W), \quad x_i \in X_{\mathrm{train}},$$
    $$\min_M L_2 = \min_M \sum_{i=1}^{N} L\big(f(x_i, W \odot h(M)), y_i\big) + \mu R(h(M)), \quad x_i \in X_{\mathrm{val}}.$$
    The update rule for the auxiliary parameters $M$ is then
    $$m_{i,j} := m_{i,j} - \eta\,\frac{\partial L_{\mathrm{acc}}}{\partial t_{i,j}}\,\mathrm{sgn}(w_{i,j})\,\frac{\partial h(m_{i,j})}{\partial m_{i,j}}, \qquad L_{\mathrm{acc}} = L\big(f(x_i, W \odot h(M)), y_i\big), \quad t_{i,j} = w_{i,j} \odot h(m_{i,j}),$$
    i.e., the auxiliary parameters are updated so as to agree with the weight magnitudes, the weight changes, and the gradient direction (as in binarized neural networks). [Conceptual diagram of the update using the auxiliary parameters]
  135. New Problems and Their Solutions [Others] 142 AutoPrune: Automatic Network Pruning by Regularizing Auxiliary Parameters

    [Advantages of the proposed method]
    • The sensitivity of an auxiliary parameter is inversely proportional to the magnitude of the corresponding weight ($\partial L_{\mathrm{acc}}/\partial m_{i,j} \propto 1/f(|w_{i,j}|)$) → being insensitive to small changes of large weights improves robustness
    • The gradient direction of an auxiliary parameter matches the gradient direction of the corresponding weight norm ($\mathrm{sgn}(\partial L_2/\partial m_{i,j}) = \mathrm{sgn}(\partial L_1/\partial |w_{i,j}|)$) → since the auxiliary parameter $m_{i,j}$ tracks the magnitude of $w_{i,j}$, $m_{i,j}$ also approaches 0 as $|w_{i,j}|$ approaches 0, which accelerates pruning
    • An increase in loss caused by a wrong pruning decision can be corrected (recoverability) → because the pruned weights themselves are kept, a pruned weight can be restored when the loss increases, preventing the degradation
    • Less affected by hyperparameters (reduced sensitivity)
    (A minimal sketch of the gated, mask-based update appears below.)
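    A rough sketch, assuming plain PyTorch and a single linear layer for brevity, of the masked forward pass $W \odot h(M)$ with learnable auxiliary parameters. The class name `PrunableLinear` and its fields are hypothetical, and the sigmoid straight-through surrogate used for the gate's gradient is a stand-in for the paper's exact gradient correction.

```python
import torch
import torch.nn as nn

class PrunableLinear(nn.Module):
    def __init__(self, d_in, d_out):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(d_out, d_in) * 0.01)
        self.aux = nn.Parameter(torch.ones(d_out, d_in))       # auxiliary parameters m

    def gate(self):
        hard = (self.aux > 0).float()                          # h(m): 0 = pruned, 1 = kept
        soft = torch.sigmoid(self.aux)                         # smooth surrogate for gradients
        return soft + (hard - soft).detach()                   # straight-through estimator

    def forward(self, x):
        return x @ (self.weight * self.gate()).t()             # uses W ⊙ h(M); pruned weights are kept
```

    In AutoPrune's alternating scheme, the weights would be updated on the training split and the auxiliary parameters on a validation split with an extra sparsity regularizer $R(h(M))$, as in the formulation above.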
  136. New Problems and Their Solutions [Others] 143 AutoPrune: Automatic Network Pruning by Regularizing Auxiliary Parameters

    from "AutoPrune: Automatic Network Pruning by Regularizing Auxiliary Parameters", Table 1 (left), Table 2 (upper right), Table 3 (lower right), X. Xiao+, NeurIPS 2019
    [Experimental results] Neuron-pruning experiments with AutoPrune and existing methods on three datasets: MNIST (Table 1), CIFAR-10 (Table 2), and ImageNet (Table 3)
    NCR: Neuron Compression Rate
  137. New Problems and Their Solutions [Others] 144 AutoPrune: Automatic Network Pruning by Regularizing Auxiliary Parameters

    from "AutoPrune: Automatic Network Pruning by Regularizing Auxiliary Parameters", Fig. 2, X. Xiao+, NeurIPS 2019
    [Experimental results] Neuron-pruning experiments with AutoPrune on a VGG-like network trained on CIFAR => the results do not depend on the choice of learning rate or of the STE (Straight-Through Estimator*)
    CR: Compression Rate
    * STE: a way of estimating the gradient that uses only the gradients whose absolute value is at most 1 for the update
    Experiment on how sensitive AutoPrune-based compression is to its hyperparameters
  138. New Problems and Their Solutions [Others] 145 GPipe: Efficient Training of Giant Neural Networks using Pipeline Parallelism

    from "GPipe: Efficient Training of Giant Neural Networks using Pipeline Parallelism", Fig. 1 (left), Fig. 2 (center), Table 4 (right), Y. Huang+, NeurIPS 2019
    - GPipe is a scalable pipeline-parallelism library (distributed deep learning via a model-parallel approach; lower-center figure)
    - It makes it possible to train giant DNNs on the order of 600 million parameters (lower-left figure)
    - Out-of-memory issues are handled by re-computation (re-materialization) rather than by swapping activations out, to reduce memory usage
    - With 8x the accelerators, a DNN 25x larger could be trained
    - Nearly linear speed-up is achieved without changing the model parameters (3.5x speed-up with 4x accelerators; lower-right figure)
    - An AmoebaNet model with 557 million parameters was trained, reaching a then state-of-the-art 84.3% top-1 / 97.0% top-5 accuracy on ImageNet-1K => fine-tuned versions of this model were evaluated on 7 datasets and achieved the best performance on all of them (a toy sketch of the micro-batching idea follows below) [25]
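    The following is not GPipe itself, just a toy sketch, assuming plain PyTorch, of the micro-batching / gradient-accumulation half of the idea: the model is split into two partitions (`stage0`, `stage1` are hypothetical names) that may live on different devices, and each minibatch is split into micro-batches. Real GPipe additionally overlaps micro-batches across accelerators in a pipeline and re-computes activations (re-materialization) to save memory.

```python
import torch
import torch.nn.functional as F

def pipelined_step(stage0, stage1, optimizer, x, y, num_microbatches=4):
    optimizer.zero_grad()
    stage1_device = next(stage1.parameters()).device
    for xb, yb in zip(x.chunk(num_microbatches), y.chunk(num_microbatches)):
        h = stage0(xb)                                   # first partition (e.g., device 0)
        out = stage1(h.to(stage1_device))                # second partition (e.g., device 1)
        loss = F.cross_entropy(out, yb.to(out.device)) / num_microbatches
        loss.backward()                                  # gradients accumulate across micro-batches
    optimizer.step()
```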
  139. Overview of This Material 146 Theory of generalization and global convergence: over-parameterized DNNs, the winning (lottery) ticket hypothesis, etc.

    Theory of training: dynamics of SGD training, regularization, normalization, initialization, optimizers, learning rates, etc. / Adversarial Examples, the large-batch problem, etc. / Understanding the training dynamics of deep learning and demystifying its black-box behavior / New problems and their solutions
    Summary of this chapter
    • For the high training cost of adversarial examples, two approaches to speeding up adversarial training were presented — (1) reusing gradient information and (2) preventing gradient obfuscation with a regularizer — alongside work that used the NTK to establish global convergence of adversarial training and a new direction that uses AEs to detect overfitting to the training data
    • A pipeline-parallelism library for model-parallel training of very large models on very large data was presented, as was a method that automates pruning (model compression) so that large models can be deployed on edge devices
  140. Finally 147 This deck briefly introduced research in areas (1) and (5) below; for the details of any paper, please see SlidesLive and the papers themselves
    Summary of the NeurIPS 2019 trends
    • Many papers exploring the theory behind deep learning were presented

    1. Understanding the training dynamics of deep learning and demystifying its black-box behavior (Main focus in this deck)
      • Bayesian deep learning
      • Graph neural networks
      • Convex optimization
    2. New approaches to deep learning
    3. Bringing neuroscience into machine learning
    4. Reinforcement learning
    5. New problems and their solutions (Sub-focus in this deck)
      • Adversarial examples (AEs)
      • Meta-learning
      • Generative models
  141. 148 Acknowledgements
    • In preparing these slides I received guidance from my advisor, Assoc. Prof. Rio Yokota of the Global Scientific Information and Computing Center, Tokyo Institute of Technology, and from Assoc. Prof. Taiji Suzuki of the Deep Learning Theory Team, RIKEN Center for Advanced Intelligence Project; I am deeply grateful to both. I also sincerely thank Dr. అ ෋࢜༤ of the Central Research Institute of Electric Power Industry (CRIEPI) for valuable advice on how to organize this material.
    • Finally, I thank Mr. Ṙ໦༔࢜ (undergraduate student, School of Fundamental Science and Engineering, Waseda University) and Mr. ॴാوେ (master's student, School of Computing, Tokyo Institute of Technology) for the many discussions about the surveyed papers.
  142. 149 Reference (1/4) — NeurIPS 2019 papers (1/3)
    • [H. He+, 2019] Asymmetric Valleys: Beyond Sharp and Flat Local Minima
    • [A. Morcos+, 2019] One ticket to win them all: generalizing lottery ticket initializations across datasets and optimizers
    • [R. Karakida+, 2019] The Normalization Method for Alleviating Pathological Sharpness in Wide Neural Networks
    • [V. Nagarajan+, 2019] Uniform convergence may be unable to explain generalization in deep learning
    • [Y. Cao+, 2019] Generalization Bounds of Stochastic Gradient Descent for Wide and Deep Neural Networks
    • [C. Yun+, 2019] Are deep ResNets provably better than linear predictors?
    • [C. Yun+, 2019] Small ReLU networks are powerful memorizers: a tight analysis of memorization capacity
    • [D. Kalimeris+, 2019] SGD on Neural Networks Learns Functions of Increasing Complexity
    • [J. Lee+, 2019] Wide Neural Networks of Any Depth Evolve as Linear Models Under Gradient Descent
    • [T. Nguyen+, 2019] First Exit Time Analysis of Stochastic Gradient Descent Under Heavy-Tailed Gradient Noise
    • [Z. Allen-Zhu+, 2019] Can SGD Learn Recurrent Neural Networks with Provable Generalization?
    • [S. Goldt+, 2019] Dynamics of stochastic gradient descent for two-layer neural networks in the teacher-student setup
    • [G. Zhang+, 2019] Which Algorithmic Choices Matter at Which Batch Sizes? Insights From a Noisy Quadratic Model
  143. 150 Reference (2/4) — NeurIPS 2019 papers (2/3)
    • [G. Zhang+, 2019] Fast Convergence of Natural Gradient Descent for Over-Parameterized Neural Networks
    • [M. Zhang+, 2019] Lookahead Optimizer: k steps forward, 1 step back
    • [N. Muecke+, 2019] Beating SGD Saturation with Tail-Averaging and Minibatching
    • [B. Chen+, 2019] Fast and Accurate Stochastic Gradient Estimation
    • [Y. Dauphin+, 2019] MetaInit: Initializing learning by learning to initialize
    • [D. Arpit+, 2019] How to Initialize your Network? Robust Initialization for WeightNorm & ResNets
    • [S. Vaswani+, 2019] Painless Stochastic Gradient: Interpolation, Line-Search, and Convergence Rates
    • [H. Lang+, 2019] Using Statistics to Automate Stochastic Optimization
    • [Z. Yuan+, 2019] Stagewise Training Accelerates Convergence of Testing Error Over SGD
    • [F. He+, 2019] Control Batch Size and Learning Rate to Generalize Well: Theoretical and Empirical Evidence
    • [R. Müller+, 2019] When does label smoothing help?
    • [I. Gitman+, 2019] Understanding the Role of Momentum in Stochastic Gradient Methods
    • [R. Ge+, 2019] The Step Decay Schedule: A Near Optimal, Geometrically Decaying Learning Rate Procedure For Least Squares
    • [J. Xu+, 2019] Understanding and Improving Layer Normalization
    • [V. Chiley+, 2019] Online Normalization for Training Neural Networks
    • [S. Arora+, 2019] Implicit Regularization in Deep Matrix Factorization
    • [A. Golatkar+, 2019] Time Matters in Regularizing Deep Networks: Weight Decay and Data Augmentation Affect Early Learning Dynamics, Matter Little Near Convergence
  144. 151 Reference (3/4) — NeurIPS 2019 papers (3/3)
    • [Y. Li+, 2019] Towards Explaining the Regularization Effect of Initial Large Learning Rate in Training Neural Networks
    • [A. Shafahi+, 2019] Adversarial training for free!
    • [R. Werpachowski+, 2019] Detecting Overfitting via Adversarial Examples
    • [Z. Li+, 2019] Learning from brains how to regularize machines
    • [C. Laidlaw+, 2019] Functional Adversarial Attacks
    • [M. Naseer+, 2019] Cross-Domain Transferability of Adversarial Perturbations
    • [R. Gao+, 2019] Convergence of Adversarial Training in Overparametrized Neural Networks
    • [C. Qin+, 2019] Adversarial Robustness through Local Linearization
    • [C. Beckham+, 2019] On Adversarial Mixup Resynthesis
    • [Y. Huang+, 2019] GPipe: Efficient Training of Giant Neural Networks using Pipeline Parallelism
    • [N. Ivkin+, 2019] Communication-efficient Distributed SGD with Sketching
    • [K. Cao+, 2019] Learning Imbalanced Datasets with Label-Distribution-Aware Margin Loss
    • [S. Srinivas+, 2019] Full-Gradient Representation for Neural Network Visualization
    • [X. Xiao+, 2019] AutoPrune: Automatic Network Pruning by Regularizing Auxiliary Parameters
    • [L. Chizat+, 2019] On Lazy Training in Differentiable Programming
  145. 152 Reference (4/4)
    Slides
    • Taiji Suzuki, IBIS2012 Tutorial, Statistical Learning Theory Tutorial: From Basics to Applications, 2012
    • Sanjeev Arora, ICML2018 Tutorial, Toward Theoretical Understanding of Deep Learning, 2018
    • Taiji Suzuki, Osaka University intensive lecture, Mathematics of Deep Learning, 2019
    • Atsushi Nitanda, IBIS2019 Tutorial, Global Convergence of Learning Algorithms and Inductive Bias, 2019
    • Masaaki Imaizumi, IBIS2019 Tutorial, Approximation Performance and Complexity Analysis for the Generalization Error of Deep Learning, 2019
    • Ioannis Mitliagkas, Course at Mila, winter 2019, Theoretical principles for deep learning, 2019
    • Yoshihiro Fukuhara, CVPR2019 comprehensive survey session, Trends in the Adversarial Examples field, 2019
    • Matt Johnson, Daniel Duckworth, K-FAC and Natural Gradients, 2017
    Web
    • https://huyenchip.com/2019/12/18/key-trends-neurips-2019.html
    • https://medium.com/@NeurIPSConf/what-we-learned-from-neurips-2019-data-111ab996462c
    • https://rajatvd.github.io/NTK/
    Other papers
    • [Xavier+, 2020] Xavier Bouthillier, Gaël Varoquaux. Survey of machine-learning experimental methods at NeurIPS2019 and ICLR2020. Research Report, Inria Saclay Ile de France, 2020