Mind): Mastering the game of Go with deep neural networks and tree search, Nature, 529, 484-489, 2016]
[He, Gkioxari, Dollár, Girshick: Mask R-CNN, ICCV2017]
[Brown et al. “Language Models are Few-Shot Learners”, NeurIPS2020]
[Alammar: How GPT3 Works - Visualizations and Animations, https://jalammar.github.io/how-gpt3-works-visualizations-animations/]
Performance of few-shot learning against model size
Learning efficiency of few-shot learning
Large language model
Generative models (diffusion models)
Jason Allen, “Théâtre D'opéra Spatial”, generated by Midjourney. Colorado State Fair’s fine art competition, 1st prize in digital art category.
[ChatGPT. OpenAI2022]
[Ho, Jain, Abbeel: Denoising Diffusion Probabilistic Models. 2020]
Stable diffusion, 2022.
High accuracy across a wide variety of tasks. Why?
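Ali Rahimi’s talk at NIPS2017 (test of time award). “Random features for large-scale kernel methods.”
• People do not want to use a method when they cannot tell what is happening inside it.
• Corporate accountability: turning deep learning into a white box.
• Understanding the underlying principles
• How can “good” learning be achieved? → Development of new methods
The need for theory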
Forward process
Backward process
Both are (nearly) minimax optimal [Yang & Barron, 1999; Niles-Weed & Berthet, 2022].
Empirical score matching estimator:
(for any 𝛿 > 0).
Theorem: Let 𝑌 be the r.v. generated by the backward process w.r.t. ŝ, then
(The estimator for the 𝑊1 distance requires some modification.)
(𝑠: smoothness of the density function)
[Kazusato Oko, Shunta Akiyama, Taiji Suzuki: Diffusion Models are Minimax Optimal Distribution Estimators. ICML2023]
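The forward and backward processes can be illustrated in one dimension, where everything is available in closed form. The sketch below is not the estimator analyzed in the cited paper. It assumes an Ornstein-Uhlenbeck forward process dX_t = -X_t dt + sqrt(2) dB_t and a Gaussian data distribution N(m0, s0^2), and it plugs the exact score into the backward process instead of a learned ŝ. All function names and parameter values are illustrative assumptions.

```python
import numpy as np


def marginal_mean_var(t, m0, s0):
    """Mean and variance of X_t when X_0 ~ N(m0, s0^2) under dX = -X dt + sqrt(2) dB."""
    decay = np.exp(-t)
    return m0 * decay, (s0 ** 2) * decay ** 2 + 1.0 - decay ** 2


def exact_score(x, t, m0, s0):
    """Score (gradient of the log-density) of the Gaussian marginal at forward time t."""
    mean, var = marginal_mean_var(t, m0, s0)
    return -(x - mean) / var


def sample_backward(n_samples, m0=2.0, s0=0.5, horizon=5.0, n_steps=500, seed=0):
    """Euler-Maruyama discretization of the backward (time-reversed) process."""
    rng = np.random.default_rng(seed)
    h = horizon / n_steps
    # Start from N(0, 1), the stationary law of the forward process.
    y = rng.standard_normal(n_samples)
    for k in range(n_steps):
        t = horizon - k * h  # forward time matching backward step k
        drift = y + 2.0 * exact_score(y, t, m0, s0)
        y = y + h * drift + np.sqrt(2.0 * h) * rng.standard_normal(n_samples)
    return y


samples = sample_backward(20000)
print("sample mean:", samples.mean(), "sample std:", samples.std())
```

With the exact score, the generated samples should be close to the data distribution N(2, 0.5^2); replacing `exact_score` with an estimated score is where the statistical error discussed on the slide enters.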
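from a kernel perspective. AISTATS2018]
[Li, Sun, Liu, Suzuki and Huang: Understanding of Generalization in Deep Learning via Tensor Methods. AISTATS2020]
[Suzuki, Abe, Nishimura: Compression based bound for non-compressed network: unified generalization error analysis of large compressible deep neural network, ICLR2020]
[Suzuki et al.: Spectral pruning: Compressing deep neural networks via spectral analysis and its generalization error. IJCAI-PRICAI 2020]
[Figure labels: original size (large), compressible size (small), effective degrees of freedom]
[Empirical observation] Networks obtained by actual training are easy to compress.
[Figure labels: singular values of the weight matrix (fast decay), eigenvalues of the covariance matrix (fast decay)]
A network is easy to compress if the following decay quickly:
• the eigenvalue distribution of the covariance matrix of the intermediate layers
• the singular value distribution of the weight matrices of the intermediate layers
The singular values of both the covariance matrix and the weight matrix decay quickly → small statistical degrees of freedom (AIC, Mallows’ Cp)
Theory of kernel methods (a kernel is an infinite-dimensional model to begin with)
(Details on the next page)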
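The link between spectral decay and a small effective size can be made concrete with the degrees of freedom used in kernel ridge regression, N(λ) = Σ_j μ_j / (μ_j + λ), where μ_j are the eigenvalues of a covariance matrix. The sketch below is an illustration under assumptions: the two spectra and the value of λ are hypothetical and are not taken from the cited experiments.

```python
import numpy as np


def degrees_of_freedom(eigenvalues, lam):
    """Effective degrees of freedom N(lam) = sum_j mu_j / (mu_j + lam)."""
    eigenvalues = np.asarray(eigenvalues, dtype=float)
    return float(np.sum(eigenvalues / (eigenvalues + lam)))


width = 1000                                  # hypothetical layer width
index = np.arange(1, width + 1, dtype=float)
fast_decay = index ** -2.0                    # hypothetical spectrum: mu_j = j^{-2}
no_decay = np.ones(width)                     # hypothetical spectrum: mu_j = 1

lam = 1e-3
dof_fast = degrees_of_freedom(fast_decay, lam)
dof_flat = degrees_of_freedom(no_decay, lam)
print("fast decay:", dof_fast)
print("no decay:  ", dof_flat)
```

Both layers have 1000 units, but the fast-decaying spectrum yields a degrees of freedom far below the width, which is the sense in which such a layer is compressible.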
features regression: Precise asymptotics and double descent curve." arXiv preprint arXiv:1908.05355 (2019)]
2-layer neural network (how many neurons are used). Sample size = # of features
[Xu and Hsu: On the number of variables to use in principal component regression. NeurIPS2019.]
Principal component regression (how many principal components are used). The population covariance is assumed to be known, and its principal components are used. Sample size = # of features
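The peak at "sample size = # of features" can be reproduced in a small simulation. The sketch below is not the setting of either cited paper: it assumes isotropic Gaussian features, a linear ground truth, and a least-squares fit on the first p features (the minimum-norm interpolator once p exceeds the sample size). All names and parameter values are illustrative.

```python
import numpy as np


def min_norm_test_error(n_train, n_used, n_total=160, noise_std=0.5, n_test=2000, rng=None):
    """Excess test error of least squares restricted to the first n_used features."""
    rng = np.random.default_rng() if rng is None else rng
    beta = rng.standard_normal(n_total) / np.sqrt(n_total)  # true coefficients, norm about 1
    x_train = rng.standard_normal((n_train, n_total))
    x_test = rng.standard_normal((n_test, n_total))
    y_train = x_train @ beta + noise_std * rng.standard_normal(n_train)
    y_test = x_test @ beta  # noiseless target, so the result is the excess risk
    # pinv gives ordinary least squares when n_used < n_train and the
    # minimum-norm interpolating solution when n_used >= n_train.
    coef = np.linalg.pinv(x_train[:, :n_used]) @ y_train
    return float(np.mean((x_test[:, :n_used] @ coef - y_test) ** 2))


def average_curve(n_train=40, sizes=(10, 20, 40, 80, 160), n_trials=30, seed=0):
    """Average the test error over independent trials for each number of features."""
    rng = np.random.default_rng(seed)
    return {
        p: float(np.mean([min_norm_test_error(n_train, p, rng=rng) for _ in range(n_trials)]))
        for p in sizes
    }


curve = average_curve()
for n_features, err in curve.items():
    print(n_features, err)
```

The error is expected to be largest at 40 features, which equals the sample size, and to fall again as more features are added.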
➢ Marchenko–Pastur law, Stieltjes transform
• Evaluation via concentration inequalities
➢ Upper bounds on the prediction error at finite sample size (𝑛 < ∞)
➢ Convergence rates can be evaluated (the behavior before taking 𝑛 → ∞ is evaluated).
◼ Dobriban & Wager: High-dimensional asymptotics of prediction: Ridge regression and classification. The Annals of Statistics, 46(1):247–279, 2018.
◼ Hastie et al.: Surprises in High-Dimensional Ridgeless Least Squares Interpolation, arXiv:1903.08560.
◼ Mei & Montanari: The generalization error of random features regression: Precise asymptotics and double descent curve. Communications on Pure and Applied Mathematics. arXiv:1908.05355 (2019).
◼ Belkin, Rakhlin & Tsybakov: Does data interpolation contradict statistical optimality? AISTATS2019.
◼ Bartlett, Long, Lugosi & Tsigler: Benign Overfitting in Linear Regression. PNAS, 117(48):30063–30070, 2020.
◼ Liang & Rakhlin: Just interpolate: Kernel “Ridgeless” regression can generalize. The Annals of Statistics, 48(3):1329–1347, 2020.
• CGMT (Convex Gaussian min-max Theorem)
◼ Thrampoulidis, Oymak & Hassibi: Regularized linear regression: A precise analysis of the estimation error. COLT2015.
◼ Thrampoulidis, Abbasi & Hassibi: Precise error analysis of regularized M-estimators in high dimensions. IEEE Transactions on Information Theory, vol. 64, no. 8, pp. 5592–5628, 2018.
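for the 2nd layer.
RKHS w.r.t. NTK for the 1st layer.
RKHS w.r.t. NTK for both layers.
• Training a two-layer NN through the NTK has the effect of multiple kernel learning.
• Using a multi-layer NN makes learning more robust to model misspecification.
NTK of the first layer; NTK of the second layer
Sum of the first-layer and second-layer kernels: multiple kernel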
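The statement that the NTK of a two-layer network is the sum of a first-layer kernel and a second-layer kernel can be checked numerically at finite width. The sketch below assumes the parameterization f(x) = a^T relu(W x) / sqrt(m) with random W and random signs a; this parameterization, the width, and all names are assumptions made for illustration.

```python
import numpy as np


def relu(z):
    return np.maximum(z, 0.0)


def relu_grad(z):
    return (z > 0).astype(float)


def layerwise_ntk(x, w, a):
    """Empirical NTK blocks for f(x) = a^T relu(W x) / sqrt(m); x: (n, d), w: (m, d), a: (m,)."""
    m = w.shape[0]
    pre = x @ w.T                      # pre-activations, shape (n, m)
    act = relu(pre)
    gate = relu_grad(pre) * a          # a_j * relu'(w_j . x), shape (n, m)
    k_first = (x @ x.T) * (gate @ gate.T) / m   # gradients w.r.t. first-layer weights W
    k_second = act @ act.T / m                  # gradients w.r.t. second-layer weights a
    return k_first, k_second


def full_jacobian(x, w, a):
    """Jacobian of f w.r.t. all parameters (W flattened, then a), shape (n, m*d + m)."""
    m = w.shape[0]
    pre = x @ w.T
    grad_a = relu(pre) / np.sqrt(m)
    gate = relu_grad(pre) * a
    grad_w = gate[:, :, None] * x[:, None, :] / np.sqrt(m)
    return np.concatenate([grad_w.reshape(len(x), -1), grad_a], axis=1)


rng = np.random.default_rng(0)
x = rng.standard_normal((5, 3))
w = rng.standard_normal((64, 3))
a = rng.choice([-1.0, 1.0], size=64)

k_first, k_second = layerwise_ntk(x, w, a)
jac = full_jacobian(x, w, a)
ntk_full = jac @ jac.T
print(np.allclose(k_first + k_second, ntk_full))
```

Because the parameter gradient splits into a block for W and a block for a, the Gram matrix of the full Jacobian is exactly the sum of the two layer-wise kernels, which is the multiple-kernel structure described above.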
A random walk tends to stay in flat regions.
• Criticism: the notion of “flatness” depends on the choice of coordinate system and is therefore meaningless (Dinh et al., 2017).
• Analysis via PAC-Bayes (Dziugaite, Roy, 2017)
Keskar, Mudigere, Nocedal, Smelyanskiy, Tang (2017): On large-batch training for deep learning: generalization gap and sharp minima.
$\mathcal{X}_k = (X_k^i)_{i=1}^N \sim \mu_k^{(N)}$: joint distribution of $N$ particles.
Potential of the joint distribution $\mu_k^{(N)}$ on $\mathbb{R}^{d \times N}$:
where
(Fisher divergence)
where
➢ The finite particle dynamics is the Wasserstein gradient flow that minimizes
(Approximate) Uniform log-Sobolev inequality [Chen et al. 2022]
Recall
[Chen, Ren, Wang. Uniform-in-time propagation of chaos for mean field Langevin dynamics.
For any $N$,