Foundations of Transfer Learning and Domain Adaptation

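17th Spring Meeting of the Japan Statistical Society, March 4, 2023 @ Tokyo Metropolitan University
Kota Matsui, Department of Biostatistics, Nagoya University Graduate School of Medicine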

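Table of contents
1. Introduction: formulation and difficulties of statistical machine learning
2. Basic concepts of transfer learning
3. Transfer learning methods
4. Learning theory of unsupervised domain adaptation
5. Summary
Matsui (Nagoya Univ.), Foundations of Transfer Learning, 1 / 54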

Introduction: Formulation and Difficulties of Statistical Machine Learning

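Problem setting of statistical machine learning (supervised learning)
▶ Training data $D_n = \{(x_i, y_i)\}_{i=1}^{n} \subset \mathcal{X} \times \mathcal{Y}$
  • $(x_i, y_i) \overset{\text{i.i.d.}}{\sim} P_{X \times Y}$
  • The data are assumed to be sampled independently from a single probability distribution
▶ Hypothesis $h : \mathcal{X} \to \mathcal{Y}$: a function that predicts the output from the input
▶ Loss function $\ell : \mathcal{Y} \times \mathcal{Y} \to \mathbb{R}_{\ge 0}$: the penalty for a wrong prediction
Definition 1 (expected risk)
The expected risk of a hypothesis $h$ under the data distribution $P_{X \times Y}$ is
  $R(h) := \mathbb{E}_{(X,Y) \sim P_{X \times Y}}[\ell(h(X), Y)]$
A small expected risk means that $h$ predicts well on data generated from $P_{X \times Y}$.
It therefore suffices to find a hypothesis with small expected risk in the hypothesis set $\mathcal{H}$.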

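Learning a hypothesis with minimum expected risk
Since $P_{X \times Y}$ is unknown in general, the hypothesis is learned by minimizing the empirical risk instead of the expected risk:
  $\hat{R}(h) := \frac{1}{n} \sum_{i=1}^{n} \ell(h(x_i), y_i), \qquad \hat{h} = \arg\min_{h \in \mathcal{H}} \hat{R}(h)$
Justification of empirical risk minimization (weak law of large numbers)
If $(X_i, Y_i) \overset{\text{i.i.d.}}{\sim} P_{X \times Y}$, then for every $\varepsilon > 0$
  $\lim_{n \to \infty} \Pr_{D_n}\left( |\hat{R}(h) - R(h)| > \varepsilon \right) = 0$
▶ When the data are drawn i.i.d. from $P_{X \times Y}$, the gap between the empirical risk and the expected risk converges to 0 in probability as the sample size grows (strictly speaking this alone is not sufficient: the statement holds for each fixed $h$, whereas $\hat{h}$ depends on the data, so a bound that is uniform over $\mathcal{H}$ is needed)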
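The convergence statement above can be checked numerically for a single fixed hypothesis. The following is a minimal sketch, not part of the original slides: the data-generating model (x uniform on [-1, 1], y = 2x plus Gaussian noise), the hypothesis h(x) = 1.5x, and the squared loss are all assumptions chosen so that the expected risk has a closed form.

```python
import random


def empirical_risk(h, data):
    """Average squared loss of hypothesis h over the sample."""
    return sum((h(x) - y) ** 2 for x, y in data) / len(data)


def sample(n, rng):
    """Draw n i.i.d. pairs: x ~ U(-1, 1), y = 2x + N(0, 0.1^2) (assumed toy model)."""
    data = []
    for _ in range(n):
        x = rng.uniform(-1.0, 1.0)
        y = 2.0 * x + rng.gauss(0.0, 0.1)
        data.append((x, y))
    return data


def h(x):
    """A fixed hypothesis; it is NOT fitted to the data."""
    return 1.5 * x


# Closed-form expected risk under the assumed model:
# E[(1.5x - 2x - eps)^2] = 0.25 * E[x^2] + Var(eps) = 0.25 / 3 + 0.01
expected_risk = 0.25 / 3 + 0.01

rng = random.Random(0)
gaps = {}
for n in (10, 1000, 100000):
    gaps[n] = abs(empirical_risk(h, sample(n, rng)) - expected_risk)
    print(n, gaps[n])
```

Because h is fixed in advance, this only illustrates the pointwise law of large numbers; it says nothing about the data-dependent minimizer.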

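Learning a predictive model by empirical loss minimization
The estimated parameter is the element of the parameter space that minimizes the empirical loss (empirical risk), i.e., the average error between the model's predictions and the true outputs.
  • Example of the loss: 0-1 loss (classification)
  • Example of the loss: squared loss (regression)
  • Example of how to search for the optimum: iterative updates by gradient descent
▶ Each training pair $(x_i, y_i)$ is (a realization of) an i.i.d. sample from the probability distribution $P_{X \times Y}$ on the data space $\mathcal{X} \times \mathcal{Y}$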
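As an illustration of the "squared loss plus gradient descent" combination listed above, here is a minimal sketch for a one-dimensional linear model. The model form y ≈ w·x + b, the toy data, the learning rate, and the step count are assumptions made for this example, not values from the slides.

```python
def fit_linear_gd(xs, ys, lr=0.1, steps=2000):
    """Minimize the empirical squared loss of y ~ w * x + b by gradient descent."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        # Gradient of (1/n) * sum((w * x + b - y)^2) with respect to w and b
        grad_w = sum(2.0 * (w * x + b - y) * x for x, y in zip(xs, ys)) / n
        grad_b = sum(2.0 * (w * x + b - y) for x, y in zip(xs, ys)) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b


# Noise-free toy data generated from y = 3x + 1 (assumed for illustration)
xs = [0.0, 1.0, 2.0, 3.0]
ys = [3.0 * x + 1.0 for x in xs]
w_hat, b_hat = fit_linear_gd(xs, ys)
print(w_hat, b_hat)
```

The step size has to stay below the stability threshold set by the curvature of the loss; for these four points 0.1 is small enough, but that choice is specific to this toy data.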

Difficulty 1: Deep learning models are growing to extreme scale
[Figure: [Bernstein+ 2021] Figure 1. Number of parameters, i.e., weights, in recent landmark neural networks (references dated by first release, e.g., on arXiv). The number of multiplications (not always reported) is not equivalent to the number of parameters, but larger models tend to require more compute power, notably in fully-connected layers. Horizontal axis: Year (2012 to 2020). Vertical axis: Number of Model Parameters (10^5 to 10^11). Models shown: AlexNet, VGG16, GoogLeNet, ResNet-50, DQN, Inception V3, Xception, Transformer (Base), Transformer (Big), NASNet, SENet, BERT, GPT-2, ALBERT, Transformer-XL, GPT-3. Slide annotations: 60 million, 140 million, 175 billion.]
▶ The size of deep learning models is growing exponentially

Difficulty 1: Deep learning models are growing to extreme scale
[Figure: [Huang+ 2019] Figure 1. Translation quality (BLEU) compared against bilingual baselines on a massively multilingual in-house corpus, with increasing model size. Each point, T(L, H, A), depicts the performance of a Transformer with L encoder and L decoder layers, a feed-forward hidden dimension of H and A attention heads. Red dot depicts the performance of a 128-layer 6B parameter Transformer.]
[Figure: [Kaplan+ 2020] Figure 1. Language modeling performance improves smoothly as we increase the model size, dataset size, and amount of compute used for training. For optimal performance all three factors must be scaled up in tandem. Panels plot Test Loss against Compute (PF-days, non-embedding), Dataset Size (tokens), and Parameters (non-embedding).]
▶ Accuracy keeps rising as the number of model parameters increases (no overfitting?)
▶ A power law (scaling law) of the error with respect to the amount of data, the number of parameters, and the compute budget has been reported

Example: GPT-3 [Brown+ 2020]
[Figure: [Brown+ 2020] Figure 3.14. The GPT-3 generated news article that humans had the greatest difficulty distinguishing from a human written article (accuracy: 12%). Title: United Methodists Agree to Historic Split. Subtitle: Those who oppose gay marriage will form their own denomination.]
▶ The largest model, GPT-3 175B, was trained on 499 billion tokens
  • Its predecessor, GPT-2 1.5B, was trained on 40 GB of text (roughly 10 billion tokens)
▶ On the cheapest GPU cloud on the market, a single training run takes 355 GPU-years and 4.6 million dollars
▶ Storing the 175 billion parameters requires 700 GB of memory
  • For comparison, the maximum memory of a Quadro RTX8000 is 48 GB, so the requirement is an order of magnitude larger
Source of the figures quoted above: https://lambdalabs.com/blog/demystifying-gpt-3

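Difficulties of ever-larger machine learning
▶ There is not enough compute to train a huge model (from scratch)
  • Most large models released so far have been developed by the so-called Big Tech companies with enormous compute budgets
▶ There is not enough training data (labeled data) to train a complex model properly
  • Self-supervised learning, a training approach that needs no labels, has matured recently and is already widely used in image recognition and natural language processing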

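Difficulty 2: Problems where acquiring the data is itself hard
Developing a prognosis prediction model for a disease [Nishikimi+ 2017, Matsui+ 2018]
  • Clinical information (age, sex, head CT, ...) and outcomes (good or poor prognosis) of post-cardiac arrest syndrome (PCAS) patients have been collected at four institutions (n is the number of patients)
    Large hospitals and community hospitals: Nagoya University Hospital, Japanese Red Cross Maebashi Hospital, Komaki City Hospital, Chutoen General Medical Center
  • The goal is to predict the prognosis of new PCAS patients from these data
§ Each hospital has too few patients to develop its own prediction model
  → the data of all hospitals must be combined to build a model
§ However, the patient populations may differ between hospitals
  e.g., severe cases tend to concentrate in large hospitals and mild cases in community hospitals
§ The measured items also vary from hospital to hospital
  → simply concatenating the data tables does not work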

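Transfer learning as an approach to small-data machine learning
▶ Not enough compute to train a huge model
  → reduce the part of the model that is trained
▶ Not enough training data to train a complex model
  → use data obtained in other problems as auxiliary information
▶ The data sets to be combined follow different trends
  → correct for the "differences" between the data sets
All of these can be realized by transfer learning.
Goal of today's talk: understand the basic ideas of transfer learning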

Basic Concepts of Transfer Learning

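What is transfer learning?
Definition from Inductive Transfer: 10 Years Later (NIPS'05 Workshop)
Inductive transfer, or transfer learning, refers to the problem of retaining and applying the knowledge learned in one or more other tasks in order to efficiently find an effective hypothesis for a new task.
→ How is this formulated as a machine learning problem, and how is it solved?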

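Problem setting of transfer learning
Goal: learn the hypothesis with minimum expected risk in the target domain
[Figure: schematic of knowledge transfer from the source domain to the target domain]
▶ Using the knowledge of a "similar" source domain makes learning feasible where the target domain alone was not enough
▶ When source domain = target domain, regarding the former as training data and the latter as test data recovers ordinary machine learning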

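The 2W1H of transfer learning
1. When to transfer
  • In general, the source and target domains differ
  • Intuitively, transfer is more likely to succeed when the two domains are similar
    • The dissimilarity between domains is quantified by, e.g., a discrepancy measure
    • Transfer assumptions (assumptions imposed so that knowledge transfer can succeed)
  • We want to avoid negative transfer caused by domain dissimilarity
    • Negative transfer: the phenomenon in which transfer actually worsens performance in the target domain
2. What to transfer
  • What kind of "knowledge" is transferred from the source domain to the target domain
3. How to transfer
  • Concrete transfer learning algorithms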

When to transfer? Negative transfer
Compare two options:
1. use a model trained on one domain only
2. use a model trained by transfer learning
Negative transfer is the phenomenon in which
  (target-task performance of 2) ≤ (target-task performance of 1)
Intuitively, the further apart the domains are, the more likely negative transfer becomes.
[Figure: two panels (a) and (b) plotting AUC (0.0 to 1.0) against the number of target training cases for three models: source only, transfer, target only.]

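When to transfer?
In general, the source and target domains differ (domain shift)
→ various assumptions are placed on the dissimilarity of the domains (transfer assumptions)
Distribution shift (homogeneous domain shift)
  • The target domain is known (accessible at training time)
  • Transfer assumptions [Quiñonero-Candela+ '09]
    • Covariate shift
    • Label prior shift
    • Sample selection bias
    • Class balance shift
Heterogeneous transfer (heterogeneous domain shift)
  • Illustration: an image in one domain and a text description in the other ("a yellow, elongated fruit that you peel before eating...")
  • The target domain is known (accessible at training time)
  • Transfer assumptions [Duan+ '12, Ganin+ '15]
    • A common feature (latent) space: features are extracted from the source and target domains into a shared feature space
Domain generalization
  • The target domain is unknown (not accessible at training time)
  • Homogeneous case (shared label space): [Zhou+ '21] Style Transfer
  • Heterogeneous case (different label spaces): [Rebuffi+ '17] Visual Decathlon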

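When to transfer?
Domain discrepancy: dissimilarity is measured by a (pseudo-)distance between the probability distributions of the two domains
▶ When the discrepancy is small, the target-domain data are regarded as having a generating mechanism close to that of the source domain
▶ Various discrepancies have been defined
  • H-divergence [Ben-David+ '10]
  • Wasserstein distance [Courty+ '17]
  • Source-guided discrepancy [Kuroki+ '19]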

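Example: H-divergence
Consider homogeneous domain shift and binary classification.
H-divergence
  $d_{\mathcal{H}}(\hat{P}^S_X, \hat{P}^T_X) := 2 \sup_{h \in \mathcal{H}} \left| \frac{1}{n_S} \sum_{i=1}^{n_S} \mathbb{1}[h(x^S_i) = 1] - \frac{1}{n_T} \sum_{i=1}^{n_T} \mathbb{1}[h(x^T_i) = 1] \right|$
Equivalent rewriting:
  $2 \sup_{h \in \mathcal{H}} \left| \frac{1}{n_S} \sum_{i=1}^{n_S} \mathbb{1}[h(x^S_i) = 1] - \frac{1}{n_T} \sum_{i=1}^{n_T} \mathbb{1}[h(x^T_i) = 1] \right|$
  $= 2 \sup_{h \in \mathcal{H}} \left| 1 - \frac{1}{n_S} \sum_{i=1}^{n_S} \mathbb{1}[h(x^S_i) = 0] - \frac{1}{n_T} \sum_{i=1}^{n_T} \mathbb{1}[h(x^T_i) = 1] \right|$
  $= 2 \sup_{h \in \mathcal{H}} \left| 1 - \left( \frac{1}{n_S} \sum_{i=1}^{n_S} \mathbb{1}[h(x^S_i) = 0] + \frac{1}{n_T} \sum_{i=1}^{n_T} \mathbb{1}[h(x^T_i) = 1] \right) \right|$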

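Example: H-divergence
In particular, when $\mathcal{H}$ is a symmetric hypothesis set ($h \in \mathcal{H} \Rightarrow h' = 1 - h \in \mathcal{H}$),
  $d_{\mathcal{H}}(\hat{P}^S_X, \hat{P}^T_X) = 2 \left( 1 - \inf_{h \in \mathcal{H}} \left[ \frac{1}{n_S} \sum_{i=1}^{n_S} \mathbb{1}[h(x^S_i) = 0] + \frac{1}{n_T} \sum_{i=1}^{n_T} \mathbb{1}[h(x^T_i) = 1] \right] \right)$
▶ It is the largest possible gap, when the same hypothesis is applied to both domains, between the fractions of data classified as class 1
▶ It grows as hypotheses $h$ become better at telling which domain an input $x$ came from
  • The expression above shows that predicting label 0 or 1 for an input $x$ is equivalent to predicting whether $x$ belongs to the source domain or to the target domain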
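The two expressions for the empirical H-divergence can be compared on a small example. The sketch below is an illustration under assumptions that are not in the slides: inputs are one-dimensional, and the hypothesis set consists of threshold classifiers together with their complements, which makes it symmetric.

```python
def frac_positive(h, xs):
    """Fraction of the points in xs that hypothesis h assigns to class 1."""
    return sum(h(x) for x in xs) / len(xs)


def threshold_hypotheses(points):
    """Thresholds h_t(x) = 1[x >= t] and their complements 1 - h_t (a symmetric set)."""
    hyps = []
    for t in sorted(set(points)) + [float("inf")]:
        hyps.append(lambda x, t=t: 1 if x >= t else 0)
        hyps.append(lambda x, t=t: 0 if x >= t else 1)
    return hyps


def h_divergence(xs_source, xs_target):
    """2 * sup_h | frac_S(h = 1) - frac_T(h = 1) |."""
    hyps = threshold_hypotheses(xs_source + xs_target)
    return 2.0 * max(
        abs(frac_positive(h, xs_source) - frac_positive(h, xs_target)) for h in hyps
    )


def h_divergence_symmetric(xs_source, xs_target):
    """2 * (1 - inf_h [ frac_S(h = 0) + frac_T(h = 1) ]), valid for a symmetric set."""
    hyps = threshold_hypotheses(xs_source + xs_target)
    best = min(
        (1.0 - frac_positive(h, xs_source)) + frac_positive(h, xs_target) for h in hyps
    )
    return 2.0 * (1.0 - best)


source = [0.0, 1.0, 2.0, 3.0]
target = [2.0, 3.0, 4.0, 5.0]
print(h_divergence(source, target), h_divergence_symmetric(source, target))
```

The quantity inside the infimum is the error of h when it is used as a domain classifier (source labeled 1, target labeled 0), which is why a better domain classifier yields a larger divergence.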

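Example: Wasserstein distance
Optimal transport problem
Soil excavated at one site is to be used for construction at another site; find the transport plan that minimizes the transport cost.
Formulation by Monge (over maps $\tau$ that push $P^S_X$ forward to $P^T_X$):
  $\min_{\tau : \mathcal{X} \to \mathcal{X}} \int_{\mathcal{X}} \|x - \tau(x)\| \, dP^S_X(x)$
Formulation by Kantorovich:
  $\min_{\pi \in \Pi(P^S_X, P^T_X)} \int_{\mathcal{X} \times \mathcal{X}} \|x - x'\| \, d\pi(x, x')$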

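Example: Wasserstein distance
▶ Wasserstein distance of order $p$
  $W_p(P^S_X, P^T_X) = \left( \inf_{\pi \in \Pi(P^S_X, P^T_X)} \int_{\mathcal{X} \times \mathcal{X}} \|x - x'\|^p \, d\pi(x, x') \right)^{1/p}$
  • $P^S_X, P^T_X$ are probability distributions with finite $p$-th moment: $\int_{\mathcal{X}} \|x - x'\|^p \, dP(x) < \infty$ for all $x' \in \mathcal{X}$
  • $\pi \in \Pi(P^S_X, P^T_X)$ is a distribution on $\mathcal{X} \times \mathcal{X}$ satisfying $\pi(\cdot, \mathcal{X}) = P^S_X(\cdot)$ and $\pi(\mathcal{X}, \cdot) = P^T_X(\cdot)$ (a coupling of $P^S_X$ and $P^T_X$)
▶ Kantorovich-Rubinstein duality (for $p = 1$)
  $W_1(P^S_X, P^T_X) = \sup_{f : 1\text{-Lip}} \left\{ \mathbb{E}_{x \sim P^S_X}[f(x)] - \mathbb{E}_{x \sim P^T_X}[f(x)] \right\}$
  • 1-Lip: 1-Lipschitz continuous functions, i.e., $|f(x) - f(x')| / \|x - x'\| \le 1$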
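In one dimension the infimum over couplings has a closed form for empirical distributions: with two samples of equal size and uniform weights, the optimal coupling matches the sorted points. The sketch below relies on that special case only; it is an assumption-laden illustration and not a general optimal transport solver.

```python
def wasserstein_1d(xs, ys, p=1):
    """Empirical p-Wasserstein distance between two 1-D samples of equal size.

    With equal sizes and uniform weights, the optimal coupling pairs the i-th
    smallest point of xs with the i-th smallest point of ys.
    """
    if len(xs) != len(ys):
        raise ValueError("this sketch only handles samples of equal size")
    n = len(xs)
    cost = sum(abs(a - b) ** p for a, b in zip(sorted(xs), sorted(ys))) / n
    return cost ** (1.0 / p)


# Shifting every point by 1 costs exactly 1, regardless of the order of the samples
print(wasserstein_1d([2.0, 0.0, 1.0], [3.0, 1.0, 2.0]))
print(wasserstein_1d([0.0, 0.0], [1.0, 3.0], p=2))
```

For higher-dimensional inputs or unequal sample sizes the coupling has to be found by solving a linear program, which is outside the scope of this sketch.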

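Integral probability metric (IPM)
Integral probability metric (IPM) [Sriperumbudur+ 2009]
  $\gamma_{\mathcal{G}}(P, Q) := \sup_{g \in \mathcal{G}} \left| \mathbb{E}_P[g] - \mathbb{E}_Q[g] \right|$
▶ Different choices of the function class $\mathcal{G}$ express different discrepancies
  • The family includes the Wasserstein distance, the total variation distance, and the kernel MMD
  • e.g., when $\mathcal{G}$ is the set of all 1-Lipschitz functions, $\gamma_{\mathcal{G}}(P, Q) = W_1(P, Q)$
Theorem [Sriperumbudur+ '09]
With probability at least $1 - \delta$, the following bound on the sample approximation holds:
  $\left| \gamma_{\mathcal{G}}(P_T, P_S) - \gamma_{\mathcal{G}}(\hat{P}_T, \hat{P}_S) \right| \le 2\mathfrak{R}_{T,n_T}(\mathcal{G}) + 2\mathfrak{R}_{S,n_S}(\mathcal{G}) + M \sqrt{18 \log \frac{4}{\delta}} \left( \frac{1}{\sqrt{n_T}} + \frac{1}{\sqrt{n_S}} \right)$
▶ $\mathfrak{R}_{T,n_T}(\mathcal{G}), \mathfrak{R}_{S,n_S}(\mathcal{G})$: Rademacher complexities of $\mathcal{G}$
▶ $n_T, n_S$: sample sizes of the two domains
▶ $M = \sup_{x \in \mathcal{X}, g \in \mathcal{G}} g(x)$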
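For the kernel MMD mentioned above, the supremum over the unit ball of a reproducing kernel Hilbert space can be evaluated in closed form from kernel averages, so the IPM between two empirical distributions is directly computable. The following is a minimal sketch assuming one-dimensional inputs and a Gaussian kernel with bandwidth sigma = 1; both choices are illustrative assumptions.

```python
import math


def gaussian_kernel(x, y, sigma=1.0):
    """Gaussian (RBF) kernel on the real line."""
    return math.exp(-((x - y) ** 2) / (2.0 * sigma ** 2))


def mean_kernel(xs, ys, sigma):
    """Average kernel value over all pairs (x, y)."""
    total = sum(gaussian_kernel(x, y, sigma) for x in xs for y in ys)
    return total / (len(xs) * len(ys))


def mmd(xs, ys, sigma=1.0):
    """Kernel MMD between the empirical distributions of xs and ys (biased estimate)."""
    squared = (
        mean_kernel(xs, xs, sigma)
        + mean_kernel(ys, ys, sigma)
        - 2.0 * mean_kernel(xs, ys, sigma)
    )
    return math.sqrt(max(squared, 0.0))  # guard against tiny negative rounding error


base = [0.0, 0.1, 0.2]
near = [0.05, 0.15, 0.25]
far = [5.0, 5.1, 5.2]
print(mmd(base, near), mmd(base, far))
```

The result depends on the bandwidth; a kernel that is too wide or too narrow relative to the data scale makes all distributions look alike.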

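What to transfer?
The formulation and the method depend on the "knowledge" transferred from the source domain to the target domain.
  • Instance transfer: the data set is transferred. Example: importance-weighted learning
  • Feature transfer: the features are transferred. Example: domain adversarial learning
  • Parameter transfer: a trained hypothesis is transferred. Example: pre-training and fine-tuning
Goal of transfer learning: the hypothesis with minimum risk in the target domain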

Transfer Learning Methods

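Importance-weighted learning (instance transfer) [Sugiyama+ 2012]
  • Objective: the estimated parameter should minimize the empirical risk in the target domain
  • Transfer assumption: covariate shift
  • Under the covariate shift assumption, the target-domain risk is rewritten as a weighted risk in the source domain
  • The weight is the density ratio of the input distributions, which expresses how far the two distributions are apart
▶ Target-domain risk ≈ weighted risk in the source domain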
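The identity behind "target-domain risk = weighted source-domain risk" follows in a few lines. The derivation below is a sketch in notation introduced here for illustration: $p_S(x)$ and $p_T(x)$ are the input densities of the two domains, $p(y \mid x)$ is the conditional shared by both domains (this sharing is the covariate shift assumption), and $w(x) = p_T(x) / p_S(x)$ is the importance weight, assumed finite wherever $p_T(x) > 0$.

```latex
\begin{align*}
R_T(h)
  &= \mathbb{E}_{(X,Y) \sim P^T_{X \times Y}}\bigl[\ell(h(X), Y)\bigr] \\
  &= \iint \ell(h(x), y)\, p(y \mid x)\, p_T(x)\, dy\, dx \\
  &= \iint \ell(h(x), y)\, p(y \mid x)\, \frac{p_T(x)}{p_S(x)}\, p_S(x)\, dy\, dx \\
  &= \mathbb{E}_{(X,Y) \sim P^S_{X \times Y}}\bigl[w(X)\, \ell(h(X), Y)\bigr],
  \qquad w(x) = \frac{p_T(x)}{p_S(x)} .
\end{align*}
```

Replacing the last expectation by an average over the labeled source sample gives the importance-weighted empirical risk; the approximation sign on the slide reflects this sampling step and the fact that $w$ has to be estimated.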

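Importance-weighted learning (instance transfer) [Sugiyama+ 2012]
  • Under the covariate shift assumption, target-domain risk = weighted risk in the source domain
  • The estimated parameter minimizes the importance-weighted empirical loss; minimizing it yields the model parameters for the target domain
  • Importance weight: the density ratio
Interpretation
  • A large density ratio means the point is close to the target domain
  • Source data closer to the target domain contribute more to learning
  • The weights (density ratios) are unknown and must be estimated from data
  • Estimation method: unconstrained least-squares importance fitting [Kanamori+ 2009]
▶ A model for the target domain is learned from source-domain data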

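Characteristics of importance-weighted learning
▶ A two-step method
  Step 1: estimate the density ratio function
  Step 2: minimize the empirical loss weighted by the estimated density ratio
  • Step 1 is unsupervised learning that uses only the inputs of both domains, $\{x^S_i\}_{i=1}^{n_S}$ and $\{x^T_j\}_{j=1}^{n_T}$
  • Step 2 is supervised learning that uses only the labeled source data $\{(x^S_i, y^S_i)\}_{i=1}^{n_S}$
  ⇒ It can be run without the target-domain outputs $y^T_j$ (unsupervised domain adaptation)
▶ Density ratio estimation is independent of the predictive model, so it can be combined with any model
▶ It is applicable only under homogeneous domain shift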
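Step 2 can be illustrated in isolation when the density ratio is known. The sketch below makes several assumptions that are not in the slides: source inputs follow N(0, 1), target inputs follow N(0.5, 1), the input-output relation is y = x in both domains, and the hypothesis h(x) = 0 is fixed. The analytic ratio stands in for Step 1; in practice it would be replaced by an estimate such as the least-squares fit cited above.

```python
import math
import random


def density_ratio(x, mu_s=0.0, mu_t=0.5):
    """Ratio p_T(x) / p_S(x) of two unit-variance Gaussian densities (assumed known)."""
    return math.exp(-0.5 * (x - mu_t) ** 2 + 0.5 * (x - mu_s) ** 2)


def weighted_risk(h, data, weight):
    """Importance-weighted empirical squared loss over labeled source data."""
    return sum(weight(x) * (h(x) - y) ** 2 for x, y in data) / len(data)


def h(x):
    """A fixed hypothesis that always predicts zero."""
    return 0.0


rng = random.Random(0)
source = [(x, x) for x in (rng.gauss(0.0, 1.0) for _ in range(50000))]

unweighted = weighted_risk(h, source, lambda x: 1.0)  # estimates the source risk
weighted = weighted_risk(h, source, density_ratio)    # estimates the target risk

# Closed forms under the assumed model: E_S[x^2] = 1.0 and E_T[x^2] = 1 + 0.5^2 = 1.25
print(unweighted, weighted)
```

Only source samples are used, yet the weighted average tracks the target-domain risk. The variance of the weighted estimate grows quickly as the two input distributions move apart, which is one practical limit of the approach.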

Importance-weighted learning in practice: demo on a toy example
Experimental data
  • The Two Moons data set is used*

Source domain
  Ø 2-dimensional data
  Ø Upper arc: label 0 (red)
  Ø Lower arc: label 1 (blue)

         X1          X2    Y
  -0.574753    0.993981    0
   1.246100   -0.207939    1
   0.473573    0.093938    0
  -0.466737    0.147854    1
   0.498300   -0.506125    1
   ...

Target domain
  Ø The source data rotated by 30 degrees about the origin
  Ø No labels (the input-output relation is assumed to be the same)

         X1          X2
  -0.994741    0.573436
   1.183124    0.442970
   0.363157    0.318140
  -0.478133   -0.105323
   0.684604   -0.189167
   ...

* Two Moons: https://adapt-python.github.io/adapt/examples/Two_moons.html

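Importance-weighted learning in practice: demo on a toy example
Result of density ratio estimation
  • Points with a large density ratio are drawn large, points with a small density ratio are drawn small
  • Points at the top and bottom, which deviate most from the target domain, receive small weights
  • Points from the center to the left, which deviate least from the target domain, receive large weights
  Note: the density ratios are max-min scaled for display to emphasize their relative size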

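Importance-weighted learning in practice: demo on a toy example
Decision boundaries obtained with density-ratio weighting
  • Without weighting (source data trained with uniform weights): acc 0.82
  • With weighting (weights computed with reference to the target data): acc 0.87
  • The decision boundary in the lower-left region improves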

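Domain adversarial learning (feature transfer) [Ganin+ 2015]
▶ A domain-invariant feature extractor / feature representation is learned in the hidden layers of a neural network
▶ Invariant feature representation ≈ feature extraction into a common space + distribution matching
▶ It can also be used under heterogeneous domain shift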

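Domain adversarial learning (feature transfer) [Ganin+ 2015]
▶ The feature extractor and the domain classifier are trained adversarially, following the same idea as generative adversarial networks (GANs)
▶ The feature extractor produces features from which the domain classifier cannot tell the domain of origin
  → domain-invariant features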

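Interpreting domain adversarial learning: minimizing the domain discrepancy
The H-divergence (domain discrepancy) is minimized in the hidden layers.
Rough picture of the loss function
  • Label loss: compares the label predicted from the features of the source data with the observed label
  • Domain loss: compares the domain label predicted from the features of the source data and of the target data with the observed domain label
Empirical H-divergence
  • It is defined between the feature set of the source domain and the feature set of the target domain
  • It is estimated by training the domain classifier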

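Domain adversarial learning based on the Wasserstein distance [Shen+ 2018]
Invariant representation learning with the Wasserstein distance
  $\min_{\theta_g} \max_{\theta_w} \left\{ L_{WD}(x^S, x^T) - \gamma L_{grad}(\hat{F}) \right\}$
▶ $L_{WD}(x^S, x^T) = \frac{1}{n_S} \sum_{x^S} h_{\theta_w}(f_{\theta_g}(x^S)) - \frac{1}{n_T} \sum_{x^T} h_{\theta_w}(f_{\theta_g}(x^T))$
  • The Kantorovich-Rubinstein dual form of the empirical Wasserstein distance between $P^S_X$ and $P^T_X$
▶ $L_{grad}(\hat{F}) = \left( \|\nabla_{\hat{F}} h_{\theta_w}(\hat{F})\| - 1 \right)^2$, with $\hat{F}$ on the line segment between $f_{\theta_g}(x^S)$ and $f_{\theta_g}(x^T)$
  • A penalty term that enforces the Lipschitz property of $h_{\theta_w}$
▶ The Wasserstein distance is minimized in the hidden layers
▶ $h_{\theta_w}$ discriminates which domain the features extracted by $f_{\theta_g}$ come from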
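The inner objective maximized over the critic can be written out for the simplest possible critic. The sketch below assumes a linear critic h(F) = w·F acting on already extracted features, so that the gradient with respect to F is w everywhere and the penalty no longer depends on where on the segment it is evaluated. The function names, the toy features, and the value gamma = 10 are illustrative assumptions rather than the setup of [Shen+ 2018].

```python
import math


def critic(w, feature):
    """Linear domain critic h(F) = <w, F> (an assumed, simplest-case critic)."""
    return sum(wi * fi for wi, fi in zip(w, feature))


def wasserstein_term(w, feats_s, feats_t):
    """L_WD: mean critic value on source features minus mean on target features."""
    mean_s = sum(critic(w, f) for f in feats_s) / len(feats_s)
    mean_t = sum(critic(w, f) for f in feats_t) / len(feats_t)
    return mean_s - mean_t


def gradient_penalty(w):
    """L_grad = (||grad_F h|| - 1)^2; for a linear critic the gradient is w itself."""
    norm = math.sqrt(sum(wi * wi for wi in w))
    return (norm - 1.0) ** 2


def critic_objective(w, feats_s, feats_t, gamma=10.0):
    """Quantity maximized over the critic parameters: L_WD - gamma * L_grad."""
    return wasserstein_term(w, feats_s, feats_t) - gamma * gradient_penalty(w)


feats_s = [[1.0, 0.0], [3.0, 0.0]]
feats_t = [[0.0, 0.0], [0.0, 0.0]]
print(critic_objective([1.0, 0.0], feats_s, feats_t))  # unit-norm critic, no penalty
print(critic_objective([2.0, 0.0], feats_s, feats_t))  # larger norm is penalized
```

Scaling the critic up increases the first term, and the penalty is what stops this from growing without bound; the feature extractor is then trained to reduce the resulting estimate.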

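Domain adversarial learning based on the Wasserstein distance [Shen+ 2018]
[Figure: network diagram. Source-domain and target-domain data pass through a shared feature extractor, giving source and target features. The label classifier produces the classification error, and the domain critic produces the Wasserstein distance.]
In practice the label classifier $h_{\theta_c}$ is trained at the same time:
  $\min_{\theta_g, \theta_c} \left\{ L_c(x^S, y^S) + \lambda \max_{\theta_w} \left[ L_{WD}(x^S, x^T) - \gamma L_{grad}(\hat{F}) \right] \right\}$
▶ $L_c(x^S, y^S) = -\frac{1}{n_S} \sum_{i=1}^{n_S} \sum_{k=1}^{\ell} \mathbb{1}\{y^S_i = k\} \log h_{\theta_c}(f_{\theta_g}(x^S_i))_k$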

Pre-training and fine-tuning (parameter transfer)
A method for parameter transfer, and the most widely used transfer learning technique.
[Figure: schematic of pre-training a neural network and fine-tuning it]

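Large-scale pre-trained models based on the Transformer
▶ GPT: pre-trains the decoder part of the Transformer
  • Task: predict the next word from left to right
▶ BERT: pre-trains the encoder part of the Transformer (figure above)
  • Task: predict masked words at arbitrary positions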

What makes a pre-trained model good?
Pre-training and fine-tuning of neural network models
[Figure: schematic of pre-training a neural network and fine-tuning it]
Basic questions
▶ When should we transfer? (transferability)
▶ Does it really work? (theoretical performance guarantees)

Transferability of pre-trained models
[Figure: [Huang+ 2022] Figure 1. Transferring from ImageNet to CIFAR-100. For "Full Target Data", all target data are used, while for "Scarce Target Data", only 50 target samples per class are used in training.
 (a) Test accuracy for pre-training ImageNet with different model architectures. Horizontal axis: Pre-trained Models (ResNet-18, ResNet-34, ResNet-50, MobileNet0.25, MobileNet0.5, MobileNet0.75, MobileNet1.0). Vertical axis: Test Accuracy.
 (b) Test accuracy for transferring different layers of the pre-trained ResNet34 model. Horizontal axis: # of Layers NOT Being Transferred (0 to 8). Vertical axis: Test Accuracy.]
  • Panel (a): test accuracy when pre-trained models with different architectures are transferred from ImageNet to CIFAR-100
  • Panel (b): test accuracy when different layers of ResNet34 are transferred from ImageNet to CIFAR-100; the slide marks the settings "all layers transferred" and "final layer not transferred"
  • Full Target Data: all target-domain data are used
  • Scarce Target Data: only 50 target-domain samples per class are used
▶ Upper layers tend to yield features specialized to the source-domain task, whereas lower layers tend to yield general-purpose features

Example of evaluating the transferability of pre-trained models: Taskonomy [Zamir+ '18]
Transfer learning is run exhaustively over all combinations of 26 image-related tasks to find pairs of tasks with high affinity.
[Figure: task taxonomies computed for different supervision budgets (2 and 8) and transfer orders (1, 2, 4). Nodes are tasks such as Autoencoding, Denoising, 2D Edges, 2D Keypoints, 2D Segm., 2.5D Segm., 3D Keypoints, Normals, Curvature, Occlusion Edges, Reshading, Z-Depth, Distance, Layout, Matching, Egomotion, Cam. Pose (fix), Cam. Pose (nonfix), Vanishing Pts., Object Class. (1000), Scene Class., Semantic Segm., Colorization, In-painting, Jigsaw, and Random Proj.]

Example of evaluating the transferability of pre-trained models: Taskonomy [Zamir+ '18]
Pipeline up to the generation of edges between tasks
[Figure: [Zamir+ '18] Figure 2. Computational modeling of task relations and creating the taxonomy. From left to right: I. Train task-specific networks. II. Train (first order and higher) transfer functions among tasks in a latent space. III. Get normalized transfer affinities using AHP (Analytic Hierarchy Process). IV. Find global transfer taxonomy using BIP (Binary Integer Program).]
[Figure: [Zamir+ '18] Figure 4. Transfer Function: the frozen encoder of the source task feeds a small readout function that produces the target-task output; second- and third-order transfers take several representations as input.]
▶ Step I, task-specific modeling: plain supervised learning
  • A fully supervised task-specific network with an encoder-decoder architecture is trained for each task
▶ Step II, transfer modeling: transfer over all combinations of tasks (including the multi-source case)
  • Readout function: $D_{s \to t} := \arg\min_{\theta} \mathbb{E}_{I \in \mathcal{D}} \left[ L_t\left( D_\theta(E_s(I)), f_t(I) \right) \right]$, where $E_s(I)$ is the representation of image $I$ from the encoder of $s$ and $f_t(I)$ is the ground truth of $t$
▶ Step III, ordinal normalization with AHP: the affinity matrix is built from the win rates of $s_i$ versus $s_j$ (comparing how much each task improves through transfer)
  • $w_{i,j} = \dfrac{\mathbb{E}_{I \in \mathcal{D}_{test}}[D_{s_i \to t}(I) > D_{s_j \to t}(I)]}{\mathbb{E}_{I \in \mathcal{D}_{test}}[D_{s_i \to t}(I) < D_{s_j \to t}(I)]}$
  • The transferability of $s_i$ to $t$ is the $i$-th component of the principal eigenvector of $W_t$ (normalized to sum to 1); stacking these vectors over all $t$ gives the affinity matrix $P$
▶ Step IV, computing the global taxonomy: a binary integer program is solved to place the edges
  • maximize $c^\top x$ subject to $Ax \preceq b$ and $x \in \{0, 1\}^{|E| + |V|}$, under three constraints
  • $c_i := r_{target(i)} \cdot p_i$ (task importance times AHP score)
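The AHP normalization step described in the excerpt (pairwise win-rate ratios, then the principal eigenvector normalized to sum to 1) can be sketched with plain power iteration. The win-rate table below is made up for illustration, and the function names are hypothetical; the sketch also assumes that every win rate is strictly positive so that all ratios are defined.

```python
def pairwise_matrix(wins):
    """Build W with W[i][j] = wins[i][j] / wins[j][i].

    wins[i][j] is the fraction of held-out images on which source i transfers
    to the target better than source j does (all entries assumed positive).
    """
    n = len(wins)
    return [
        [1.0 if i == j else wins[i][j] / wins[j][i] for j in range(n)]
        for i in range(n)
    ]


def principal_eigenvector(matrix, iters=200):
    """Power iteration for a positive matrix; the result is normalized to sum to 1."""
    n = len(matrix)
    v = [1.0 / n] * n
    for _ in range(iters):
        u = [sum(matrix[i][j] * v[j] for j in range(n)) for i in range(n)]
        total = sum(u)
        v = [x / total for x in u]
    return v


# Hypothetical win rates among three source tasks for one target task
wins = [
    [0.5, 1.0 / 3.0, 0.2],
    [2.0 / 3.0, 0.5, 1.0 / 3.0],
    [0.8, 2.0 / 3.0, 0.5],
]
affinity = principal_eigenvector(pairwise_matrix(wins))
print(affinity)  # one row of the affinity matrix: scores of the three sources
```

In this toy table the ratios are mutually consistent (source 3 beats source 2 by the same factor that source 2 beats source 1), so the scores come out proportional to 1 : 2 : 4. With inconsistent comparisons the eigenvector still gives a single ranking, which is the reason for using it.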

Example of evaluating the transferability of pre-trained models: Task2Vec [Achille+ '19]
§ Start from a trained neural network
§ The importance of each network weight for the task is measured by the KL divergence between the original output distribution $p_w(y|x)$ and the perturbed one $p_{w'}(y|x)$, averaged over the empirical distribution of the inputs; its second-order approximation is governed by the Fisher information matrix
Fisher information matrix
  • Represents the amount of information a given parameter carries about the joint distribution
  • When the performance on a task does not depend strongly on a parameter, the corresponding entry is small
Vectorization (Task2Vec)
  • Only the diagonal entries of the Fisher information matrix are kept
  • They are averaged over all weights belonging to the same filter
[Figure: [Achille+ '19] Figure 1. Task embedding across a large library of tasks (best seen magnified). (Left) T-SNE visualization of the embedding of tasks extracted from the iNaturalist, CUB-200, iMaterialist datasets. Colors indicate ground-truth grouping of tasks based on taxonomic or semantic types. Notice that the bird classification tasks extracted from CUB-200 embed near the bird classification task from iNaturalist, even though the original datasets are different. iMaterialist is well separated from iNaturalist, as it entails very different tasks (clothing attributes). Notice that some tasks of similar type (such as color attributes) cluster together but attributes of different task types may also mix when the underlying visual semantics are correlated. For example, the tasks of jeans (clothing type), denim (material) and ripped (style) recognition are close in the task embedding. (Right) T-SNE visualization of the domain embeddings (using mean feature activations) for the same tasks. Domain embedding can distinguish iNaturalist tasks from iMaterialist tasks due to differences in the two problem domains. However, the fashion attribute tasks on iMaterialist all share the same domain and only differ in their labels. In this case, the domain embeddings collapse to a region without recovering any sensible structure.]
  • Left panel: Task2Vec embeddings visualized with T-SNE (colors show the true grouping of the tasks)
  • Right panel: the same tasks embedded via the covariance between the network's activation outputs and visualized with T-SNE (colors show the true grouping of the tasks)

Example of evaluating the transferability of a pre-trained model: TransRate [Huang+ '22]
• Problem setting
  • Transfer from a source task $T_s$ to a target task $T_t$; a $C$-class classification problem
  • What is transferred: the pre-trained model. Access to the source-domain data is not assumed (the data are not transferred)
  • Pre-trained model: $F = f_{L+1} \circ \cdots \circ f_2 \circ f_1$ (an $L$-layer feature extractor and a 1-layer classifier $f_{L+1}$)
  • Number of layers transferred: $K$ ($K \le L$), i.e. the pre-trained feature extractor $g = f_K \circ \cdots \circ f_2 \circ f_1$ is transferred to the target domain
  • Target-task data: $\{(x_i, y_i)\}_{i=1}^{n}$, with features $z_i = g(x_i)$
• Structure of the target model: input → layers 1 to $K$ of the pre-trained model ($g$) → layers $K+1$ to $L$ with the same structure as the pre-trained model, followed by a new classification layer $f^t_{L+1}$ (together, the head $w$) → output
• Optimal target-domain model (both $g$ and $w$ are trained on the target task, as in standard fine-tuning):
  $(g^*, w^*) = \arg\max_{\tilde g \in \mathcal{G},\, w \in \mathcal{W}} \mathcal{L}(\tilde g, w) = \arg\max_{\tilde g \in \mathcal{G},\, w \in \mathcal{W}} \frac{1}{n} \sum_{i=1}^{n} \log p(y_i \mid z_i; \tilde g, w)$ subject to $\tilde g(0) = g$,
  where $\mathcal{L}$ is the log-likelihood and $\mathcal{G}$, $\mathcal{W}$ are the spaces of all possible feature extractors and heads
Def 1 (Transferability): the transferability of the pre-trained feature extractor $g$ from the source task to the target task is the expected log-likelihood of the optimal target-domain model $w^* \circ g^*$ under the target-domain data distribution:
  $\mathrm{Trf}_{T_s \to T_t}(g) := \mathbb{E}[\log p(y \mid z^*; g^*, w^*)]$, where $z^* = g^*(x)$ and $(x, y)$ is a random test sample of $T_t$
§ Usable as a criterion for selecting a feature extractor from a model zoo $\{g_m\}_{m=1}^{M}$ (the models may have different architectures and may have been pre-trained on different tasks)
§ Usable as a criterion for choosing which layers of a pre-trained model to transfer
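To make Def 1 concrete, the expected log-likelihood can be approximated by the average log-likelihood on held-out target samples. The sketch below shows only this brute-force estimate, which presupposes that each candidate has already been fine-tuned; it is not the TransRate estimator, which is designed to avoid that cost. The function name and the probability tables are hypothetical.

```python
import numpy as np

def empirical_transferability(probs, labels):
    """Mean log-likelihood (1/n) sum_i log p(y_i | x_i) on held-out target samples."""
    probs = np.asarray(probs, dtype=float)
    labels = np.asarray(labels, dtype=int)
    picked = probs[np.arange(len(labels)), labels]  # probability assigned to the true class
    return float(np.mean(np.log(picked)))

# Hypothetical predicted class probabilities of two fine-tuned candidates (C = 3).
labels = np.array([0, 2, 1, 0])
probs_a = np.array([[0.8, 0.1, 0.1], [0.1, 0.2, 0.7], [0.2, 0.6, 0.2], [0.7, 0.2, 0.1]])
probs_b = np.array([[0.4, 0.3, 0.3], [0.3, 0.4, 0.3], [0.3, 0.4, 0.3], [0.5, 0.3, 0.2]])
score_a = empirical_transferability(probs_a, labels)
score_b = empirical_transferability(probs_b, labels)
best = "a" if score_a > score_b else "b"  # pick the candidate with the higher score
```

Ranking the members of a model zoo, or the choices of $K$ for a single model, by this score corresponds to the two uses listed above.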

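Scaling laws for pre-training and fine-tuning
In transfer from synthetic data to real data (syn2real):
▶ High similarity between syn and real → pre-training is highly effective (more pre-training data improves performance)
▶ Low similarity between syn and real → pre-training has little effect (adding pre-training data does not help)
Is there a transfer-learning version of a scaling law (a relationship between data size and generalization error) that accounts for this phenomenon?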

Scaling laws for pre-training and fine-tuning [Mikami+ 2022]
Derives a scaling law describing how the effects of pre-training and fine-tuning on the generalization error change with the amount of data
§ With $n$ the pre-training data size and $s$ the fine-tuning data size, the following holds empirically:
  $L(n, s) = \delta(\gamma + n^{-\alpha})\, s^{-\beta}$
  • $L(n, s)$: generalization error
  • $\alpha > 0$: decay rate due to pre-training
  • $\beta > 0$: decay rate due to fine-tuning
  • $\gamma \ge 0$: transfer gap
  • $\delta > 0$: a coefficient
§ Scaling law when the fine-tuning data size $s$ is fixed:
  test error $\simeq D n^{-\alpha} + C$
  where $D > 0$ and the pre-training rate $\alpha > 0$ describe the convergence speed of pre-training, and $C \ge 0$ determines the lower limit of the error
§ The same scaling law is derived from a theoretical, NTK-based bound on the generalization error (consistent with the empirical version above)
  Theorem 1 (Informal). Let $\hat f_{n,s}(x)$ be a model of width $M$ pre-trained on $n$ samples $(x_1, y_1), \ldots, (x_n, y_n)$ and fine-tuned on $s$ samples $(x'_1, y'_1), \ldots, (x'_s, y'_s)$, where the inputs $x, x'$ follow the input distribution $p(x)$, $y = \phi_0(x)$, and $y' = \phi(x') = \phi_0(x') + \phi_1(x')$. Then the generalization error of the squared loss $L(n, s) = |\hat f_{n,s}(x) - \phi(x)|^2$ is bounded from above with high probability as
  $\mathbb{E}_x L(n, s) \le A_1 (c_M + A_0 n^{-\alpha})\, s^{-\beta} + \varepsilon_M$,
  where $c_M$ can be made arbitrarily small for large $M$, and $A_0$, $A_1$ are constants.
  • Assumptions of the analysis: a two-layer neural network with $M$ hidden units; averaged SGD as the optimizer; a regularization term on the distance between the weights and their initial (pre-trained) values
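The fixed-$s$ law, test error $\simeq D n^{-\alpha} + C$, can be fitted to measured errors to judge how much more pre-training data would help. The sketch below assumes a simple procedure, a grid over $\alpha$ with linear least squares for $(D, C)$; this is an illustrative choice, not necessarily the fitting method of [Mikami+ 2022], and the data are synthetic.

```python
import numpy as np

def fit_pretraining_law(n, err, alphas=None):
    """Fit err ~ D * n**(-alpha) + C: grid over alpha, least squares for (D, C)."""
    n = np.asarray(n, dtype=float)
    err = np.asarray(err, dtype=float)
    if alphas is None:
        alphas = np.linspace(0.05, 2.0, 40)  # candidate decay rates, step 0.05
    best = None
    for alpha in alphas:
        A = np.column_stack([n ** (-alpha), np.ones_like(n)])  # columns for D and C
        coef, *_ = np.linalg.lstsq(A, err, rcond=None)
        sse = float(np.sum((A @ coef - err) ** 2))
        if best is None or sse < best[0]:
            best = (sse, float(alpha), float(coef[0]), float(coef[1]))
    _, alpha, D, C = best
    return alpha, D, C

# Synthetic, noise-free errors generated with alpha = 0.5, D = 3, C = 0.12.
n = np.array([1e3, 2e3, 4e3, 8e3, 1.6e4, 3.2e4, 6.4e4])
err = 3.0 * n ** (-0.5) + 0.12
alpha, D, C = fit_pretraining_law(n, err)
```

The fitted $C$ is the error floor that no amount of additional pre-training data removes; a large $C$ relative to the target accuracy suggests changing the synthetic data rather than generating more of it. The sketch does not enforce $D > 0$ or $C \ge 0$, which a careful fit on noisy data would.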

Learning theory of unsupervised domain adaptation

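Typical upper bounds on the risk in unsupervised domain adaptation
Unsupervised domain adaptation
▶ Labeled source-domain data $D_S = \{(x_i^S, y_i^S)\}_{i=1}^{n_S}$
▶ Unlabeled target-domain data $D_T = \{x_j^T\}_{j=1}^{n_T}$
▶ Homogeneous domain shift: $\mathcal{X}_S = \mathcal{X}_T$, $P_X^S \neq P_X^T$
Under these conditions, learn the hypothesis that minimizes the expected risk on the target domain:
$h_T^* = \arg\min_{h \in H} R_T(h)$, where $R_T(h) = \mathbb{E}_{(x,y) \sim P^T_{X \times Y}}[\ell(y, h(x))]$
Typical form of an upper bound on $R_T(h)$:
$R_T(h) \le R_S(h) + \mathrm{Disc}(P_X^T, P_X^S) + \mathrm{Diff}(f_T, f_S)$
▶ $f_T$, $f_S$: the true output (labeling) functions of the target and source domains
▶ Disc: discrepancy between the marginal distributions of the source and target domains
▶ Diff: difference between the labeling functions of the source and target domains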

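Example I: upper bound on $R_T(h)$ based on the H-divergence
Consider binary classification
$H\Delta H$-divergence
$d_{H\Delta H}(P_X^S, P_X^T) := 2 \sup_{g \in H\Delta H} |P_X^S(I_g) - P_X^T(I_g)|$
▶ $H\Delta H = \{g = h \oplus h' \mid h, h' \in H\}$ ($\oplus$ is exclusive or)
▶ $x \in I_h \Leftrightarrow h(x) = 1$
▶ If the VC dimension of $H$ is $d$, then $d_{H\Delta H} \le \hat d_{H\Delta H} + O(\sqrt{d/n})$ (with high probability, ignoring logarithmic factors), where $\hat d_{H\Delta H}$ is the empirical divergence computed from samples of size $n$
Theorem [Ben-David+ 2010]
For any $h \in H$,
$R_T(h) \le R_S(h) + \frac{1}{2} d_{H\Delta H}(P_X^T, P_X^S) + \min_{h' \in H} (R_S(h') + R_T(h'))$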
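For a small hypothesis class the empirical divergence $\hat d_{H\Delta H}$ can be computed exactly. The sketch below assumes the class of one-dimensional thresholds $h_t(x) = 1[x \ge t]$, chosen only for illustration: $h_t \oplus h_{t'}$ is then the indicator of the interval $[t, t')$, so the supremum over $H\Delta H$ runs over intervals and equals the range of the difference between the two empirical CDFs.

```python
import numpy as np

def empirical_hdh_divergence_thresholds(xs, xt):
    """Empirical H-Delta-H divergence for 1D thresholds h_t(x) = 1[x >= t].

    h_t XOR h_t' is the indicator of [t, t'), so the supremum over H-Delta-H
    is a supremum over intervals of |P_S(I) - P_T(I)|.
    """
    xs = np.sort(np.asarray(xs, dtype=float))
    xt = np.sort(np.asarray(xt, dtype=float))
    cuts = np.concatenate([xs, xt, [np.inf]])  # candidate interval endpoints
    # Fraction of each sample lying strictly below every cut.
    fs = np.searchsorted(xs, cuts, side="left") / len(xs)
    ft = np.searchsorted(xt, cuts, side="left") / len(xt)
    gap = fs - ft
    # P_S([a, b)) - P_T([a, b)) = gap(b) - gap(a); maximize its absolute value.
    return 2.0 * float(gap.max() - gap.min())

same = empirical_hdh_divergence_thresholds([0, 1, 2, 3], [0, 1, 2, 3])
disjoint = empirical_hdh_divergence_thresholds([0.0, 0.5, 1.0], [2.0, 2.5, 3.0])
```

Identical samples give the minimum value 0 and samples with disjoint supports give the maximum value 2. For richer classes the supremum is not available in closed form and is typically approximated by training a classifier to tell the two domains apart.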

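Example I: upper bound on $R_T(h)$ based on the H-divergence (proof)
Let $h^* = \arg\min_{h' \in H} (R_S(h') + R_T(h'))$ and write $R_D(h, h') = \mathbb{E}_{x \sim P_X^D}[\ell(h(x), h'(x))]$ for $D \in \{S, T\}$. By the triangle inequality for the loss,
$R_T(h) = R_T(h, f_T)$
$\le R_T(h^*, f_T) + R_T(h, h^*)$
$\le R_T(h^*) + R_S(h, h^*) + |R_T(h, h^*) - R_S(h, h^*)|$
$\le R_T(h^*) + R_S(h) + R_S(h^*) + \frac{1}{2} d_{H\Delta H}(P_X^T, P_X^S)$
$= R_S(h) + \frac{1}{2} d_{H\Delta H}(P_X^T, P_X^S) + \min_{h' \in H} (R_S(h') + R_T(h'))$ □
▶ The third term is called the joint error, and the hypothesis $h^*$ that attains it is called the ideal joint hypothesis
▶ In general the joint error is not guaranteed to be small, and it cannot be estimated without label information from the target domain (so the bound may be loose)
▶ The H-divergence depends on the 0-1 loss (limited applicability)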

Example II: upper bound on $R_T(h)$ based on the Wasserstein distance
1-Wasserstein distance (restated)
$W_1(P_X^S, P_X^T) = \inf_{\pi \in \Pi(P_X^S, P_X^T)} \int_{\mathcal{X} \times \mathcal{X}} \|x - x'\| \, d\pi(x, x')$
Assumptions
▶ $\ell(y, y') = |y - y'|$
▶ Every hypothesis $h \in H$ is $K$-Lipschitz continuous
Theorem [Shen+ 2018]
$R_T(h) \le R_S(h) + 2K\, W_1(P_X^T, P_X^S) + \min_{h' \in H} (R_S(h') + R_T(h'))$
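In one dimension the 1-Wasserstein distance between two empirical distributions of equal size has a closed form: sort both samples and average the absolute differences. The sketch below uses that special case to evaluate the shift term $2K\,W_1$ of the bound; the data, the value of $K$, and the function name are assumptions made for illustration.

```python
import numpy as np

def wasserstein1_1d(xs, xt):
    """W1 between two equal-size 1D empirical distributions (sorted-sample formula)."""
    xs = np.sort(np.asarray(xs, dtype=float))
    xt = np.sort(np.asarray(xt, dtype=float))
    if xs.shape != xt.shape:
        raise ValueError("this sketch assumes equal sample sizes")
    # The optimal coupling in 1D matches the i-th smallest points of each sample.
    return float(np.mean(np.abs(xs - xt)))

rng = np.random.default_rng(0)
source = rng.normal(0.0, 1.0, size=1000)
target = source + 0.5  # pure translation, so W1 equals the shift
K = 2.0                # hypothetical Lipschitz constant of the hypotheses
w1 = wasserstein1_1d(source, target)
shift_term = 2 * K * w1  # the 2 K W1 term of the bound
```

Unlike the $H\Delta H$-divergence, this quantity stays finite and informative when the two supports do not overlap: it grows with the size of the shift instead of saturating.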

Example II: upper bound on $R_T(h)$ based on the Wasserstein distance (proof)
Let $h^* = \arg\min_{h' \in H} (R_S(h') + R_T(h'))$. Then
$R_T(h) \le R_T(h^*) + R_T(h^*, h)$
$\le R_T(h^*) + R_S(h, h^*) + R_T(h, h^*) - R_S(h, h^*)$
$\le R_T(h^*) + R_S(h, h^*) + 2K\, W_1(P_X^T, P_X^S)$
$\le R_T(h^*) + R_S(h) + R_S(h^*) + 2K\, W_1(P_X^T, P_X^S)$
$= R_S(h) + 2K\, W_1(P_X^T, P_X^S) + \min_{h' \in H} (R_S(h') + R_T(h'))$ □
▶ Lines 2 to 3 use the fact that $|h - h'|$ is $2K$-Lipschitz when $h$ and $h'$ are $K$-Lipschitz, together with Kantorovich-Rubinstein duality:
$R_T(h, h') - R_S(h, h') = \mathbb{E}_{P_X^T}[|h(x) - h'(x)|] - \mathbb{E}_{P_X^S}[|h(x) - h'(x)|] \le \sup_{f:\, 2K\text{-Lip}} \left( \mathbb{E}_{P_X^T}[f(x)] - \mathbb{E}_{P_X^S}[f(x)] \right) = 2K\, W_1(P_X^T, P_X^S)$

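Necessary conditions for unsupervised domain adaptation
Most Common Assumptions
▶ Covariate shift: $P_S(Y \mid X) = P_T(Y \mid X)$
▶ The discrepancy between the marginal input distributions of the source and target domains is small: $\mathrm{Disc}(P_X^S, P_X^T)$ is small
▶ There exists a common hypothesis with small error on both domains: $\lambda_H = \mathrm{Diff}(f_T, f_S) = \min_{h \in H} (R_S(h) + R_T(h))$ is small
Which of these three conditions are truly necessary?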

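Preliminaries i: Domain Adaptation Learner
Definition 2 (Domain Adaptation Learner)
$\mathcal{A} : \bigcup_{m=1}^{\infty} \bigcup_{n=1}^{\infty} (\mathcal{X} \times \{0, 1\})^m \times \mathcal{X}^n \to \{0, 1\}^{\mathcal{X}}$
▶ A description of a domain adaptation algorithm as a map
▶ That is, a map that takes labeled source-domain data and unlabeled target-domain data as input and outputs a hypothesis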

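Preliminaries ii: Learnability
Definition 3 (Learnability)
$\mathcal{A}$ $(\varepsilon, \delta, m, n)$-learns $P_T$ from $P_S$ relative to $H$ $:\Longleftrightarrow$
$\Pr_{S \sim_{\mathrm{i.i.d.}} (P^S_{X \times Y})^m,\ T_u \sim_{\mathrm{i.i.d.}} (P^T_X)^n} \left[ R_T(\mathcal{A}(S, T_u)) \le R_T(H) + \varepsilon \right] \ge 1 - \delta$
▶ $R_T(H) = \inf_{h \in H} R_T(h)$
▶ $S$: labeled source-domain data of size $m$
▶ $T_u$: unlabeled target-domain data of size $n$
In words: with probability at least $1 - \delta$ over the draw of the data, the expected risk of the hypothesis learned by $\mathcal{A}$ is at most the smallest expected risk achievable within $H$ plus $\varepsilon$.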

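Necessity Theorem i [Ben-David+ 2010, Redko+ 2019]
Theorem 1 (necessity of a small $H\Delta H$-divergence)
For any $\varepsilon > 0$, there exist a joint distribution $P^{S_0}_{X \times Y}$ on the source domain and a joint distribution $P^{T_0}_{X \times Y}$ on the target domain such that the following holds: for any domain adaptation learner $\mathcal{A}$ and any integers $m, n > 0$, there exists a labeling function $f : \mathcal{X} \to \{0, 1\}$ with
1. $\lambda_H < \varepsilon$
2. $P^{S_0}_{X \times Y}$ and $P^{T_0}_{X \times Y}$ satisfy the covariate shift condition
3. The expected risk is large with probability at least $\frac{1}{2}$:
$\Pr_{S \sim_{\mathrm{i.i.d.}} (P^{S_0}_{X \times Y})^m,\ T_u \sim_{\mathrm{i.i.d.}} (P^{T_0}_X)^n} \left[ R_{T_f}(\mathcal{A}(S, T_u)) \ge \frac{1}{2} \right] \ge \frac{1}{2}$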

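Necessity Theorem ii [Ben-David+ 2010, Redko+ 2019]
Theorem 2 (necessity of a small $\lambda_H$)
For any $\varepsilon > 0$, there exist a joint distribution $P^{S_0}_{X \times Y}$ on the source domain and a joint distribution $P^{T_0}_{X \times Y}$ on the target domain such that the following holds: for any domain adaptation learner $\mathcal{A}$ and any integers $m, n > 0$, there exists a labeling function $f : \mathcal{X} \to \{0, 1\}$ with
1. $d_{H\Delta H}(P^{S_0}_X, P^{T_0}_X) < \varepsilon$
2. $P^{S_0}_{X \times Y}$ and $P^{T_0}_{X \times Y}$ satisfy the covariate shift condition
3. The expected risk is large with probability at least $\frac{1}{2}$:
$\Pr_{S \sim_{\mathrm{i.i.d.}} (P^{S_0}_{X \times Y})^m,\ T_u \sim_{\mathrm{i.i.d.}} (P^{T_0}_X)^n} \left[ R_{T_f}(\mathcal{A}(S, T_u)) \ge \frac{1}{2} \right] \ge \frac{1}{2}$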

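Interpretation of the theorems
Even under the covariate shift assumption, if either of
▶ a small discrepancy between the marginal distributions
▶ the existence of an ideal hypothesis
is missing, the expected risk can be large with probability at least $\frac{1}{2}$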

Summary

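Summary of this talk
▶ Introduction to statistical machine learning and the difficulties of conventional machine learning
  • Growing scale of data and models
  • Inherently small-data settings
▶ Basic concepts of transfer learning
  • Formulation as minimization of the target-domain risk
  • When, what, and how to transfer?
    Instance transfer: importance-weighted learning
    Feature transfer: domain-adversarial learning
    Parameter transfer: pre-training and fine-tuning
▶ Transferability of pre-trained models and scaling laws
▶ Theory of unsupervised domain adaptation (expected-risk analysis)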

Transfer learning topics not covered in this talk
Domain adaptation is a subtopic of transfer learning, and transfer learning is a very broad framework that includes the following problems:
▶ Multi-task learning
▶ Knowledge distillation
▶ Meta-learning
▶ Few-shot learning
▶ Domain generalization and out-of-distribution generalization
▶ Continual learning
▶ Transfer learning for reinforcement learning (including meta reinforcement learning)

References
[1] A. Achille et al. Task2vec: Task embedding for meta-learning. In ICCV, pages 6430–6439, 2019.
[2] A. R. Zamir et al. Taskonomy: Disentangling task transfer learning. In CVPR, pages 3712–3722, 2018.
[3] H. Mikami et al. A scaling law for syn-to-real transfer: How much is your pre-training effective? In ECML/PKDD, 2022.
[4] H. Zhao et al. On learning invariant representations for domain adaptation, 2019.
[5] I. Redko et al. Optimal transport for multi-source domain adaptation under target shift. In AISTATS, 2019.
[6] I. Sato et al. Managing computer-assisted detection system based on transfer learning with negative transfer inhibition. In KDD, 2018.
[7] J. Quiñonero-Candela et al. Dataset shift in machine learning. The MIT Press, 2009.
[8] J. Shen et al. Wasserstein distance guided representation learning for domain adaptation. In AAAI, 2018.
[9] K. Zhou et al. Domain generalization with mixstyle. In ICLR, 2021.
[10] L. Bernstein et al. Freely scalable and reconfigurable optical hardware for deep learning. Scientific Reports, 11(1):1–12, 2021.
[11] L. Duan et al. Learning with augmented features for heterogeneous domain adaptation. In ICML, 2012.
[12] L. Huang et al. Frustratingly easy transferability estimation. In ICML, 2022.
[13] N. Courty et al. Optimal transport for domain adaptation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(9):1853–1865, 2016.
[14] S. Ben-David et al. A theory of learning from different domains. Machine Learning, 79(1):151–175, 2010.
[15] S. Kuroki et al. Unsupervised domain adaptation based on source-guided discrepancy. In AAAI, 2019.
[16] S. Rebuffi et al. Learning multiple visual domains with residual adapters. In NeurIPS, 2017.
[17] T. Brown et al. Language models are few-shot learners. In NeurIPS, 2020.
[18] T. Kanamori et al. A least-squares approach to direct importance estimation. JMLR, 10:1391–1445, 2009.
[19] Y. Ganin et al. Domain-adversarial training of neural networks. JMLR, 17(1):2096–2030, 2016.
[20] Y. Huang et al. Gpipe: Efficient training of giant neural networks using pipeline parallelism. In NeurIPS, 2019.
[21] I. Redko, E. Morvant, A. Habrard, M. Sebban, and Y. Bennani. Advances in domain adaptation theory. Elsevier, 2019.
[22] M. Sugiyama, T. Suzuki, and T. Kanamori. Density ratio estimation in machine learning. Cambridge University Press, 2012.
