Off-Policy Evaluation of Slate Bandit Policies via Optimizing Abstraction（日本語版）

Off-Policy Evaluation of Slate Bandit Policies via Optimizing Abstraction (⽇本語版)
Haruka Kiyohara, Masahiro Nomura, Yuta Saito 清原明加（Haruka Kiyohara） https://sites.google.com/view/harukakiyohara April 2024 OPE for slate bandits with abstraction @ WWW2024 1

“slate”と呼ばれる組合せによるアイテム推薦広告やファッションEC、医療では、“slate”と呼ばれる組合せを最適化。 April 2024 OPE for slate bandits with abstraction
@ WWW2024 2 スマートデバイスで快適な住居を整える今なら20%オフ! 意思決定 1: キャッチコピーログデータを使って “slate” 方策の性能をオフライン評価したい！意思決定 2: イメージ図意思決定 3: 割引率

データ生成過程とオフライン評価の目標 • ⽂脈情報 (context) 𝑥 ∈ 𝑋 • ユーザーの属性情報など •
組合せ意思決定 (slate) 𝑠 = 𝑎1 , 𝑎2 , ⋯ , 𝑎𝐿 ∈ 𝑆 = ∑!∈[$] 𝐴𝑙 • 各意思決定 (slot action) 𝑎𝑙 ∈ 𝐴𝑙 • 例えばメール配信最適化の場合、 𝐴1 は件名の候補、𝐴2 は挿⼊する画像の候補など。 • 報酬 (reward) 𝑟 • クリックや購⼊ April 2024 OPE for slate bandits with abstraction @ WWW2024 3

• ⽂脈情報 (context) 𝑥 ∈ 𝑋 • ユーザーの属性情報など • 組合せ意思決定
(slate) 𝑠 = 𝑎1 , 𝑎2 , ⋯ , 𝑎𝐿 ∈ 𝑆 = ∑!∈[$] 𝐴𝑙 • 各意思決定 (slot action) 𝑎𝑙 ∈ 𝐴𝑙 • 例えばメール配信最適化の場合、 𝐴1 は件名の候補、𝐴2 は挿⼊する画像の候補など。 • 報酬 (reward) 𝑟 • クリックや購⼊データ生成過程とオフライン評価の目標 April 2024 OPE for slate bandits with abstraction @ WWW2024 4 オフ方策評価 (Off-Policy Evaluation) の⽬標は、⽅策価値を正確に推定すること。これを過去のデプロイ⽅策 (logging policy 𝜋0 ) が集めたデータを使って⾏いたい。

分布シフトと重点サンプリング [Strehl+,10] 𝜋0 と 𝜋 が異なるため、分布シフトを補正する必要がある。 April 2024 OPE for
slate bandits with abstraction @ WWW2024 5 evaluation logging slate A slate B 多い少ない少ない多い

slate bandits with abstraction @ WWW2024 6 evaluation logging slate A slate B ・不偏性多い少ない少ない多い

slate bandits with abstraction @ WWW2024 7 evaluation logging slate A ・不偏性・⼤きな分散 slot に対し slate は指数関数的に増加多い少ない

線形性の過程とPI推定量 [Swamminathan+,17] PI 推定量は slot 間での価値の線形性を仮定することで、分散を減らそうとする。 April 2024 OPE for
slate bandits with abstraction @ WWW2024 8 メール全体での期待報酬 = 件名の価値 + クーポンの価値 + 添付画像の価値 + …

線形性の過程とPI推定量 [Swamminathan+,17] PI 推定量は slot 間での価値の線形性を仮定することで、分散を減らそうとする。この時、推定量は以下のようになる。 April 2024 OPE
for slate bandits with abstraction @ WWW2024 9 重み (importance weight; iw) は slot毎の重みの総積ではなく、総和になっている

既存研究のまとめ • 重点サンプリング (Inverse Propensity Scoring; IPS) • 報酬に関する仮定はなし •
(利点) どんな報酬構造に対しても不偏性を満たす • (⽋点) 組合せ⾏動空間が⼤きくなることで、⾮常に⼤きな分散を発⽣する • 線形性 (PseudoInverse; PI) • 報酬は slot-action の価値の線形和として表される • (利点) 重みが線形和になり、IPSより分散が⼩さい • (⽋点) 線形性の強い仮定をおく必要があり、偏りを⽣じる報酬の構造に関する仮定を置かずに IPS よりも分散を減らすことはできる? April 2024 OPE for slate bandits with abstraction @ WWW2024 10

Our proposal: Latent IPS April 2024 OPE for slate bandits
with abstraction @ WWW2024 11

アイデア：潜在表現上で重点サンプリング slateの⾏動空間は⼤きいので、代わりに潜在表現を⽤い重点サンプリングする。 April 2024 OPE for slate bandits with abstraction
@ WWW2024 12 • slate間の類似性を考慮する • 報酬に関する仮定を置くことなくバリアンスを減らせる利点と特徴 encoder decoder slate の特徴量 slate の特徴量報酬予測潜在表現重点サンプリング

提案手法: Latent IPS (LIPS) 潜在表現 𝜙: 𝑆 ⟶ 𝑍 を使って、以下のように重点サンプリングを⾏う。
April 2024 OPE for slate bandits with abstraction @ WWW2024 13 潜在表現上での重点サンプリング

提案手法: Latent IPS (LIPS) 潜在表現 𝜙: 𝑆 ⟶ 𝑍 を使って、以下のように重点サンプリングを⾏う。
潜在表現が⽂脈依存の潜在表現分布を持つことを考えると、 April 2024 OPE for slate bandits with abstraction @ WWW2024 14 潜在表現上での重点サンプリング

LIPSの統計的性質 (1/3) • LIPS は以下の sufficient slate abstraction を使うと不偏性を満たす。 (Sufficient
slate abstraction) 潜在表現関数 𝜙𝜃 は同じ潜在表現を持つ⼆つの slate 𝑠 ∈ 𝑆, and 𝑠' ∈ 𝑠'' ∈ 𝑆 𝜙𝜃 𝑠 = 𝜙𝜃 (𝑠′′)} が同じ報酬 𝑞 𝑥, 𝑠 = 𝑞(𝑥, 𝑠′) を持つ時、潜在表現は “sufficient (⼗分)” であると定義する。 (例えば、one-hot encoding は “sufficient slate abstraction” の⼀つ) April 2024 OPE for slate bandits with abstraction @ WWW2024 15

LIPSの統計的性質 (2/3) • sufficient slate abstraction を使わなくても、偏りは以下のようになる。 April 2024 OPE
for slate bandits with abstraction @ WWW2024 16 ① ② ③

LIPSの統計的性質 (2/3) • sufficient slate abstraction を使わなくても、偏りは以下のようになる。 April 2024 OPE
for slate bandits with abstraction @ WWW2024 17 ① ② ③ ① 潜在表現から slate がどの程度判別できるか ② ⼆つの slates の間で期待報酬がどの程度異なるか ③ ⼆つの slates の間で元々の slate の重みがどの程度異なるか潜在表現の持つ情報量が⼤きいほど偏りは⼩さくなる

LIPSの統計的性質 (3/3) • LIPS は元々の重点サンプリングに⽐べ、分散を⼤幅に減少する。 April 2024 OPE for slate
bandits with abstraction @ WWW2024 18 ① ② ① 潜在表現で条件づけた時の報酬の分散 ② 潜在表現で条件づけた slate の選択確率 𝜋0 (𝑠|𝑥, 𝜙𝜃 (𝑠)) の分散分散減少量は潜在表現の持つ情報量が⼩さいほど⼤きくなる。

偏りと分散のトレードオフ潜在表現の持つ情報量の⼤きさが偏りと分散の関係に⼤きく影響を及ぼす。 April 2024 OPE for slate bandits with abstraction
@ WWW2024 19 情報量が少ない偏り分散情報量が多い

偏りと分散のトレードオフ潜在表現の持つ情報量の⼤きさが偏りと分散の関係に⼤きく影響を及ぼす。推定誤差を⼩さくするには、潜在表現の情報量をうまく調整する必要がある: April 2024 OPE for slate bandits with
abstraction @ WWW2024 20 情報量が少ない偏り分散情報量が多い

潜在表現をどのように学習するか？以下の損失関数を使うことで偏りと分散のトレードオフを調整する。 April 2024 OPE for slate bandits with abstraction
@ WWW2024 21 𝜷: パタメタ.

Experiments April 2024 OPE for slate bandits with abstraction @
WWW2024 22

実験設定 April 2024 OPE for slate bandits with abstraction @
WWW2024 23 • 2つのデータ: Wiki10-31K, Eurlex-4K • ⾮線形の報酬関数 (共起関係) (条件つき報酬) (重要なアイテム) ※ (𝑎1 , 𝑎2 , ⋯ , 𝑎 [𝐿/2]) は saficient slate abstraction の⼀つ。

比較手法 • Inverse Propensity Scoring (IPS) • PseudoInverse (PI) April
2024 OPE for slate bandits with abstraction @ WWW2024 24 • LIPS (w/ data-driven choice of 𝛽) • LIPS (w/ best 𝛽)

比較手法 • Inverse Propensity Scoring (IPS) • PseudoInverse (PI) •
Direct Method (DM) – 回帰を使った⼿法 [Beygelzimer&Langford,09] • Marginalized IPS – saficient slate abstraction (𝑎1 , 𝑎2 , ⋯ , 𝑎 [𝐿/2]) によるIPS [Saito&Joachims,22] April 2024 OPE for slate bandits with abstraction @ WWW2024 25 • LIPS (w/ data-driven choice of 𝛽) • LIPS (w/ best 𝛽)

評価指標と実験パラメタ評価指標は⼆乗誤差を使⽤： April 2024 OPE for slate bandits with abstraction
@ WWW2024 26 (50個のログデータの平均で計算)

評価指標と実験パラメタ評価指標は⼆乗誤差を使⽤：以下の条件を変えて推定量を⽐較する。 • データ数 (𝑛): {1000, 2000, 4000, 8000,
16000} • slateの要素数 (𝐿): {4, 6, 8, 10, 12} April 2024 OPE for slate bandits with abstraction @ WWW2024 27 value: デフォルト値 (50個のログデータの平均で計算)

データ数を変えた実験 April 2024 OPE for slate bandits with abstraction @
WWW2024 28 データ数が⼩さい時に、他の推定量と⽐べより正確に

slateの要素数を変えた実験 April 2024 OPE for slate bandits with abstraction @
WWW2024 29 LIPSは様々な条件 (報酬関数や slateの要素数) の元で正確に

まとめ • 組合せを伴う slate のオフ⽅策評価について研究。 • 既存推定量は slate の⼤きな⾏動空間によって発⽣する分散と、報酬に関する線形性の強い仮定による偏りの問題があった。
• LIPSは潜在表現上で重点サンプリングを⾏い分散を軽減し, さらに潜在表現の情報量を調整することで偏りと分散をどちらも⼩さくする。 • 実験でも、様々なデータ数やslateの要素数に対してLIPSが安定して正確なオフライン推定を⾏うことを確認。 April 2024 OPE for slate bandits with abstraction @ WWW2024 30

Appendix April 2024 OPE for slate bandits with abstraction @
WWW2024 31

DR推定量との比較 • Doubly Robust (DR) は回帰と重点サンプリングを両⽅使って推定。 April 2024 OPE for
slate bandits with abstraction @ WWW2024 32 分散減少のための制御変数 [Dudík+,14] [Vlassis+,21] [Saito+,23]

データ数を変えた実験 April 2024 OPE for slate bandits with abstraction @
WWW2024 33 データ数が⼩さい時に、他の推定量と⽐べより正確に

slateの要素数を変えた実験 April 2024 OPE for slate bandits with abstraction @
WWW2024 34 LIPSは様々な条件 (報酬関数や slateの要素数) の元で正確に

𝛽の値を変えた実験 April 2024 OPE for slate bandits with abstraction @
WWW2024 35 偏りと分散のトレードオフ: ・𝜷 が⼩さい時に偏りが⼩さい・𝜷 が⼤きい時に分散が⼩さい

References April 2024 OPE for slate bandits with abstraction @
WWW2024 36

References (1/2) [Strehl+,10] Alex Strehl, John Langford, Sham Kakade, Lihong
Li. “Learning from Logged Implicit Exploration Data.” NeurIPS, 2010. https://arxiv.org/abs/1003.0120 [Swamminathan+,17] Adith Swaminathan, Akshay Krishnamurthy, Alekh Agarwal, Miroslav Dudík, John Langford, Damien Jose, Imed Zitouni. “Off-policy evaluation for slate recommendation.” NeurIPS, 2017. https://arxiv.org/abs/1605.04812 [Beygelzimer&Langford,09] Alina Beygelzimer, John Langford. “The Offset Tree for Learning with Partial Labels.” KDD, 2009. https://arxiv.org/abs/0812.4044 [Saito&Joachims,22] Yuta Saito, Thorsten Joachims. “Off-Policy Evaluation for Large Action Spaces via Embeddings.” ICML, 2022. https://arxiv.org/abs/2202.06317 [Dudík+,14] Miroslav Dudík, Dumitru Erhan, John Langford, and Lihong Li. “Doubly Robust Policy Evaluation and Optimization.” ICML, 2011. https://arxiv.org/abs/1503.02834 April 2024 OPE for slate bandits with abstraction @ WWW2024 37

References (2/2) [Vlassis+,21] Nikos Vlassis, Ashok Chandrashekar, Fernando Amat Gil,
Nathan Kallus. “Control Variates for Slate Off-Policy Evaluation.” NeurIPS, 2021. https://arxiv.org/abs/1003.0120 [Saito+,23] Yuta Saito, Qingyang Ren, Thorsten Joachims. “Off-Policy Evaluation for Large Action Spaces via Conjunct Effect Modeling.” ICML, 2023. https://arxiv.org/abs/1003.0120 April 2024 OPE for slate bandits with abstraction @ WWW2024 38

Off-Policy Evaluation of Slate Bandit Policies ...

Off-Policy Evaluation of Slate Bandit Policies via Optimizing Abstraction（日本語版）

More Decks by Haruka Kiyohara

Other Decks in Research

Featured

Transcript