Policy-Adaptive Estimator Selection for Off-Policy Evaluation

Takuma Udagawa, Haruka Kiyohara, Yusuke Narita, Yuta Saito, Kei Tateno
Haruka Kiyohara, Tokyo Institute of Technology
https://sites.google.com/view/harukakiyohara
September 2022 Policy Adaptive Estimator Selection (PAS-IF)

Content
• Introduction to Off-Policy Evaluation (OPE)
• Estimator Selection for OPE
• Our proposal: Policy-Adaptive Estimator Selection via Importance Fitting (PAS-IF)
• Synthetic Experiments
  • Estimator Selection
  • Policy Selection

Off-Policy Evaluation
Motivation towards Estimator Selection

Interactions in recommender systems
A behavior policy π_b interacts with users and collects logged data (logged bandit feedback). Each interaction consists of:
• a coming user (context)
• an item (action)
• a user feedback (reward)

Off-Policy Evaluation
The goal is to evaluate the performance of an evaluation policy π_e offline, using only the logged bandit feedback collected by the behavior policy π_b, instead of running an A/B test. An OPE estimator takes the logged bandit feedback as input and outputs an estimate of the policy performance.

Representative OPE estimators
We aim to reduce both bias and variance to enable an accurate OPE.
• Direct Method (DM) [Beygelzimer&Langford,09]: model-based (uses a reward predictor). Bias: high. Variance: low.
• Inverse Propensity Scoring (IPS) [Precup+,00] [Strehl+,10]: importance sampling-based (uses the importance weight, the ratio of the evaluation policy to the behavior policy). Bias: unbiased. Variance: very high.
• Doubly Robust (DR) [Dudík+,14]: both model-based and importance sampling-based (the reward predictor acts as a control variate). Bias: unbiased. Variance: lower than IPS, but still high.
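
The three estimators can be sketched in a few lines of NumPy. This is a minimal illustration using the standard textbook forms, not the code behind the slides; the toy data, the array layout (one row per logged round, one column per action), and the function names are assumptions made for the example.

```python
import numpy as np

def direct_method(q_hat, pi_e):
    """DM: average the reward predictor q_hat(x, a) under the evaluation policy."""
    return float(np.mean(np.sum(pi_e * q_hat, axis=1)))

def ips(reward, action, pi_b, pi_e):
    """IPS: reweight observed rewards by the importance weight pi_e / pi_b."""
    rows = np.arange(len(action))
    w = pi_e[rows, action] / pi_b[rows, action]
    return float(np.mean(w * reward))

def doubly_robust(reward, action, pi_b, pi_e, q_hat):
    """DR: DM baseline plus an importance-weighted correction of its residual."""
    rows = np.arange(len(action))
    w = pi_e[rows, action] / pi_b[rows, action]
    baseline = np.sum(pi_e * q_hat, axis=1)
    return float(np.mean(baseline + w * (reward - q_hat[rows, action])))

# Toy logged data (hypothetical): 3 actions, context-independent policies.
rng = np.random.default_rng(0)
n = 20000
q_true = np.array([0.2, 0.5, 0.8])            # expected reward of each action
pi_b = np.tile([1 / 3, 1 / 3, 1 / 3], (n, 1))  # uniform behavior policy
pi_e = np.tile([0.1, 0.2, 0.7], (n, 1))        # evaluation policy
action = rng.integers(0, 3, size=n)            # uniform draws, matching pi_b
reward = rng.binomial(1, q_true[action]).astype(float)
q_hat = np.tile(q_true, (n, 1))                # a perfect reward predictor, for illustration only

true_value = float(np.sum(pi_e[0] * q_true))
print("true:", true_value)
print("DM:  ", direct_method(q_hat, pi_e))
print("IPS: ", ips(reward, action, pi_b, pi_e))
print("DR:  ", doubly_robust(reward, action, pi_b, pi_e, q_hat))
```

With a perfect reward predictor DM recovers the true value exactly; in practice the predictor is misspecified, which is where DM's bias comes from.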

Advanced OPE estimators
To reduce the variance of IPS/DR, many OPE estimators that modify the importance weights have been proposed.
• Self-Normalized (IPS/DR) [Swaminathan&Joachims,15]
• Clipped (IPS/DR) * [Su+,20a]
• Switch (DR) * [Wang+,17]
• Optimistic Shrinkage (DR) * [Su+,20a]
• Subgaussian (IPS/DR) * [Metelli+,21]
* requires tuning of a hyperparameter λ, e.g., by SLOPE [Su+,20b] [Tucker&Lee,21]
Which OPE estimator should be used to enable an accurate OPE?
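
Two of these weight modifications can be sketched as follows. This is an illustrative sketch only: self-normalization divides by the sum of the weights, and clipping is written here as min(w, λ), one common form that may differ in detail from the definitions in the cited papers. The function names are hypothetical.

```python
import numpy as np

def self_normalized_ips(reward, w):
    """SNIPS: normalize by the sum of importance weights instead of the sample size."""
    return float(np.sum(w * reward) / np.sum(w))

def clipped_ips(reward, w, lam):
    """Clipped IPS: cap each importance weight at the hyperparameter lam (assumed form)."""
    return float(np.mean(np.minimum(w, lam) * reward))

# Two logged rounds with importance weights 1 and 3 and reward 1 in both.
w = np.array([1.0, 3.0])
reward = np.array([1.0, 1.0])
print(np.mean(w * reward))             # plain IPS
print(self_normalized_ips(reward, w))  # weights are rescaled to sum to one
print(clipped_ips(reward, w, lam=2.0)) # the weight 3 is capped at 2
```

A smaller λ lowers variance at the cost of more bias, which is why λ has to be tuned.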

Motivation towards data-driven estimator selection
Estimator selection is important, but the best estimator can be different under different situations: an estimator that is among the best for one task can be among the worst for another. Data size, evaluation policy, and reward noise all matter.
Estimator Selection: How to identify the most accurate OPE estimator using only the available logged data?

Estimator Selection for OPE

Objective for estimator selection
The goal is to identify the most accurate OPE estimator in terms of MSE. The MSE measures the squared gap between the true policy value (the estimand) and the value estimated from the logged data. Because the true policy value is unknown, the MSE itself has to be estimated.
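
As a reference point, the MSE and its bias-variance decomposition can be computed directly when the true policy value is known, which is only the case in synthetic setups. The sketch below is such an oracle; the function names are hypothetical, and `estimates` is assumed to hold one estimate per independently drawn logged dataset.

```python
import numpy as np

def mse_bias_variance(estimates, true_value):
    """Return (MSE, bias, variance) of repeated estimates around a known true value."""
    estimates = np.asarray(estimates, dtype=float)
    bias = estimates.mean() - true_value
    variance = estimates.var()  # population variance, so MSE = bias**2 + variance
    mse = np.mean((estimates - true_value) ** 2)
    return float(mse), float(bias), float(variance)

def oracle_select(estimates_by_name, true_value):
    """Pick the estimator name with the smallest oracle MSE."""
    scores = {name: mse_bias_variance(est, true_value)[0]
              for name, est in estimates_by_name.items()}
    return min(scores, key=scores.get), scores

# An unbiased but noisy estimator versus a biased but stable one.
best, scores = oracle_select({"noisy": [1.0, 3.0], "stable": [2.5, 2.5]}, true_value=2.0)
print(best, scores)
```

Estimator selection methods aim to approximate this oracle choice without access to `true_value`.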

Baseline: non-adaptive heuristic [Saito+,21a] [Saito+,21b]
Suppose we have logged data from previous A/B tests. The heuristic treats one of the logged policies as a pseudo-evaluation policy and estimates the MSE of an estimator by comparing its OPE estimate for that policy, computed from the data logged by the other policy, with the on-policy policy value, averaged over bootstrapped samples.
※ S is a set of random states for bootstrapping.
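
The heuristic can be sketched as below. This is a simplified reading of the slide, not the authors' implementation: policies are assumed context-free (one probability per action), IPS stands in for an arbitrary candidate estimator, and all names are hypothetical. Each bootstrap draw plays the role of one random state in S.

```python
import numpy as np

def ips(reward, action, pi_b, pi_e):
    """IPS with context-free policies given as probability vectors over actions."""
    w = pi_e[action] / pi_b[action]
    return float(np.mean(w * reward))

def non_adaptive_mse(estimator, reward_a, reward_b, action_b, pi_a, pi_b,
                     n_bootstrap=50, seed=0):
    """Estimate an estimator's MSE with pi_a as the pseudo-evaluation policy.

    reward_a: rewards logged under pi_a (gives the on-policy policy value).
    reward_b, action_b: data logged under pi_b (input to the OPE estimate).
    """
    rng = np.random.default_rng(seed)
    v_on = float(np.mean(reward_a))          # on-policy policy value of pi_a
    n = len(reward_b)
    squared_errors = []
    for _ in range(n_bootstrap):             # one iteration per random state
        idx = rng.integers(0, n, size=n)     # bootstrap resample of the pi_b data
        v_hat = estimator(reward_b[idx], action_b[idx], pi_b, pi_a)
        squared_errors.append((v_on - v_hat) ** 2)
    return float(np.mean(squared_errors))
```

Note that the pseudo-evaluation policy is fixed by whatever A/B test data happens to exist, regardless of which evaluation policy is of interest.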

Does non-adaptive heuristic work?
Do the estimators selected in this way really work well? The non-adaptive heuristic judges estimators by the estimate V̂(π_A; D_B), so the accuracy it estimates can deviate from the true performance on the task of interest (π_b, π_e). The non-adaptive heuristic does not consider the difference among OPE tasks.
How to choose OPE estimators adaptively to the given OPE task (e.g., evaluation policy)?

PAS-IF
Policy Adaptive Estimator Selection via Importance Fitting

Is it possible to make pseudo-policies adaptive?
The non-adaptive heuristic calculates MSE using two datasets collected by A/B tests: within the total amount of logged data (~π_b), the part collected by π_B serves as the pseudo-behavior policy data and the part collected by π_A serves as the pseudo-evaluation policy data.
We instead aim to split the logged dataset adaptively to the given OPE task, into a pseudo-behavior part (~π̃_b) and a pseudo-evaluation part (~π̃_e).

Subsampling function controls the pseudo-policies
We now introduce a subsampling function ρ. It splits the total amount of logged data (~π_b) into a pseudo-behavior part (~π̃_b) and a pseudo-evaluation part (~π̃_e), so the choice of ρ determines both pseudo-policies.
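
To make the role of the subsampling function concrete, the sketch below assumes that ρ gives, for each logged action, the probability of sending that sample to the pseudo-evaluation dataset (the rest goes to the pseudo-behavior dataset). The slide's exact definition is not captured in this transcript, so this rule, the context-free setting, and the function names are all assumptions. Under this rule the induced pseudo-policies follow from Bayes' rule.

```python
import numpy as np

def induced_pseudo_policies(pi_b, rho):
    """Pseudo-behavior and pseudo-evaluation policies induced by rho (context-free)."""
    pseudo_b = pi_b * (1.0 - rho)   # mass of samples kept in the pseudo-behavior data
    pseudo_e = pi_b * rho           # mass of samples sent to the pseudo-evaluation data
    return pseudo_b / pseudo_b.sum(), pseudo_e / pseudo_e.sum()

def subsample(action, reward, rho, seed=0):
    """Randomly split logged rounds according to rho[action]."""
    rng = np.random.default_rng(seed)
    to_eval = rng.random(len(action)) < rho[action]
    pseudo_behavior = (action[~to_eval], reward[~to_eval])
    pseudo_evaluation = (action[to_eval], reward[to_eval])
    return pseudo_behavior, pseudo_evaluation

pi_b = np.array([0.5, 0.5])
rho = np.array([0.2, 0.8])
print(induced_pseudo_policies(pi_b, rho))
```

Here a uniform behavior policy is turned into a pseudo-behavior policy that prefers action 0 and a pseudo-evaluation policy that prefers action 1, purely through the choice of ρ.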

How to optimize the subsampling function?
PAS-IF optimizes the subsampling function ρ to reproduce the bias-variance tradeoff of the original OPE.
Objective of importance fitting: fit the importance weight between the pseudo-policies to the importance weight of the original OPE task. [equation not captured in the transcript]
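
One way to read "importance fitting" is as a squared-error fit between the importance weight of the original task, π_e/π_b, and the importance weight between the pseudo-policies. The sketch below takes that reading as an assumption (the slide's objective is not captured in this transcript), uses context-free policies, and assumes ρ is the probability of assigning a sample to the pseudo-evaluation data. Function names are hypothetical.

```python
import numpy as np

def pseudo_importance_weight(pi_b, rho):
    """Importance weight between the pseudo-policies induced by rho."""
    mean_rho = np.sum(pi_b * rho)                 # expected share of pseudo-evaluation data
    pseudo_e_ratio = rho / mean_rho               # pseudo-evaluation policy / pi_b
    pseudo_b_ratio = (1.0 - rho) / (1.0 - mean_rho)  # pseudo-behavior policy / pi_b
    return pseudo_e_ratio / pseudo_b_ratio

def importance_fitting_loss(pi_b, pi_e, rho):
    """Expected squared gap between original and pseudo importance weights under pi_b."""
    w = pi_e / pi_b
    w_tilde = pseudo_importance_weight(pi_b, rho)
    return float(np.sum(pi_b * (w - w_tilde) ** 2))

pi_b = np.array([0.5, 0.5])
pi_e = np.array([0.2, 0.8])
print(importance_fitting_loss(pi_b, pi_e, np.array([0.5, 0.5])))  # rho ignores the task
print(importance_fitting_loss(pi_b, pi_e, np.array([0.4, 0.6])))  # rho leans towards pi_e
```

A constant ρ yields pseudo-policies identical to π_b and therefore no distribution shift; a ρ that favors the actions preferred by π_e brings the pseudo importance weight closer to the original one.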

Key contribution of PAS-IF
PAS-IF enables MSE estimation that is:
• Data-driven -> by splitting the logged data into pseudo datasets
• Adaptive -> by optimizing the subsampling function to simulate the distribution shift of the original OPE task
-> Accurate estimator selection!

Synthetic Experiment

Experimental settings
We compare PAS-IF and the non-adaptive heuristic in two tasks.
1. Estimator Selection
2. Policy Selection using the selected estimator
In task 1, the hyperparameter of each candidate estimator is tuned first*, and estimator selection is then performed among the tuned estimators.
* SLOPE [Su+,20b] [Tucker&Lee,21]

PAS-IF enables an accurate estimator selection
PAS-IF enables far more accurate estimator selection by being adaptive. PAS-IF is accurate across various evaluation policies.
[Figure: estimator selection results, with panels for π_b1 and π_b2; lower is better. m̂ is the selected estimator and m* is the true best estimator.]

Experimental settings
In task 2 (policy selection using the selected estimator), PAS-IF uses a different estimator for each candidate policy (V̂_1, V̂_2, V̂_3), whereas the non-adaptive heuristic uses one universal estimator V̂ for all policies.

Moreover, PAS-IF also benefits policy selection
PAS-IF also shows a favorable result in the policy selection task. PAS-IF can identify better policies among many candidates by using a different (appropriate) estimator for each policy!
[Figure: policy selection results; lower is better. π̂ is the selected policy and π* is the true best policy.]

Summary
• Estimator selection is important to enable an accurate OPE.
• The non-adaptive heuristic fails to be adaptive to the given OPE task.
• PAS-IF enables an adaptive and accurate estimator selection by subsampling and optimizing the pseudo OPE datasets.
PAS-IF will help identify an accurate OPE estimator in practice!

Thank you for listening!
Feel free to ask any questions, and discussions are welcome!

Example case of importance fitting
When we have [condition not captured in the transcript] ⇒ PAS-IF can produce a similar distribution shift!
Note: the simplified case of [expression not captured in the transcript].

Detailed optimization procedure of PAS-IF
We optimize the subsampling rule ρ_θ via gradient descent. To keep the data size similar to that of the original OPE task, PAS-IF also imposes a regularization on the data size. We tune λ so that [condition not captured in the transcript].
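
A toy version of this procedure is sketched below. Everything specific here is an assumption rather than the authors' formulation: the fit term is a squared error between importance weights, the data-size regularizer is written as a squared gap between the expected pseudo-evaluation share and a hypothetical `target_frac`, ρ_θ is a per-action sigmoid, policies are context-free, and gradients are taken by finite differences for brevity.

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def objective(theta, pi_b, pi_e, lam, target_frac):
    """Importance-fitting error plus lam times an assumed data-size penalty."""
    rho = sigmoid(theta)                          # rho_theta, one value per action
    mean_rho = np.sum(pi_b * rho)                 # expected pseudo-evaluation share
    w_tilde = (rho / mean_rho) / ((1.0 - rho) / (1.0 - mean_rho))
    fit = np.sum(pi_b * (pi_e / pi_b - w_tilde) ** 2)
    size_penalty = (mean_rho - target_frac) ** 2  # hypothetical regularizer
    return float(fit + lam * size_penalty)

def numerical_grad(f, theta, eps=1e-6):
    """Central finite-difference gradient of f at theta."""
    grad = np.zeros_like(theta)
    for i in range(len(theta)):
        step = np.zeros_like(theta)
        step[i] = eps
        grad[i] = (f(theta + step) - f(theta - step)) / (2.0 * eps)
    return grad

def fit_rho(pi_b, pi_e, lam=1.0, target_frac=0.2, lr=0.05, n_steps=500):
    """Plain gradient descent on theta, starting from rho = 0.5 everywhere."""
    theta = np.zeros(len(pi_b))
    f = lambda t: objective(t, pi_b, pi_e, lam, target_frac)
    for _ in range(n_steps):
        theta = theta - lr * numerical_grad(f, theta)
    return sigmoid(theta), f(theta)

pi_b = np.array([0.5, 0.5])
pi_e = np.array([0.2, 0.8])
rho, final_objective = fit_rho(pi_b, pi_e)
print(rho, final_objective)
```

The regularizer matters because the fit term alone tends to push almost all data into one pseudo dataset, which would make the pseudo task's data size unlike the original one.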

Key idea of PAS-IF
How about subsampling the logged data and constructing a pseudo-evaluation policy that has a bias-variance tradeoff similar to the given OPE task?
(S is a set of random states for bootstrapping)

References

References (1/4)
[Beygelzimer&Langford,09] Alina Beygelzimer and John Langford. “The Offset Tree for Learning with Partial Labels.” KDD, 2009.
[Precup+,00] Doina Precup, Richard S. Sutton, and Satinder Singh. “Eligibility Traces for Off-Policy Policy Evaluation.” ICML, 2000. https://scholarworks.umass.edu/cgi/viewcontent.cgi?article=1079&context=cs_faculty_pubs
[Strehl+,10] Alex Strehl, John Langford, Sham Kakade, and Lihong Li. “Learning from Logged Implicit Exploration Data.” NeurIPS, 2010. https://arxiv.org/abs/1003.0120
[Dudík+,14] Miroslav Dudík, Dumitru Erhan, John Langford, and Lihong Li. “Doubly Robust Policy Evaluation and Optimization.” Statistical Science, 2014. https://arxiv.org/abs/1503.02834

References (2/4)
[Swaminathan&Joachims,15] Adith Swaminathan and Thorsten Joachims. “The Self-Normalized Estimator for Counterfactual Learning.” NeurIPS, 2015. https://dl.acm.org/doi/10.5555/2969442.2969600
[Wang+,17] Yu-Xiang Wang, Alekh Agarwal, and Miroslav Dudík. “Optimal and Adaptive Off-policy Evaluation in Contextual Bandits.” ICML, 2017. https://arxiv.org/abs/1612.01205
[Metelli+,21] Alberto M. Metelli, Alessio Russo, and Marcello Restelli. “Subgaussian and Differentiable Importance Sampling for Off-Policy Evaluation and Learning.” NeurIPS, 2021. https://proceedings.neurips.cc/paper/2021/hash/4476b929e30dd0c4e8bdbcc82c6ba23a-Abstract.html

References (3/4)
[Su+,20a] Yi Su, Maria Dimakopoulou, Akshay Krishnamurthy, and Miroslav Dudík. “Doubly Robust Off-policy Evaluation with Shrinkage.” ICML, 2020. https://arxiv.org/abs/1907.09623
[Su+,20b] Yi Su, Pavithra Srinath, and Akshay Krishnamurthy. “Adaptive Estimator Selection for Off-Policy Evaluation.” ICML, 2020. https://arxiv.org/abs/2002.07729
[Tucker&Lee,21] George Tucker and Jonathan Lee. “Improved Estimator Selection for Off-Policy Evaluation.” 2021. https://lyang36.github.io/icml2021_rltheory/camera_ready/79.pdf
[Narita+,21] Yusuke Narita, Shota Yasui, and Kohei Yata. “Debiased Off-Policy Evaluation for Recommendation Systems.” RecSys, 2021. https://arxiv.org/abs/2002.08536

References (4/4)
[Saito+,21a] Yuta Saito, Shunsuke Aihara, Megumi Matsutani, and Yusuke Narita. “Open Bandit Dataset and Pipeline: Towards Realistic and Reproducible Off-Policy Evaluation.” NeurIPS dataset&benchmark, 2021. https://arxiv.org/abs/2008.07146
[Saito+,21b] Yuta Saito, Takuma Udagawa, Haruka Kiyohara, Kazuki Mogi, Yusuke Narita, and Kei Tateno. “Evaluating the Robustness of Off-Policy Evaluation.” RecSys, 2021. https://arxiv.org/abs/2108.13703
