Policy-Adaptive Estimator Selection for Off-Policy Evaluation

Haruka Kiyohara
September 22, 2022

AAAI2023
arXiv: https://arxiv.org/abs/2211.13904

CONSEQUENCES+REVEAL WS @ RecSys2022 (Day2, CONSEQUENCES)
About WS: https://sites.google.com/view/consequences2022

CFML Study Group #7
https://cfml.connpass.com/event/264017/

RecSys Paper Reading Group 2022
https://connpass.com/event/261571/

Transcript

  1. Policy Adaptive Estimator Selection for Off-Policy Evaluation

    Takuma Udagawa, Haruka Kiyohara, Yusuke Narita, Yuta Saito, Kei Tateno
    Haruka Kiyohara, Tokyo Institute of Technology
    https://sites.google.com/view/harukakiyohara
    September 2022, Policy Adaptive Estimator Selection (PAS-IF)
  2. Content

    • Introduction to Off-Policy Evaluation (OPE)
    • Estimator Selection for OPE
    • Our proposal: Policy-Adaptive Estimator Selection via Importance Fitting (PAS-IF)
    • Synthetic Experiments
      • Estimator Selection
      • Policy Selection
  3. Interactions in recommender systems

    A behavior policy π_b interacts with users and collects logged data (logged bandit feedback): for each incoming user (context), it chooses an item (action) and observes the user's feedback (reward).
  5. Off-Policy Evaluation

    The goal is to evaluate the performance of an evaluation policy π_e offline, as an alternative to an A/B test: an OPE estimator takes the logged bandit feedback collected by the behavior policy π_b and returns an estimate of the policy performance.
  6. Representative OPE estimators

    We aim to reduce both bias and variance to enable an accurate OPE.

    Estimator                                             | model-based | importance sampling-based | bias     | variance
    Direct Method (DM) [Beygelzimer&Langford,09]          | ✓           | ---                       | high     | low
    Inverse Propensity Scoring (IPS) [Precup+,00] [Strehl+,10] | ---    | ✓                         | unbiased | very high
    Doubly Robust (DR) [Dudík+,14]                        | ✓           | ✓                         | unbiased | lower than IPS, but still high

    Formula annotations: reward predictor (DM); importance weight, i.e., evaluation policy over behavior policy (IPS); control variate (DR).
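The bias-variance pattern in the table above can be illustrated with a minimal sketch of the three estimators. The formulas are the standard textbook forms of DM, IPS, and DR; the toy problem (two contexts, two actions, the tables `TRUE_Q`, `Q_HAT`, `PI_B`, `PI_E`) is entirely hypothetical and is not taken from the deck.

```python
import random

# Toy problem: every number below is an illustrative assumption.
TRUE_Q = [[0.2, 0.6], [0.7, 0.3]]   # true expected reward q(x, a)
Q_HAT = [[0.3, 0.5], [0.6, 0.4]]    # deliberately imperfect reward predictor
PI_B = [[0.8, 0.2], [0.5, 0.5]]     # behavior policy pi_b(a | x)
PI_E = [[0.1, 0.9], [0.9, 0.1]]     # evaluation policy pi_e(a | x)


def simulate_logged_data(n, seed=0):
    """Collect n tuples (context, action, reward) by running pi_b."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        x = rng.randrange(2)                        # uniform binary context
        a = 0 if rng.random() < PI_B[x][0] else 1   # action drawn from pi_b
        r = 1.0 if rng.random() < TRUE_Q[x][a] else 0.0
        data.append((x, a, r))
    return data


def model_value(x):
    """Expected predicted reward of pi_e in context x."""
    return sum(PI_E[x][a] * Q_HAT[x][a] for a in range(2))


def direct_method(data):
    """DM: rely on the reward predictor only."""
    return sum(model_value(x) for x, _, _ in data) / len(data)


def ips(data):
    """IPS: reweight observed rewards by pi_e / pi_b."""
    return sum(PI_E[x][a] / PI_B[x][a] * r for x, a, r in data) / len(data)


def doubly_robust(data):
    """DR: DM baseline plus an importance-weighted residual correction."""
    total = 0.0
    for x, a, r in data:
        w = PI_E[x][a] / PI_B[x][a]
        total += model_value(x) + w * (r - Q_HAT[x][a])
    return total / len(data)


# Ground truth is available only because the toy problem is fully known.
TRUE_VALUE = sum(0.5 * PI_E[x][a] * TRUE_Q[x][a] for x in range(2) for a in range(2))

data = simulate_logged_data(20000)
print("true value:", TRUE_VALUE)
print("DM :", direct_method(data))
print("IPS:", ips(data))
print("DR :", doubly_robust(data))
```

In this toy setup DM inherits the error of `Q_HAT`, while IPS and DR stay close to the true value at the price of larger sample-to-sample fluctuation.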
  12. Advanced OPE estimators

    To reduce the variance of IPS/DR, many OPE estimators have been proposed (modification on importance weights):
    • Self-Normalized (IPS/DR) [Swaminathan&Joachims,15]
    • Clipped (IPS/DR) * [Su+,20a]
    • Switch (DR) * [Wang+,17]
    • Optimistic Shrinkage (DR) * [Su+,20a]
    • Subgaussian (IPS/DR) * [Metelli+,21]
    * requires hyperparameter tuning of λ, e.g., SLOPE [Su+,20b] [Tucker&Lee,21]
    Which OPE estimator should be used to enable an accurate OPE?
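Two of the weight modifications listed above can be sketched in a few lines. This is a minimal illustration using the common definitions (clipping as `min(w, lam)`, self-normalization as division by the sum of weights); the function names and the toy weights and rewards are hypothetical.

```python
def ips(weights, rewards):
    """Vanilla IPS: average of importance-weighted rewards."""
    return sum(w * r for w, r in zip(weights, rewards)) / len(rewards)


def clipped_ips(weights, rewards, lam):
    """Clipped IPS: cap every importance weight at the hyperparameter lam."""
    return sum(min(w, lam) * r for w, r in zip(weights, rewards)) / len(rewards)


def self_normalized_ips(weights, rewards):
    """Self-normalized IPS: divide by the sum of weights instead of n."""
    return sum(w * r for w, r in zip(weights, rewards)) / sum(weights)


# Hypothetical sample with one large importance weight and binary rewards.
weights = [0.5, 0.5, 4.0, 1.0]
rewards = [1.0, 0.0, 1.0, 1.0]

print(ips(weights, rewards))                   # 1.375, outside the reward range [0, 1]
print(clipped_ips(weights, rewards, 2.0))      # 0.875
print(self_normalized_ips(weights, rewards))   # about 0.917
```

The single weight of 4.0 pushes vanilla IPS above the maximum possible reward; both modifications trade a little bias for a more stable estimate, and the clipped variant needs `lam` to be tuned.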
  14. Motivation towards data-driven estimator selection

    Estimator selection is important, but the best estimator can differ across situations: the data size, the evaluation policy, and the reward noise all matter.
    (Figure labels: π_b, π_e, "among the best", "among the worst")
    Estimator Selection: How can we identify the most accurate OPE estimator using only the available logged data?
  20. Objective for estimator selection

    The goal is to identify the most accurate OPE estimator in terms of MSE.
    Formula annotations: true policy value (estimand); estimated from the logged data.
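The formula itself did not survive in the transcript. A plausible reading, assuming the standard definition of the MSE of an OPE estimator and borrowing the symbols m̂ (selected) and m* (true best) from the results slide, is sketched below; the candidate set \mathcal{M} and the exact notation are assumptions.

```latex
m^{*} \in \arg\min_{m \in \mathcal{M}} \mathrm{MSE}\big(\hat{V}_m\big),
\qquad
\mathrm{MSE}\big(\hat{V}_m\big)
  := \mathbb{E}_{\mathcal{D}}\Big[\big(V(\pi_e) - \hat{V}_m(\pi_e; \mathcal{D})\big)^{2}\Big],
\qquad
\hat{m} \in \arg\min_{m \in \mathcal{M}} \widehat{\mathrm{MSE}}\big(\hat{V}_m; \mathcal{D}\big)
```

Here V(\pi_e) is the true policy value (the estimand), which is unknown in practice, so the selection has to rely on an MSE estimate computed from the logged data.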
  23. Baseline – non-adaptive heuristic [Saito+,21a] [Saito+,21b]

    Suppose we have logged data from previous A/B tests.
    Formula annotations: pseudo-evaluation policy; OPE estimate; on-policy policy value.
    ※ S is a set of random states for bootstrapping.
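Based on the annotations above, the heuristic can be sketched as follows: treat one A/B-test arm (π_A) as the pseudo-evaluation policy, compute its on-policy value from its own data, and compare it with OPE estimates on bootstrapped data from the other arm (π_B). This is a minimal sketch under those assumptions, not the authors' implementation; the function names and toy logs are hypothetical.

```python
import random


def ips(samples):
    """IPS estimate from (importance_weight, reward) pairs."""
    return sum(w * r for w, r in samples) / len(samples)


def self_normalized_ips(samples):
    """Self-normalized IPS from (importance_weight, reward) pairs."""
    return sum(w * r for w, r in samples) / sum(w for w, _ in samples)


def estimate_mse(estimator, logged_b, rewards_a, random_states):
    """Bootstrap MSE estimate of `estimator` for the pseudo-evaluation policy pi_A.

    logged_b:  (weight, reward) pairs logged under pi_B, weight = pi_A(a|x) / pi_B(a|x)
    rewards_a: rewards logged under pi_A itself (used for the on-policy value)
    random_states: the set S of seeds, one bootstrap resample per seed
    """
    v_on = sum(rewards_a) / len(rewards_a)
    squared_errors = []
    for seed in random_states:
        rng = random.Random(seed)
        resampled = [rng.choice(logged_b) for _ in logged_b]  # sample with replacement
        squared_errors.append((v_on - estimator(resampled)) ** 2)
    return sum(squared_errors) / len(squared_errors)


# Hypothetical toy logs.
logged_b = [(0.5, 1.0), (0.5, 0.0), (4.0, 1.0), (1.0, 1.0), (0.5, 0.0), (1.5, 1.0)]
rewards_a = [1.0, 1.0, 0.0, 1.0]

candidates = {"IPS": ips, "SNIPS": self_normalized_ips}
mse = {name: estimate_mse(f, logged_b, rewards_a, range(20)) for name, f in candidates.items()}
selected = min(mse, key=mse.get)
print(mse, "->", selected)
```

Note that the selected estimator depends only on π_A and π_B, not on the evaluation policy of the OPE task at hand.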
  27. Does non-adaptive heuristic work?

    Do these estimators really work well?
    (Figure: non-adaptive heuristic (estimation) V̂(π_A; D_B) vs. true performance; policies π_b and π_A)
    Non-adaptive heuristic does not consider the difference among OPE tasks.
    How can we choose OPE estimators adaptively to the given OPE task (e.g., evaluation policy)?
  30. Is it possible to make pseudo-policies adaptive?

    Non-adaptive heuristic calculates MSE using two datasets collected by A/B tests: out of the total amount of logged data (~π_b), the data collected by π_B (~π_B) serves the pseudo-behavior policy (behavior) and the data collected by π_A (~π_A) serves the pseudo-evaluation policy (evaluation).
    We aim to split the logged dataset adaptively to the given OPE task, into pseudo-behavior data (~π̃_b) and pseudo-evaluation data (~π̃_e).
  33. Subsampling function controls the pseudo-policies

    We now introduce a subsampling function ρ, which splits the total amount of logged data (~π_b) into data for the pseudo-behavior policy (~π̃_b) and data for the pseudo-evaluation policy (~π̃_e).
  35. How to optimize the subsampling function?

    PAS-IF optimizes ρ to reproduce the bias-variance tradeoff of the original OPE.
    Formula labels: objective of importance fitting; subsampling function.
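The objective formula is missing from the transcript, so the following is only a sketch of one plausible reading of importance fitting. It assumes that each logged sample goes to the pseudo-evaluation set with probability ρ(a), that the induced pseudo-policies are π̃_e ∝ π_b·ρ and π̃_b ∝ π_b·(1−ρ), and that ρ is fitted so the pseudo importance weight π̃_e/π̃_b matches the original π_e/π_b, with a data-size penalty as mentioned on the appendix slide. The single-context setting, the squared-error loss, the penalty form, and the constants `K` and `LAM` are all assumptions.

```python
import math

PI_B = [0.5, 0.3, 0.2]   # behavior policy over 3 actions (single context, hypothetical)
PI_E = [0.2, 0.3, 0.5]   # evaluation policy of the original OPE task (hypothetical)
K, LAM = 0.2, 1.0        # assumed target share of pseudo-evaluation data, penalty weight


def rho(theta):
    """Subsampling rule: probability that a sample with action a joins the pseudo-evaluation set."""
    return [1.0 / (1.0 + math.exp(-t)) for t in theta]


def pseudo_policies(theta):
    """Policies induced by splitting pi_b data with rho (assumed construction)."""
    p = rho(theta)
    z_e = sum(b * q for b, q in zip(PI_B, p))   # expected share sent to pseudo-evaluation
    z_b = 1.0 - z_e
    pi_e_tilde = [b * q / z_e for b, q in zip(PI_B, p)]
    pi_b_tilde = [b * (1.0 - q) / z_b for b, q in zip(PI_B, p)]
    return pi_e_tilde, pi_b_tilde


def loss(theta):
    """Squared gap between original and pseudo importance weights, plus a size penalty."""
    pi_e_tilde, pi_b_tilde = pseudo_policies(theta)
    fit_term = sum(b * (e / b - et / bt) ** 2
                   for b, e, et, bt in zip(PI_B, PI_E, pi_e_tilde, pi_b_tilde))
    share = sum(b * q for b, q in zip(PI_B, rho(theta)))
    return fit_term + LAM * (share - K) ** 2


def numerical_grad(theta, eps=1e-6):
    """Central finite-difference gradient (keeps the sketch dependency-free)."""
    grad = []
    for i in range(len(theta)):
        up = theta[:i] + [theta[i] + eps] + theta[i + 1:]
        down = theta[:i] + [theta[i] - eps] + theta[i + 1:]
        grad.append((loss(up) - loss(down)) / (2 * eps))
    return grad


def importance_fitting(theta, lr=0.05, steps=500):
    """Gradient descent with step halving whenever a step fails to reduce the loss."""
    for _ in range(steps):
        g = numerical_grad(theta)
        candidate = [t - lr * gi for t, gi in zip(theta, g)]
        if loss(candidate) < loss(theta):
            theta = candidate
        else:
            lr *= 0.5
    return theta


theta0 = [0.0, 0.0, 0.0]
theta = importance_fitting(theta0)
pi_e_tilde, pi_b_tilde = pseudo_policies(theta)
print("original weights:", [e / b for e, b in zip(PI_E, PI_B)])
print("pseudo weights  :", [et / bt for et, bt in zip(pi_e_tilde, pi_b_tilde)])
```

Under these assumptions, the fitted ρ sends samples of actions favored by π_e to the pseudo-evaluation set more often, so the pseudo OPE task mimics the distribution shift of the original one.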
  38. Key contribution of PAS-IF

    PAS-IF enables MSE estimation that is:
    • Data-driven: by splitting the logged data into pseudo datasets
    • Adaptive: by optimizing the subsampling function to simulate the distribution shift of the original OPE task
    -> Accurate estimator selection!
  39. Experimental settings

    We compare PAS-IF and non-adaptive heuristic in two tasks.
    1. Estimator Selection
    2. Policy Selection using the selected estimator
    Task 1 (Estimator Selection): hyperparameter tuning*, then estimator selection.
    * SLOPE [Su+,20b] [Tucker&Lee,21]
  41. PAS-IF enables an accurate estimator selection

    PAS-IF enables far more accurate estimator selection by being adaptive.
    PAS-IF is accurate across various evaluation policies.
    (Figure: lower is better; panels for behavior policies π_b1 and π_b2; m̂ = selected estimator, m* = true best estimator)
  42. Experimental settings

    We compare PAS-IF and non-adaptive heuristic in two tasks.
    1. Estimator Selection
    2. Policy Selection using the selected estimator
    Task 2 (Policy Selection): PAS-IF uses a different estimator for each policy (V̂_1, V̂_2, V̂_3), whereas non-adaptive heuristic uses a universal estimator V̂ for all policies.
  44. Moreover, PAS-IF also benefits policy selection

    PAS-IF also shows favorable results in the policy selection task.
    PAS-IF can identify better policies among many candidates by using a different (appropriate) estimator for each policy!
    (Figure: lower is better; π̂ = selected policy, π* = true best policy)
  45. Summary

    • Estimator Selection is important to enable an accurate OPE.
    • Non-adaptive heuristic fails to be adaptive to the given OPE task.
    • PAS-IF enables an adaptive and accurate estimator selection by subsampling and optimizing the pseudo OPE datasets.
    PAS-IF will help identify an accurate OPE estimator in practice!
  46. Thank you for listening!

    Feel free to ask any questions, and discussions are welcome!
  47. Example case of importance fitting

    When we have ⇒ PAS-IF can produce a similar distribution shift!
    Note: the simplified case of .
  48. Detailed optimization procedure of PAS-IF

    We optimize the subsampling rule ρ_θ via gradient descent.
    To keep the data size similar to that of the original OPE task, PAS-IF also imposes a regularization on the data size.
    We tune λ so that .
  49. Key idea of PAS-IF

    How about sampling the logged data and constructing a pseudo-evaluation policy that has a bias-variance tradeoff similar to the given OPE task?
    (S is a set of random states for bootstrapping)
  50. References (1/4)

    [Beygelzimer&Langford,09] Alina Beygelzimer and John Langford. “The Offset Tree for Learning with Partial Labels.” KDD, 2009.
    [Precup+,00] Doina Precup, Richard S. Sutton, and Satinder Singh. “Eligibility Traces for Off-Policy Policy Evaluation.” ICML, 2000. https://scholarworks.umass.edu/cgi/viewcontent.cgi?article=1079&context=cs_faculty_pubs
    [Strehl+,10] Alex Strehl, John Langford, Sham Kakade, and Lihong Li. “Learning from Logged Implicit Exploration Data.” NeurIPS, 2010. https://arxiv.org/abs/1003.0120
    [Dudík+,14] Miroslav Dudík, Dumitru Erhan, John Langford, and Lihong Li. “Doubly Robust Policy Evaluation and Optimization.” Statistical Science, 2014. https://arxiv.org/abs/1503.02834
  51. References (2/4)

    [Swaminathan&Joachims,15] Adith Swaminathan and Thorsten Joachims. “The Self-Normalized Estimator for Counterfactual Learning.” NeurIPS, 2015. https://dl.acm.org/doi/10.5555/2969442.2969600
    [Wang+,17] Yu-Xiang Wang, Alekh Agarwal, and Miroslav Dudík. “Optimal and Adaptive Off-policy Evaluation in Contextual Bandits.” ICML, 2017. https://arxiv.org/abs/1612.01205
    [Metelli+,21] Alberto M. Metelli, Alessio Russo, and Marcello Restelli. “Subgaussian and Differentiable Importance Sampling for Off-Policy Evaluation and Learning.” NeurIPS, 2021. https://proceedings.neurips.cc/paper/2021/hash/4476b929e30dd0c4e8bdbcc82c6ba23a-Abstract.html
  52. References (3/4)

    [Su+,20a] Yi Su, Maria Dimakopoulou, Akshay Krishnamurthy, and Miroslav Dudík. “Doubly Robust Off-policy Evaluation with Shrinkage.” ICML, 2020. https://arxiv.org/abs/1907.09623
    [Su+,20b] Yi Su, Pavithra Srinath, and Akshay Krishnamurthy. “Adaptive Estimator Selection for Off-Policy Evaluation.” ICML, 2020. https://arxiv.org/abs/2002.07729
    [Tucker&Lee,21] George Tucker and Jonathan Lee. “Improved Estimator Selection for Off-Policy Evaluation.” 2021. https://lyang36.github.io/icml2021_rltheory/camera_ready/79.pdf
    [Narita+,21] Yusuke Narita, Shota Yasui, and Kohei Yata. “Debiased Off-Policy Evaluation for Recommendation Systems.” RecSys, 2021. https://arxiv.org/abs/2002.08536
  53. References (4/4)

    [Saito+,21a] Yuta Saito, Shunsuke Aihara, Megumi Matsutani, and Yusuke Narita. “Open Bandit Dataset and Pipeline: Towards Realistic and Reproducible Off-Policy Evaluation.” NeurIPS Datasets and Benchmarks, 2021. https://arxiv.org/abs/2008.07146
    [Saito+,21b] Yuta Saito, Takuma Udagawa, Haruka Kiyohara, Kazuki Mogi, Yusuke Narita, and Kei Tateno. “Evaluating the Robustness of Off-Policy Evaluation.” RecSys, 2021. https://arxiv.org/abs/2108.13703