Best Arm Identification with a Fixed Budget under a Small Gap

Allied Social Sciences Association (ASSA) 2023 Annual Meeting

Transcript

  1. Best Arm Identification with a Fixed Budget under a Small Gap
    Masahiro Kato (CyberAgent, Inc. / University of Tokyo)
    Session: Adaptive Experimental Design for Policy Choice and Policy Learning
    Kato, Ariu, Imaizumi, Nomura, and Qin (2022), https://arxiv.org/abs/2201.04469.
  2. Experimental Design for Better Decision-Making
    ► Keywords: causal inference, decision-making, and experimental design.
    ■ Treatment arm (arm / treatment / policy). ex. drugs, advertisements, and economic policies.
      Each treatment arm has a potential outcome. By drawing an arm, we can observe the outcome.
    ► Economics: from treatment effect estimation to treatment choice (decision-making).
      • Treatment (policy) choice: choose the best treatment arm (policy) using observations.
        cf. Manski (2000, 2002, 2004), Stoye (2009, 2012), Manski and Tetenov (2016).
    ► Multi-armed bandit problem: optimize decision-making via adaptive experiments.
      • Regret minimization: choose the treatment arms to maximize the cumulative reward during the experiment. cf. Gittins (1979), and Lai and Robbins (1985).
      • Best arm identification (BAI): choose the best treatment arm after the experiment. cf. Bubeck et al. (2011), Kaufmann et al. (2016), and Kasy and Sautmann (2021).
      cf. Average treatment effect estimation via adaptive experimental design: van der Laan (2008), Hahn, Hirano, and Karlan (2011).
  3. BAI with a Fixed Budget
    ► Consider an adaptive experiment in which we can draw a treatment arm in each round.
      Drawing a treatment arm = allocating the arm to an experimental unit and observing the realized outcome.
    ► BAI with a fixed budget.
      • The number of rounds of the adaptive experiment (budget / sample size) is predetermined.
      • Recommend the best arm from multiple candidates after the experiment.
      ↔ BAI with fixed confidence: continue the adaptive experiment until a certain criterion is satisfied. cf. sequential experiments.
    ► Evaluation metrics:
      • Probability of misidentifying the best treatment arm.
      • Expected simple regret (difference between the expected outcomes of the best and suboptimal arms).
    ■ Goal: recommend the best arm with a smaller probability of misidentification or expected simple regret.
  4. Contents
    ■ Discuss asymptotically optimal algorithms for BAI with a fixed budget.
      • Consider a case with two treatment arms. ex. treatment and control groups.
      • Potential outcomes follow Gaussian distributions.
      • Minimization of the probability of misidentification.
    ■ Neyman allocation rule:
      • Draw the treatment arms with the ratio of the standard deviations of the potential outcomes.
      • When the standard deviations are known, the Neyman allocation is known to be optimal.
        cf. Chen et al. (2000), Glynn and Juneja (2004), and Kaufmann et al. (2016).
  5. Contents
    ► Today, I introduce our paper: Kato, Ariu, Imaizumi, Nomura, and Qin (2022), "Optimal Best Arm Identification in Two-Armed Bandits with a Fixed Budget under a Small Gap."*
      • The standard deviations are unknown and estimated in an adaptive experiment.
      • The worst-case asymptotic optimality of the Neyman allocation rule.
    ■ In addition to the above paper, I introduce several new findings in our project.
      • Beyond the Neyman allocation rule.
      • Minimization of the expected simple regret.
    * https://arxiv.org/abs/2201.04469
  6. Optimal Best Arm Identification in Two-Armed Bandits with a Fixed Budget under a Small Gap
    Kato, Ariu, Imaizumi, Nomura, and Qin (2022)
  7. Problem Setting
    ■ Adaptive experiment with T rounds: [T] = {1, 2, …, T}.
    ■ Binary treatment arms: {1, 0}.
      • Each treatment arm a ∈ {1, 0} has a potential outcome Y_a ∈ ℝ. The distributions of (Y_1, Y_0) do not change across rounds, and Y_1 and Y_0 are independent.
      • At round t, by drawing a treatment arm a ∈ {1, 0}, we observe Y_{a,t}, which is an i.i.d. copy of Y_a.
    ► Definition: two-armed Gaussian bandit models.
      • A class ℳ of joint distributions ν (bandit models) of (Y_1, Y_0).
      • Under ν ∈ ℳ, Y_1 and Y_0 follow Gaussian distributions 𝒩(μ_1, σ_1²) and 𝒩(μ_0, σ_0²).
      • σ_a² is the variance of the potential outcome Y_a, which is fixed across bandit models ν ∈ ℳ.
  8. Problem Setting
    [Figure: in each round t = 1, …, T, the experimenter draws an arm A_t from {Arm 1, Arm 2} and observes the outcome Y_{A_t,t}.]
    ■ Best treatment arm: the arm with the highest expected outcome, a* = argmax_{a ∈ {1,0}} μ_a. For simplicity, we assume that the best arm is unique.
    ■ Bandit process: in each round t ∈ {1, 2, …, T}, under a bandit model ν ∈ ℳ,
      • Draw a treatment arm A_t ∈ {1, 0}.
      • Observe the outcome Y_{A_t,t} of the chosen arm A_t.
      • Stop the trial at round t = T.
      • After the final round T, the algorithm recommends an estimated best treatment arm â_T ∈ {1, 0}.
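
    A minimal Python sketch of the two-armed Gaussian bandit environment described above. The class name and interface are illustrative assumptions, not from the paper; the example parameter values reuse Setting 1 from the simulation slide.

        import numpy as np

        class TwoArmedGaussianBandit:
            """Two-armed Gaussian bandit model: arm a in {1, 0} yields Y_a ~ N(mu_a, sigma_a^2)."""

            def __init__(self, mu, sigma, seed=0):
                self.mu = mu          # dict: arm -> mean outcome mu_a
                self.sigma = sigma    # dict: arm -> standard deviation sigma_a
                self.rng = np.random.default_rng(seed)

            def pull(self, arm):
                """Draw arm A_t and observe an i.i.d. copy of Y_{A_t}."""
                return self.rng.normal(self.mu[arm], self.sigma[arm])

        # Example bandit model nu with best arm a* = 1 (values from Setting 1 of the simulation slide).
        env = TwoArmedGaussianBandit(mu={1: 0.05, 0: 0.01}, sigma={1: 1.0, 0: np.sqrt(0.2)})
        y = env.pull(1)   # one observed outcome of arm 1
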
  9. Best Arm Identification (BAI) Strategy
    ■ Goal: find the best treatment arm a* efficiently with respect to some performance metric.
    ■ Our actions: using past observations, we can optimize A_t during the bandit process.
    ■ These actions are the components of an algorithm for BAI, called a BAI strategy.
      • Sampling rule (A_1, A_2, …): how we draw a treatment arm in each round t.
      • Recommendation rule â_T ∈ {1, 0}: which treatment arm we recommend as the best arm.
      In other settings, the stopping time (budget) of the adaptive experiment can also be a random variable, as in sequential testing.
  10. Contributions
    ■ Let ℙ_ν and 𝔼_ν be the probability law and expectation under a bandit model ν ∈ ℳ.
    ■ Probability of misidentification ℙ_ν(â_T ≠ a*) = the probability of the event that we recommend a suboptimal arm instead of the best arm.
    ► Main result of Kato, Ariu, Imaizumi, Nomura, and Qin (2022).
    ■ Optimal strategy for minimization of the probability of misidentification under a small gap.
      • Propose a strategy using the Neyman allocation rule and the AIPW estimator.
      • Consider a lower bound on ℙ_ν(â_T ≠ a*) that no strategy can beat.
      • The probability of misidentification matches the lower bound when μ_1 − μ_0 → 0.
  11. Probability of Misidentification
    ■ Assume that the best arm a* is unique.
      • ℙ_ν(â_T ≠ a*) converges to 0 at an exponential speed: ℙ_ν(â_T ≠ a*) = exp(−T(⋆)) for a constant (⋆).
    ► Consider evaluating the term (⋆) by lim sup_{T→∞} −(1/T) log ℙ_ν(â_T ≠ a*).
      • A performance lower (upper) bound on ℙ_ν(â_T ≠ a*) is an upper (lower) bound on lim sup_{T→∞} −(1/T) log ℙ_ν(â_T ≠ a*). cf. Kaufmann et al. (2016).
    ■ Large deviation analysis: tight evaluation of ℙ_ν(â_T ≠ a*).
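
    As a rough illustration of this large-deviation evaluation, the sketch below estimates −(1/T) log ℙ(â_T ≠ a*) by Monte Carlo for the uniform-sampling baseline, using the exact Gaussian law of the two sample means. All parameter values are assumptions chosen for illustration, not from the paper.

        import numpy as np

        def mc_exponent(mu, sigma, T, n_trials=2000, seed=0):
            """Monte Carlo estimate of -(1/T) log P(hat a_T != a*) for uniform sampling
            + empirical-best recommendation (illustrative baseline, not the paper's strategy)."""
            rng = np.random.default_rng(seed)
            n = T // 2                                    # each arm is drawn T/2 times
            errors = 0
            for _ in range(n_trials):
                mean1 = rng.normal(mu[1], sigma[1] / np.sqrt(n))   # sample mean of arm 1
                mean0 = rng.normal(mu[0], sigma[0] / np.sqrt(n))   # sample mean of arm 0
                errors += mean1 <= mean0                  # misidentification when a* = 1
            p_mis = max(errors / n_trials, 1 / n_trials)  # avoid log(0)
            return -np.log(p_mis) / T

        print(mc_exponent(mu={1: 0.3, 0: 0.0}, sigma={1: 1.0, 0: 1.0}, T=200))
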
  12. Lower Bound
    ■ Kaufmann et al. (2016) give a lower bound for two-armed Gaussian bandit models.
      • To derive a lower bound, we restrict the class of strategies.
    ► Definition: consistent strategy.
      • A strategy is called consistent for a class ℳ if, for each ν ∈ ℳ, ℙ_ν(â_T = a*) → 1.
    Lower bound (Theorem 12 in Kaufmann et al., 2016):
      • For any bandit model ν ∈ ℳ, any consistent strategy satisfies
        lim sup_{T→∞} −(1/T) log ℙ_ν(â_T ≠ a*) ≤ Δ² / (2(σ_1 + σ_0)²),
        where Δ = μ_1 − μ_0 is the gap between the expected outcomes.
    ■ No strategy can exceed this convergence rate of the probability of misidentification.
      A lower bound on the probability of misidentification ℙ_ν(â_T ≠ a*) is an upper bound on −(1/T) log ℙ_ν(â_T ≠ a*).
    ► Optimal strategy: a strategy under which ℙ_ν(â_T ≠ a*) matches the lower bound.
  13. Neyman Allocation Rule
    ■ Target allocation ratio.
      • The ratio of the expected numbers of arm draws (1/T) 𝔼_ν[Σ_{t=1}^T 1[A_t = a]] under a sampling rule:
        (1/T) 𝔼_ν[Σ_{t=1}^T 1[A_t = a]] / Σ_{b ∈ {1,0}} (1/T) 𝔼_ν[Σ_{t=1}^T 1[A_t = b]].
    ► Neyman allocation rule.
      • The target allocation ratio is the ratio of the standard deviations:
        draw the treatment arms so that (1/T) 𝔼_ν[Σ_{t=1}^T 1[A_t = 1]] : (1/T) 𝔼_ν[Σ_{t=1}^T 1[A_t = 0]] = σ_1 : σ_0.
    ■ When the standard deviations σ_1 and σ_0 are known, the Neyman allocation is optimal. cf. Glynn and Juneja (2004), and Kaufmann et al. (2016).
    ► An optimal strategy is unknown when the standard deviations are unknown.
    ■ In our strategy, we estimate (σ_1, σ_0) and draw arm a with probability σ̂_a / (σ̂_1 + σ̂_0).
  14. NA-AIPW Strategy
    ■ Proposed strategy: NA-AIPW strategy.
      • NA: sampling rule following the Neyman allocation rule.
      • AIPW: recommendation rule using an Augmented Inverse Probability Weighting (AIPW) estimator.
    ► Procedure of the NA-AIPW strategy:
      1. In each round t ∈ [T], estimate σ_a² using the observations obtained until round t.
      2. Draw treatment arm a ∈ {1, 0} with probability ŵ_t(a) = σ̂_{a,t} / (σ̂_{1,t} + σ̂_{0,t}) (Neyman allocation rule).
      3. In round T, estimate μ_a using the AIPW estimator:
         μ̂_{a,T}^{AIPW} = (1/T) Σ_{t=1}^T [ 1[A_t = a](Y_{a,t} − μ̂_{a,t}) / ŵ_t(a) + μ̂_{a,t} ],
         where μ̂_{a,t} = (Σ_{s=1}^t 1[A_s = a])⁻¹ Σ_{s=1}^t 1[A_s = a] Y_{a,s} is an estimator of μ_a using the observations until round t.
      4. Recommend â_T^{AIPW} = argmax_{a ∈ {1,0}} μ̂_{a,T}^{AIPW} as the estimated best treatment arm.
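
    A minimal sketch of the NA-AIPW procedure above, reusing the TwoArmedGaussianBandit sketch from earlier. The initialization of the estimates (zero means and unit standard deviations before enough data is collected) is an illustrative assumption, not part of the paper's specification.

        import numpy as np

        def na_aipw(env, T, arms=(1, 0), seed=0):
            """Sketch of the NA-AIPW strategy: Neyman-allocation sampling + AIPW recommendation."""
            rng = np.random.default_rng(seed)
            obs = {a: [] for a in arms}     # outcomes observed for each arm so far
            terms = {a: [] for a in arms}   # per-round AIPW terms for each arm

            for t in range(T):
                # Plug-in estimates of mu_a and sigma_a from past observations.
                mu_hat = {a: (np.mean(obs[a]) if obs[a] else 0.0) for a in arms}
                sd_hat = {a: (np.std(obs[a]) if len(obs[a]) > 1 else 1.0) for a in arms}
                # Neyman allocation probability: w_t(a) = sigma_hat_{a,t} / (sigma_hat_{1,t} + sigma_hat_{0,t}).
                w = {a: sd_hat[a] / (sd_hat[arms[0]] + sd_hat[arms[1]]) for a in arms}

                # Draw A_t according to w_t and observe the outcome.
                a_t = arms[0] if rng.random() < w[arms[0]] else arms[1]
                y_t = env.pull(a_t)
                obs[a_t].append(y_t)

                # AIPW term: 1[A_t = a](Y_{a,t} - mu_hat_{a,t}) / w_t(a) + mu_hat_{a,t}.
                for a in arms:
                    ipw = (y_t - mu_hat[a]) / w[a] if a == a_t else 0.0
                    terms[a].append(ipw + mu_hat[a])

            mu_aipw = {a: np.mean(terms[a]) for a in arms}   # AIPW estimates of mu_a
            return max(arms, key=lambda a: mu_aipw[a])       # recommended arm hat a_T
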
  15. Upper Bound and Asymptotic Optimality
    Theorem (Upper bound)
      • Assume some regularity conditions.
      • Suppose that the estimator ŵ_t converges to w* almost surely (at a certain rate).
      • Then, for any ν ∈ ℳ such that 0 < μ_1 − μ_0 ≤ C for some constant C > 0, the upper bound is
        lim sup_{T→∞} −(1/T) log ℙ_ν(â_T^{AIPW} ≠ a*) ≥ Δ² / (2(σ_1 + σ_0)²) − C̄(Δ³ + Δ⁴),
        where C̄ is some constant.
    • This result implies that lim_{Δ→0} lim sup_{T→∞} −(1/(Δ²T)) log ℙ_ν(â_T^{AIPW} ≠ a*) ≥ 1/(2(σ_1 + σ_0)²) − o(1).
    • Under a small-gap regime (Δ = μ_1 − μ_0 → 0), the upper and lower bounds match = the NA-AIPW strategy is asymptotically optimal under the small gap.
    Note: when potential outcomes follow Bernoulli distributions, an RCT (drawing each arm with probability 1/2) is approximately optimal (Kaufmann et al., 2016).
  16. On the Optimality under the Small Gap
    ► Asymptotically optimal strategy under a small gap.
      • This result implies the worst-case optimality of the proposed algorithm.
    ■ A technical reason for the small gap.
      • There is no optimal strategy when the gap is fixed and the standard deviations are unknown.
      ↔ When the standard deviations are known, the Neyman allocation is known to be optimal. cf. Chen et al. (2000), Glynn and Juneja (2004), Kaufmann et al. (2016), and Carpentier and Locatelli (2016).
    ■ Conjecture.
      • When the gap is small, we can ignore the estimation error of the standard deviations.
      ↑ The estimation error is negligible compared with the difficulty of identifying the best arm under the small gap.
    ✓ Optimality under a large gap (constant μ_1 − μ_0) is an open issue.
  17. Simulation Studies
    ► Empirical performance of the NA-AIPW (NA) strategy.
    ■ Compare the NA strategy with α-elimination (Alpha) and uniform sampling (Uniform).
      α-elimination is a strategy that uses the Neyman allocation when the standard deviations are known (Kaufmann et al., 2016).
      Uniform sampling draws each treatment arm with equal probability: a randomized controlled trial without adaptation.
      • Setting 1: μ_1 = 0.05, μ_0 = 0.01, σ_1² = 1, σ_0² = 0.2.
      • Setting 2: μ_1 = 0.05, μ_0 = 0.01, σ_1² = 1, σ_0² = 0.1.
      • Strategies using the Neyman allocation outperform the RCT.
    [Figures: probability of misidentification in Setting 1 and Setting 2.]
      We draw treatment arm 1 more often in Setting 2 than in Setting 1.
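
    An illustrative Monte Carlo check of the two settings above, reusing the TwoArmedGaussianBandit and na_aipw sketches from earlier. The budget T, the number of trials, and the seeding scheme are assumptions, not the paper's experimental protocol.

        import numpy as np

        settings = {
            "Setting 1": {"mu": {1: 0.05, 0: 0.01}, "sigma": {1: 1.0, 0: np.sqrt(0.2)}},
            "Setting 2": {"mu": {1: 0.05, 0: 0.01}, "sigma": {1: 1.0, 0: np.sqrt(0.1)}},
        }
        T, n_trials = 1000, 200
        for name, p in settings.items():
            errors = 0
            for trial in range(n_trials):
                env = TwoArmedGaussianBandit(p["mu"], p["sigma"], seed=trial)
                errors += na_aipw(env, T, seed=trial) != 1   # best arm is a* = 1
            print(name, "estimated P(misidentification):", errors / n_trials)
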
  18. Limitations of the Neyman Allocation Rule
    ► I briefly introduce my ongoing studies.
      • Several results are still conjectures and not yet published.
    ■ Limitations of the Neyman allocation rule.
      • It only covers the case with two treatment arms.
      • It does not use covariates (contextual information).
    ■ Consider extensions of the NA-AIPW strategy.
  19. Problem Setting with Multiple Treatment Arms and Contextual Information
    ■ K treatment arms: [K] = {1, 2, …, K}.
    ■ Covariate (context): a d-dimensional random variable X ∈ 𝒳 ⊂ ℝ^d. Side information observed in each round.
      • (Y_1, …, Y_K) are conditionally independent given X.
    ■ Let ν be a joint distribution of (Y_1, …, Y_K, X), called a bandit model.
      • Denote the mean outcome of treatment arm a by μ_a(ν) = 𝔼_ν[Y_{a,t}] and the conditional mean outcome by μ_a(ν)(x) = 𝔼_ν[Y_{a,t} | X_t = x].
    ■ Best treatment arm: the arm with the highest expected outcome, a*(ν) = argmax_{a ∈ [K]} μ_a(ν). When there are multiple best treatment arms, we denote the set by 𝒜*(ν).
    ■ The best arm is not conditioned on a covariate (context) X_t = x. ex. Average treatment effect estimation via adaptive experiments, such as Hahn, Hirano, and Karlan (2011).
  20. Problem Setting
    ■ Bandit process: in each round t ∈ {1, 2, …, T}, under a bandit model ν,
      • Observe a context (covariate) X_t ∈ 𝒳.
      • Draw a treatment arm A_t ∈ [K].
      • Observe the outcome Y_{A_t,t} of the chosen arm A_t.
      • Stop the trial at round t = T.
      • After the final round T, the algorithm recommends an estimated best treatment arm â_T ∈ [K].
    [Figure: in each round t = 1, …, T, the experimenter observes X_t, draws an arm A_t from {Arm 1, Arm 2, …, Arm K}, and observes the outcome Y_{A_t,t}.]
  21. Location-Shift Bandit Class
    ► Definition: location-shift bandit class.
    ■ We call a class 𝒫 of bandit models a location-shift bandit class if
      • For all ν ∈ 𝒫 and x ∈ 𝒳, the conditional variance of Y_{a,t} is constant across bandit models:
        for all a ∈ [K] and any x ∈ 𝒳, there exists a constant σ_a²(x) such that Var_ν(Y_{a,t} | X_t = x) = σ_a²(x) for all ν ∈ 𝒫.
      • For all ν ∈ 𝒫, X has the same distribution, whose density we denote by ζ(x).
      ex. Gaussian distributions with fixed variances.
  22. Class of Strategies
    ■ To derive lower bounds in the case with multiple treatment arms, we further restrict the class of BAI strategies in addition to consistent strategies.
    ► Definition: asymptotically invariant strategy.
      • A strategy is asymptotically invariant for 𝒫 if, for any ν, τ ∈ 𝒫, all a ∈ [K], and all x ∈ 𝒳,
        𝔼_ν[Σ_{t=1}^T 1[A_t = a] | X_t = x] = 𝔼_τ[Σ_{t=1}^T 1[A_t = a] | X_t = x].
    ✓ I conjecture that if the potential outcomes follow a particular distribution, such as Bernoulli distributions, such a restriction may not be necessary, and an RCT (uniform sampling) is optimal.
  23. Lower Bound under the Small Gap
    • Consider a location-shift bandit class 𝒫 and ν ∈ 𝒫.
    • Assume that the best treatment arm a*(ν) is unique.
    • Assume that there exists Δ > 0 such that, for all a ∈ [K], μ_{a*(ν)}(ν) − μ_a(ν) < Δ.
    Theorem (Lower bound under the small gap)
      For any ν in the location-shift class, any consistent and asymptotically invariant strategy satisfies:
      • if K = 2:
        lim sup_{T→∞} −(1/T) log ℙ_ν(â_T ≠ a*(ν)) ≤ Δ² / (2 ∫ (σ_1(x) + σ_2(x))² ζ(x) dx) + C_1 Δ³;
      • if K ≥ 3 and the strategy is invariant:
        lim sup_{T→∞} −(1/T) log ℙ_ν(â_T ≠ a*(ν)) ≤ Δ² / (2 Σ_{a ∈ [K]} ∫ σ_a²(x) ζ(x) dx) + C_2 Δ³,
      where C_1, C_2 > 0 are some constants.
  24. Target Allocation Ratio
    ► Target allocation ratio w*: [K] × 𝒳 → (0, 1) such that Σ_{a ∈ [K]} w*(a | x) = 1 for any x ∈ 𝒳.
      (Target allocation ratio: the ratio of the expected numbers of arm draws (1/T) 𝔼_ν[Σ_{t=1}^T 1[A_t = a] | X_t = x].)
    ■ The lower bound suggests drawing arm a with the following probability w*(a | X_t):
      • If K = 2, w*(1 | X_t) = σ_1(X_t) / (σ_1(X_t) + σ_2(X_t)) and w*(2 | X_t) = σ_2(X_t) / (σ_1(X_t) + σ_2(X_t)).
      • If K ≥ 3, w*(a | X_t) = σ_a²(X_t) / Σ_{b ∈ [K]} σ_b²(X_t).
    ► Beyond the Neyman allocation rule: when K ≥ 3, draw arms with the ratio of the variances.
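
    A small sketch of the target allocation ratio stated above, evaluated at a single context x; the function name and inputs are illustrative assumptions.

        import numpy as np

        def target_allocation(sigmas_at_x):
            """Target allocation w*(.|x) given the conditional standard deviations sigma_a(x)."""
            s = np.asarray(sigmas_at_x, dtype=float)
            if len(s) == 2:
                return s / s.sum()            # K = 2: proportional to the standard deviations
            return s**2 / (s**2).sum()        # K >= 3: proportional to the variances

        print(target_allocation([1.0, 0.5]))        # two arms   -> [2/3, 1/3]
        print(target_allocation([1.0, 0.5, 0.5]))   # three arms -> [2/3, 1/6, 1/6]
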
  25. RS-AIPW Strategy
    ■ Algorithm (strategy): RS-AIPW strategy. RS: random sampling of each treatment arm.
    ► Procedure of the contextual RS-AIPW strategy:
      1. In each round t ∈ [T], estimate σ_a(X_t).
      2. Using the estimators of σ_a(X_t), estimate w*.
      3. Draw a treatment arm with the estimated probability ŵ_t.
      4. In round T, estimate μ_a(ν) using the AIPW estimator:
         μ̂_{a,T}^{AIPW} = (1/T) Σ_{t=1}^T [ 1[A_t = a](Y_{a,t} − μ̂_{a,t}(X_t)) / ŵ_t(a | X_t) + μ̂_{a,t}(X_t) ],
         where μ̂_{a,t}(X_t) is an estimator of μ_a(ν) using the samples until round t. The estimator is constructed from a martingale difference sequence.
      5. Recommend â_T^{AIPW} = argmax_{a ∈ [K]} μ̂_{a,T}^{AIPW} as the estimated best treatment arm.
    ■ The RS-AIPW strategy is asymptotically optimal under the small gap, as is the NA-AIPW strategy.
  26. Expected Simple Regret
    ■ Simple regret: r_T(ν) = μ_{a*(ν)}(ν) − μ_{â_T}(ν) under a bandit model ν (there is randomness in â_T(ν)).
    ■ Expected simple regret: 𝔼_ν[r_T(ν)] = 𝔼_ν[μ_{a*(ν)}(ν) − μ_{â_T}(ν)].
      • 𝔼_ν is the expectation over â_T(ν).
      • The expected simple regret represents an expected relative welfare loss.
    ■ Let the gap between the expected outcomes of arms a, b ∈ [K] be Δ_{a,b}(ν) = μ_a(ν) − μ_b(ν).
    ■ Using the gap Δ_{a,b}(ν), the expected simple regret can be decomposed as
      𝔼_ν[r_T(ν)] = 𝔼_ν[μ_{a*(ν)}(ν) − μ_{â_T}(ν)] = Σ_{b ∉ 𝒜*(ν)} Δ_{a*(ν),b}(ν) ℙ_ν(â_T = b),
      where 𝒜*(ν) is the set of the best treatment arms.
    ► Consider minimization of the expected simple regret.
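
    A small numerical check of the decomposition above. The expected outcomes and recommendation probabilities below are hypothetical values chosen for illustration, not results from the paper.

        # E[r_T] = sum over suboptimal b of Delta_{a*,b}(nu) * P(hat a_T = b).
        mu = {1: 0.05, 2: 0.03, 3: 0.01}             # expected outcomes; best arm a* = 1
        p_recommend = {1: 0.90, 2: 0.07, 3: 0.03}    # hypothetical recommendation probabilities
        a_star = max(mu, key=mu.get)
        expected_simple_regret = sum(
            (mu[a_star] - mu[b]) * p_recommend[b] for b in mu if b != a_star
        )
        print(expected_simple_regret)                # 0.02*0.07 + 0.04*0.03 = 0.0026
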
  27. Expected Simple Regret
    ■ The rate of the expected simple regret: 𝔼_ν[r_T(ν)] = Σ_{b ∉ 𝒜*(ν)} Δ_{a*(ν),b}(ν) ℙ_ν(â_T = b).
    ■ For some constant C > 0, the rate is given as 𝔼_ν[r_T(ν)] = Σ_{b ∉ 𝒜*(ν)} Δ_{a*(ν),b}(ν) exp(−CT Δ_{a*(ν),b}(ν)²).
      The speed at which Δ_{a*(ν),b}(ν) converges to zero affects the rate of 𝔼_ν[r_T(ν)] in T:
      1. Δ_{a*(ν),b}(ν) goes to zero more slowly than 1/√T → for some increasing function g(T), 𝔼_ν[r_T(ν)] ≈ exp(−g(T)).
      2. Δ_{a*(ν),b}(ν) = C_1/√T for some constant C_1 → for some constant C_2 > 0, 𝔼_ν[r_T(ν)] ≈ C_2/√T.
      3. Δ_{a*(ν),b}(ν) goes to zero faster than 1/√T → 𝔼_ν[r_T(ν)] ≈ o(1/√T).
      → In the worst case, Δ_{a*(ν),b} converges to zero at the rate C_1/√T (Bubeck et al., 2011). cf. Hirano and Porter (2008) and the limits-of-experiments framework.
    ✓ When Δ_{a,b}(ν) is independent of T, evaluation of the expected simple regret is equivalent to that of the probability of misidentification.
      • ℙ_ν(â_T = b) converges to zero at an exponential speed if Δ_{a,b}(ν) is independent of T, and Δ_{a*(ν),b} is ignorable.
      → For some constant (⋆), if ℙ_ν(â_T = b) ≈ exp(−T(⋆)) for b ∉ 𝒜*(ν), then 𝔼_ν[r_T(ν)] ≈ exp(−T(⋆)). cf. Kasy and Sautmann (2021).
  28. Optimal Strategy when the Support is Bounded
    ► The result of Bubeck et al. (2011).
    ■ Definition: the class of bounded bandit models 𝒫_{[α,β]}: for any a ∈ [K], Y_a ∈ [α, β].
    ■ Lower bound: for T ≥ K ≥ 2, any strategy satisfies sup_{ν ∈ 𝒫_{[0,1]}} 𝔼_ν[r_T(ν)] ≥ (1/20) √(K/T).
    ■ Upper bound: for uniform sampling (RCT) + recommending the arm with the highest sample average,
      sup_{ν ∈ 𝒫_{[0,1]}} 𝔼_ν[r_T(ν)] ≤ √(K log K / T).
    ► Consider the worst-case expected simple regret for bandit models with infinite support (location-shift).
  29. Lower Bound
    ■ Definition: null-bounded strategy.
      • A strategy is null bounded if, under a bandit model ν such that μ_1(ν) = ⋯ = μ_K(ν), the strategy draws each treatment arm as 𝔼_ν[(1/T) Σ_{t=1}^T 1[A_t = a] | X_t = x] ≥ ε for some small constant ε > 0.
    Theorem (Lower bound)
      • Consider a location-shift bandit class 𝒫.
      • For any null-bounded strategy, the expected simple regret is lower bounded as follows:
        if K = 2, sup_{ν ∈ 𝒫} √T 𝔼_ν[r_T(ν)] ≥ C_1 ∫ (σ_1(x) + σ_2(x))² ζ(x) dx + o(1);
        if K ≥ 3, sup_{ν ∈ 𝒫} √T 𝔼_ν[r_T(ν)] ≥ C_2 Σ_{a ∈ [K]} ∫ σ_a²(x) ζ(x) dx + o(1),
      where C_1 and C_2 are some constants.
  30. Target Allocation Ratio
    ■ The target allocation ratios for minimization of the expected simple regret are the same as those for minimization of the probability of misidentification.
      • If K = 2, w*(1 | X_t) = σ_1(X_t) / (σ_1(X_t) + σ_2(X_t)) and w*(2 | X_t) = σ_2(X_t) / (σ_1(X_t) + σ_2(X_t)).
      • If K ≥ 3, w*(a | X_t) = σ_a²(X_t) / Σ_{b ∈ [K]} σ_b²(X_t).
    ■ Why? → Both settings consider worst-case scenarios.
      • In the worst case, only the variances affect the target sample allocation.
    ► We can use the RS-AIPW strategy proposed for minimization of the probability of misidentification. This strategy is minimax optimal in the sense that its leading factor matches the lower bound.
  31. Summary
    ► Asymptotically optimal strategy in two-armed Gaussian BAI with a fixed budget.
    ■ Evaluate the performance of BAI strategies by the probability of misidentification.
      • The Neyman allocation rule is globally optimal when the standard deviations are known:
        the Neyman allocation is known to be asymptotically optimal when the potential outcomes of the two treatment arms follow Gaussian distributions with any mean parameters and fixed variances.
    ■ Result of Kato, Ariu, Imaizumi, Nomura, and Qin (2022).
      • The standard deviations are unknown and estimated during the experiment.
      • Under the NA-AIPW strategy, the probability of misidentification matches the lower bound when the gap between the expected outcomes goes to zero.
      → The strategy based on the Neyman allocation is worst-case optimal (small-gap optimal).
  32. References
    • Kato, M., Ariu, K., Imaizumi, M., Nomura, M., and Qin, C. (2022), "Best Arm Identification with a Fixed Budget under a Small Gap."
    • Audibert, J.-Y., Bubeck, S., and Munos, R. (2010), "Best Arm Identification in Multi-Armed Bandits," in COLT.
    • Bubeck, S., Munos, R., and Stoltz, G. (2011), "Pure Exploration in Finitely-Armed and Continuous-Armed Bandits," Theoretical Computer Science.
    • Carpentier, A. and Locatelli, A. (2016), "Tight (Lower) Bounds for the Fixed Budget Best Arm Identification Bandit Problem," in COLT.
    • Chen, C.-H., Lin, J., Yücesan, E., and Chick, S. E. (2000), "Simulation Budget Allocation for Further Enhancing the Efficiency of Ordinal Optimization," Discrete Event Dynamic Systems.
    • Garivier, A. and Kaufmann, E. (2016), "Optimal Best Arm Identification with Fixed Confidence," in COLT.
    • Glynn, P. and Juneja, S. (2004), "A Large Deviations Perspective on Ordinal Optimization," in Proceedings of the 2004 Winter Simulation Conference, IEEE.
    • Kaufmann, E., Cappé, O., and Garivier, A. (2016), "On the Complexity of Best-Arm Identification in Multi-Armed Bandit Models," JMLR.
    • Lai, T. and Robbins, H. (1985), "Asymptotically Efficient Adaptive Allocation Rules," Advances in Applied Mathematics.
    • Manski, C. F. (2000), "Identification Problems and Decisions under Ambiguity: Empirical Analysis of Treatment Response and Normative Analysis of Treatment Choice," Journal of Econometrics.
    • Manski, C. F. (2002), "Treatment Choice under Ambiguity Induced by Inferential Problems," Journal of Statistical Planning and Inference.
    • Manski, C. F. (2004), "Statistical Treatment Rules for Heterogeneous Populations," Econometrica.
    • Manski, C. F. and Tetenov, A. (2016), "Sufficient Trial Size to Inform Clinical Practice," Proceedings of the National Academy of Sciences.
    • Stoye, J. (2009), "Minimax Regret Treatment Choice with Finite Samples," Journal of Econometrics.