Best Arm Identification with a Fixed Budget under a Small Gap

Allied Social Sciences Association (ASSA) 2023 Annual Meeting


MasaKat0

April 20, 2025

Transcript

  1. Best Arm Identification with a Fixed Budget under a Small Gap

    Masahiro Kato (CyberAgent, Inc. / University of Tokyo)
    Session: Adaptive Experimental Design for Policy Choice and Policy Learning
    Kato, Ariu, Imaizumi, Nomura, and Qin (2022), https://arxiv.org/abs/2201.04469.
  2. Experimental Design for Better Decision-Making

    Ø Keywords: causal inference, decision-making, and experimental design.
    ■ Treatment arm (arm / treatment / policy). ex. drugs, advertisements, and economic policies.
      Each treatment arm has a potential outcome. By drawing an arm, we can observe the outcome.
    Ø Economics: from treatment effect estimation to treatment choice (decision-making).
    • Treatment (policy) choice: choose the best treatment arm (policy) using observations. cf. Manski (2000, 2002, 2004), Stoye (2009, 2012), Manski and Tetenov (2016).
    Ø Multi-armed bandit problem: optimize decision-making via adaptive experiments.
    • Regret minimization: choose the treatment arms to maximize the cumulative reward during the experiment. cf. Gittins (1979), and Lai and Robbins (1985).
    • Best arm identification (BAI): choose the best treatment arm after the experiment. cf. Bubeck et al. (2011), Kaufmann et al. (2016), and Kasy and Sautmann (2021).
    cf. Average treatment effect estimation via adaptive experimental design: van der Laan (2008), Hahn, Hirano, and Karlan (2011).
  3. BAI with a Fixed Budget

    Ø Consider an adaptive experiment where we can draw a treatment arm in each round.
      Draw a treatment arm = allocate a treatment arm to an experimental unit and observe the realized outcome.
    Ø BAI with a fixed budget.
    • The number of rounds of the adaptive experiment (budget / sample size) is predetermined.
    • Recommend the best arm from multiple candidates after the experiment.
    ↔ BAI with fixed confidence: continue the adaptive experiment until a certain criterion is satisfied. cf. sequential experiments.
    Ø Evaluation metrics:
    • Probability of misidentifying the best treatment arm.
    • Expected simple regret (difference between the expected outcomes of the best and the recommended arm).
    ■ Goal: recommend the best arm with a smaller probability of misidentification or expected simple regret.
  4. Contents

    ■ Discuss asymptotically optimal algorithms in BAI with a fixed budget.
    • Consider a case with two treatment arms. ex. treatment and control groups.
    • Potential outcomes follow Gaussian distributions.
    • Minimization of the probability of misidentification.
    ■ Neyman allocation rule:
    • Draw the treatment arms in proportion to the standard deviations of the potential outcomes.
    • When the standard deviations are known, the Neyman allocation is known to be optimal. cf. Chen et al. (2000), Glynn and Juneja (2004), and Kaufmann et al. (2016).
  5. Contents

    Ø Today, I introduce our paper: Kato, Ariu, Imaizumi, Nomura, and Qin (2022), "Optimal Best Arm Identification in Two-Armed Bandits with a Fixed Budget under a Small Gap."*
    • The standard deviations are unknown and estimated in an adaptive experiment.
    • The worst-case asymptotic optimality of the Neyman allocation rule.
    ■ In addition to the above paper, I introduce several new findings from our project.
    • Beyond the Neyman allocation rule.
    • Minimization of the expected simple regret.
    * https://arxiv.org/abs/2201.04469.
  6. Optimal Best Arm Identification in Two-Armed Bandits with a Fixed Budget under a Small Gap

    Kato, Ariu, Imaizumi, Nomura, and Qin (2022)
  7. Problem Setting

    ■ Adaptive experiment with $T$ rounds: $[T] = \{1, 2, \dots, T\}$.
    ■ Binary treatment arms: $\{1, 0\}$.
    • Each treatment arm $a \in \{1, 0\}$ has a potential outcome $Y_a \in \mathbb{R}$. The distributions of $(Y_1, Y_0)$ do not change across rounds, and $Y_1$ and $Y_0$ are independent.
    • At round $t$, by drawing a treatment arm $a \in \{1, 0\}$, we observe $Y_{a,t}$, which is an i.i.d. copy of $Y_a$.
    Ø Definition: two-armed Gaussian bandit models.
    • A class $\mathcal{M}$ of joint distributions $\nu$ (bandit models) of $(Y_1, Y_0)$.
    • Under $\nu \in \mathcal{M}$, $Y_1$ and $Y_0$ follow Gaussian distributions $\mathcal{N}(\mu_1, \sigma_1^2)$ and $\mathcal{N}(\mu_0, \sigma_0^2)$.
    • $\sigma_a^2$ is the variance of the potential outcome $Y_a$, which is fixed across bandit models $\nu \in \mathcal{M}$. (A code sketch of this setup follows this slide.)
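
To make the setup concrete, here is a minimal Python sketch of this environment. The class name and interface are my own and are not from the paper.

```python
import numpy as np

class TwoArmedGaussianBandit:
    """Two-armed Gaussian bandit model: arm a in {1, 0} has outcome
    Y_a ~ N(mu_a, sigma_a^2), drawn i.i.d. across rounds."""

    def __init__(self, mu1, mu0, sigma1, sigma0, seed=0):
        self.mu = {1: mu1, 0: mu0}
        self.sigma = {1: sigma1, 0: sigma0}
        self.rng = np.random.default_rng(seed)

    def draw(self, arm):
        # Drawing an arm reveals one i.i.d. copy of its potential outcome.
        return self.rng.normal(self.mu[arm], self.sigma[arm])

    def best_arm(self):
        # The best arm is the one with the higher expected outcome.
        return max((1, 0), key=lambda a: self.mu[a])
```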
  8. Problem Setting

    [Diagram: in each round $t = 1, \dots, T$, the experimenter draws an arm $A_t$ and observes the outcome $Y_{A_t, t}$.]
    ■ Best treatment arm: the arm with the highest expected outcome, $a^* = \arg\max_{a \in \{1, 0\}} \mu_a$. For simplicity, we assume that the best arm is unique.
    ■ Bandit process: in each round $t \in \{1, 2, \dots, T\}$, under a bandit model $\nu \in \mathcal{M}$,
    • Draw a treatment arm $A_t \in \{1, 0\}$.
    • Observe an outcome $Y_{A_t, t}$ of the chosen arm $A_t$.
    • Stop the trial at round $t = T$.
    • After the final round $T$, the algorithm recommends an estimated best treatment arm $\hat{a}_T \in \{1, 0\}$.
  9. Best Arm Identification (BAI) Strategy

    ■ Goal: find the best treatment arm $a^*$ efficiently with respect to some performance metric.
    ■ Our actions: using past observations, we can optimize $A_t$ during the bandit process.
    ■ These actions are the components of a BAI algorithm, called a BAI strategy.
    • Sampling rule $(A_1, A_2, \dots)$: how we draw a treatment arm in each round $t$.
    • Recommendation rule $\hat{a}_T \in \{1, 0\}$: which treatment arm we recommend as the best arm.
    In other settings, the stopping time (budget) of an adaptive experiment can also be a random variable, as in sequential testing.
  10. Contributions

    ■ Let $\mathbb{P}_\nu$ and $\mathbb{E}_\nu$ be the probability law and expectation under a bandit model $\nu \in \mathcal{M}$.
    ■ Probability of misidentification $\mathbb{P}_\nu(\hat{a}_T \ne a^*)$ = the probability of the event that we recommend a suboptimal arm instead of the best arm.
    Ø Main result of Kato, Ariu, Imaizumi, Nomura, and Qin (2022).
    ■ An optimal strategy for minimizing the probability of misidentification under a small gap.
    • Propose a strategy using the Neyman allocation rule and the AIPW estimator.
    • Derive a lower bound on $\mathbb{P}_\nu(\hat{a}_T \ne a^*)$ that no strategy can beat.
    • The probability of misidentification matches the lower bound when $\mu_1 - \mu_0 \to 0$.
  11. Probability of Misidentification

    ■ Assume that the best arm $a^*$ is unique.
    • $\mathbb{P}_\nu(\hat{a}_T \ne a^*)$ converges to $0$ at an exponential speed: $\mathbb{P}_\nu(\hat{a}_T \ne a^*) = \exp(-T \cdot (\star))$ for a constant $(\star)$.
    Ø Consider evaluating the term $(\star)$ by $\limsup_{T \to \infty} -\frac{1}{T} \log \mathbb{P}_\nu(\hat{a}_T \ne a^*)$.
    • A performance lower (upper) bound on $\mathbb{P}_\nu(\hat{a}_T \ne a^*)$ is an upper (lower) bound on $\limsup_{T \to \infty} -\frac{1}{T} \log \mathbb{P}_\nu(\hat{a}_T \ne a^*)$. cf. Kaufmann et al. (2016).
    ■ Large deviation analysis: a tight evaluation of $\mathbb{P}_\nu(\hat{a}_T \ne a^*)$.
  12. Lower Bound

    ■ Kaufmann et al. (2016) give a lower bound for two-armed Gaussian bandit models.
    • To derive a lower bound, we restrict the class of strategies.
    Ø Definition: consistent strategy.
    • A strategy is called consistent for a class $\mathcal{M}$ if for each $\nu \in \mathcal{M}$, $\mathbb{P}_\nu(\hat{a}_T = a^*) \to 1$ as $T \to \infty$.

    Lower bound (Theorem 12 in Kaufmann et al., 2016)
    • For any bandit model $\nu \in \mathcal{M}$, any consistent strategy satisfies
      $\limsup_{T \to \infty} -\frac{1}{T} \log \mathbb{P}_\nu(\hat{a}_T \ne a^*) \le \frac{\Delta^2}{2(\sigma_1 + \sigma_0)^2}$, where $\Delta = \mu_1 - \mu_0$.
    ■ No strategy can exceed this convergence rate of the probability of misidentification: a lower bound on $\mathbb{P}_\nu(\hat{a}_T \ne a^*)$ is an upper bound on $-\frac{1}{T} \log \mathbb{P}_\nu(\hat{a}_T \ne a^*)$.
    Ø Optimal strategy: a strategy under which $\mathbb{P}_\nu(\hat{a}_T \ne a^*)$ matches the lower bound.
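
For intuition about where this constant comes from, a back-of-the-envelope large-deviation calculation (in the spirit of Glynn and Juneja, 2004; not the paper's proof) for a fixed allocation $w \in (0,1)$ to arm 1 with a sample-mean recommendation gives

\[
\mathbb{P}_\nu(\hat{a}_T \ne a^*) \approx \exp\!\left(-\frac{T\,\Delta^2}{2\left(\sigma_1^2/w + \sigma_0^2/(1-w)\right)}\right),
\qquad
\max_{w \in (0,1)} \frac{\Delta^2}{2\left(\sigma_1^2/w + \sigma_0^2/(1-w)\right)}
= \frac{\Delta^2}{2(\sigma_1 + \sigma_0)^2}
\ \text{ at } \ w = \frac{\sigma_1}{\sigma_1 + \sigma_0},
\]

so the Neyman allocation is exactly the fixed allocation whose rate matches the lower bound above.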
  13. Neyman Allocation Rule

    ■ Target allocation ratios.
    • The ratio of the expected number of arm draws $\frac{1}{T}\mathbb{E}_\nu\left[\sum_{t=1}^T \mathbb{1}[A_t = a]\right]$ under a sampling rule:
      $\frac{1}{T}\mathbb{E}_\nu\left[\sum_{t=1}^T \mathbb{1}[A_t = a]\right] \Big/ \sum_{b \in \{1,0\}} \frac{1}{T}\mathbb{E}_\nu\left[\sum_{t=1}^T \mathbb{1}[A_t = b]\right]$.
    Ø Neyman allocation rule.
    • The target allocation ratio is the ratio of the standard deviations: draw the treatment arms so that
      $\frac{1}{T}\mathbb{E}_\nu\left[\sum_{t=1}^T \mathbb{1}[A_t = 1]\right] : \frac{1}{T}\mathbb{E}_\nu\left[\sum_{t=1}^T \mathbb{1}[A_t = 0]\right] = \sigma_1 : \sigma_0$.
    ■ When the standard deviations $\sigma_1$ and $\sigma_0$ are known, the Neyman allocation is optimal. cf. Glynn and Juneja (2004), and Kaufmann et al. (2016).
    Ø An optimal strategy is unknown when the standard deviations are unknown.
    ■ In our strategy, we estimate $(\sigma_1, \sigma_0)$ and draw arm $a$ with probability $\frac{\hat{\sigma}_a}{\hat{\sigma}_1 + \hat{\sigma}_0}$ (a numeric sketch follows this slide).
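
As a numeric illustration (my own sketch, not from the slides), the following compares the large-deviation rate $\Delta^2 / \left(2(\sigma_1^2/w + \sigma_0^2/(1-w))\right)$ under the Neyman allocation with the rate under uniform sampling, using the variances of Setting 1 from the simulation slide below.

```python
import numpy as np

def misid_rate(delta, sigma1, sigma0, w):
    """Large-deviation rate of the misidentification probability for a fixed
    allocation w to arm 1 in the two-armed Gaussian case."""
    return delta**2 / (2.0 * (sigma1**2 / w + sigma0**2 / (1.0 - w)))

delta, sigma1, sigma0 = 0.04, 1.0, np.sqrt(0.2)     # Setting 1: mu1 - mu0 = 0.04
w_neyman = sigma1 / (sigma1 + sigma0)                # Neyman allocation, approx 0.69
print(misid_rate(delta, sigma1, sigma0, w_neyman))   # equals Delta^2 / (2 (sigma1 + sigma0)^2)
print(misid_rate(delta, sigma1, sigma0, 0.5))        # uniform sampling: strictly smaller rate
```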
  14. NA-AIPW Strategy

    ■ Proposed strategy: the NA-AIPW strategy.
    • NA: sampling rule following the Neyman Allocation rule.
    • AIPW: recommendation rule using an Augmented Inverse Probability Weighting (AIPW) estimator.
    Ø Procedure of the NA-AIPW strategy (a code sketch follows this slide):
    1. In each round $t \in [T]$, estimate $\sigma_a^2$ using the observations obtained up to round $t$.
    2. Draw treatment arm $a \in \{1, 0\}$ with probability $\hat{w}_t(a) = \frac{\hat{\sigma}_{a,t}}{\hat{\sigma}_{1,t} + \hat{\sigma}_{0,t}}$ (Neyman allocation rule).
    3. In round $T$, estimate $\mu_a$ using the AIPW estimator:
       $\hat{\mu}_{a,T}^{\mathrm{AIPW}} = \frac{1}{T}\sum_{t=1}^T \left[ \frac{\mathbb{1}[A_t = a]\left(Y_{a,t} - \hat{\mu}_{a,t}\right)}{\hat{w}_t(a)} + \hat{\mu}_{a,t} \right]$,
       where $\hat{\mu}_{a,t} = \frac{1}{\sum_{s=1}^{t} \mathbb{1}[A_s = a]} \sum_{s=1}^{t} \mathbb{1}[A_s = a]\, Y_{a,s}$ is an estimator of $\mu_a$ using observations up to round $t$.
    4. Recommend $\hat{a}_T^{\mathrm{AIPW}} = \arg\max_{a \in \{1,0\}} \hat{\mu}_{a,T}^{\mathrm{AIPW}}$ as the estimated best treatment arm.
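
Below is a minimal, self-contained Python sketch of the NA-AIPW procedure as I read it from the slide. The function and variable names are my own, and the per-round variance and mean estimates use only past observations, which is one reasonable reading of steps 1-3.

```python
import numpy as np

def na_aipw_trial(mu, sigma, T, seed=0):
    """One run of the NA-AIPW strategy on a two-armed Gaussian bandit.
    mu, sigma: dicts {1: ..., 0: ...} of means and standard deviations.
    Returns the recommended arm."""
    rng = np.random.default_rng(seed)
    obs = {1: [], 0: []}                      # outcomes observed per arm so far
    w_hist, a_hist, y_hist, mu_hist = [], [], [], []

    for t in range(T):
        # Step 1: estimate the standard deviations from past observations
        # (fall back to 1.0 while an arm has fewer than two observations).
        sd = {a: (np.std(obs[a], ddof=1) if len(obs[a]) >= 2 else 1.0) for a in (1, 0)}
        # Step 2: Neyman allocation probability for arm 1.
        w1 = sd[1] / (sd[1] + sd[0])
        # Running sample means from past observations (0.0 if none yet).
        m = {a: (np.mean(obs[a]) if obs[a] else 0.0) for a in (1, 0)}
        a_t = 1 if rng.random() < w1 else 0
        y_t = rng.normal(mu[a_t], sigma[a_t])
        w_hist.append({1: w1, 0: 1.0 - w1})
        a_hist.append(a_t)
        y_hist.append(y_t)
        mu_hist.append(m)
        obs[a_t].append(y_t)

    # Step 3: AIPW estimate of each arm's mean.
    mu_aipw = {}
    for a in (1, 0):
        terms = [
            (1.0 if a_hist[t] == a else 0.0) * (y_hist[t] - mu_hist[t][a]) / w_hist[t][a]
            + mu_hist[t][a]
            for t in range(T)
        ]
        mu_aipw[a] = float(np.mean(terms))
    # Step 4: recommend the arm with the larger AIPW estimate.
    return max((1, 0), key=lambda a: mu_aipw[a])
```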
  15. Upper Bound and Asymptotic Optimality

    Theorem (Upper bound)
    • Assume some regularity conditions.
    • Suppose that the estimator $\hat{w}_t$ converges to $w^*$ almost surely (at a certain rate).
    • Then, for any $\nu \in \mathcal{M}$ such that $0 < \mu_1 - \mu_0 \le C$ for some constant $C > 0$, the upper bound is
      $\limsup_{T \to \infty} -\frac{1}{T} \log \mathbb{P}_\nu(\hat{a}_T^{\mathrm{AIPW}} \ne a^*) \ge \frac{\Delta^2}{2(\sigma_1 + \sigma_0)^2} - \bar{C}\left(\Delta^3 + \Delta^4\right)$,
      where $\bar{C}$ is some constant.
    • This result implies that $\lim_{\Delta \to 0} \limsup_{T \to \infty} -\frac{1}{\Delta^2 T} \log \mathbb{P}_\nu(\hat{a}_T^{\mathrm{AIPW}} \ne a^*) \ge \frac{1}{2(\sigma_1 + \sigma_0)^2} - o(1)$.
    • Under a small-gap regime ($\Delta = \mu_1 - \mu_0 \to 0$), the upper and lower bounds match = the NA-AIPW strategy is asymptotically optimal under the small gap.
    ■ When potential outcomes follow Bernoulli distributions, an RCT (drawing each arm with probability 1/2) is approximately optimal (Kaufmann et al., 2016).
  16. On the Optimality under the Small Gap

    Ø Asymptotically optimal strategy under a small gap.
    • This result implies the worst-case optimality of the proposed algorithm.
    ■ A technical reason for the small gap.
    • There is no known optimal strategy when the gap is fixed and the standard deviations are unknown.
      ↔ When the standard deviations are known, the Neyman allocation is known to be optimal. cf. Chen et al. (2000), Glynn and Juneja (2004), Kaufmann et al. (2016), and Carpentier and Locatelli (2016).
    ■ Conjecture.
    • When the gap is small, we can ignore the estimation error of the standard deviations:
      the estimation error is negligible compared with the difficulty of identifying the best arm under the small gap.
    ✓ Optimality under a large gap (constant $\mu_1 - \mu_0$) is an open issue.
  17. Simulation Studies

    Ø Empirical performance of the NA-AIPW (NA) strategy.
    ■ We compare the NA strategy with $\alpha$-elimination (Alpha) and uniform sampling (Uniform); a simulation sketch follows this slide.
    • $\alpha$-elimination is a strategy using the Neyman allocation when the standard deviations are known (Kaufmann et al., 2016).
    • Uniform sampling draws each treatment arm with equal probability: a randomized controlled trial (RCT) without adaptation.
    • Setting 1: $\mu_1 = 0.05$, $\mu_0 = 0.01$, $\sigma_1^2 = 1$, $\sigma_0^2 = 0.2$.
    • Setting 2: $\mu_1 = 0.05$, $\mu_0 = 0.01$, $\sigma_1^2 = 1$, $\sigma_0^2 = 0.1$.
    • Strategies using the Neyman allocation outperform the RCT.
    [Figure: simulation results for Setting 1 and Setting 2.] We draw treatment arm 1 more often in Setting 2 than in Setting 1.
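
A minimal way to reproduce this kind of comparison, reusing the hypothetical na_aipw_trial sketch from above and an analogous uniform-sampling baseline. The parameter values follow Setting 1 from the slide; the run counts and helper names are my own.

```python
import numpy as np

def uniform_trial(mu, sigma, T, seed=0):
    """RCT baseline: draw each arm with probability 1/2, recommend the higher sample mean."""
    rng = np.random.default_rng(seed)
    obs = {1: [], 0: []}
    for _ in range(T):
        a = int(rng.random() < 0.5)
        obs[a].append(rng.normal(mu[a], sigma[a]))
    return max((1, 0), key=lambda a: np.mean(obs[a]) if obs[a] else -np.inf)

# Setting 1 from the slide: mu1 = 0.05, mu0 = 0.01, sigma1^2 = 1, sigma0^2 = 0.2.
mu = {1: 0.05, 0: 0.01}
sigma = {1: 1.0, 0: np.sqrt(0.2)}
T, runs = 1000, 200
for name, trial in [("NA-AIPW", na_aipw_trial), ("Uniform", uniform_trial)]:
    errors = sum(trial(mu, sigma, T, seed=r) != 1 for r in range(runs))
    print(name, "misidentification rate:", errors / runs)
```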
  18. Limitations of the Neyman Allocation Rule

    Ø I briefly introduce my ongoing studies.
    • Several results are still conjectures and not yet published.
    ■ Limitations of the Neyman allocation rule.
    • It only considers a case with two treatment arms.
    • It does not use covariates (contextual information).
    ■ We consider extensions of the NA-AIPW strategy.
  19. Problem Setting with Multiple Treatment Arms and Contextual Information

    ■ $K$ treatment arms: $[K] = \{1, 2, \dots, K\}$.
    ■ Covariate (context): a $d$-dimensional random variable $X \in \mathcal{X} \subset \mathbb{R}^d$; side information such as a feature of arms.
    • $(Y_1, \dots, Y_K)$ are conditionally independent given $X$.
    ■ Let $\nu$ be a joint distribution of $(Y_1, \dots, Y_K, X)$, called a bandit model.
    • Denote the mean outcome of treatment arm $a$ by $\mu_a(\nu) = \mathbb{E}_\nu[Y_{a,t}]$ and the conditional mean outcome by $\mu_a(\nu)(x) = \mathbb{E}_\nu[Y_{a,t} \mid X_t = x]$.
    ■ Best treatment arm: the arm with the highest expected outcome, $a^*(\nu) = \arg\max_{a \in [K]} \mu_a(\nu)$. When there are multiple best treatment arms, we denote the set by $\mathcal{A}^*(\nu)$.
    ■ The best arm is not conditioned on a covariate (context) $X_t = x$. ex. Average treatment effect estimation via adaptive experiments, such as Hahn, Hirano, and Karlan (2011).
  20. Problem Setting

    ■ Bandit process: in each round $t \in \{1, 2, \dots, T\}$, under a bandit model $\nu$,
    • Observe a context (covariate) $X_t \in \mathcal{X}$.
    • Draw a treatment arm $A_t \in [K]$.
    • Observe an outcome $Y_{A_t, t}$ of the chosen arm $A_t$.
    • Stop the trial at round $t = T$.
    • After the final round $T$, the algorithm recommends an estimated best treatment arm $\hat{a}_T \in [K]$.
    [Diagram: in each round $t = 1, \dots, T$, the experimenter observes $X_t$, draws an arm $A_t$ among arms $1, \dots, K$, and observes $Y_{A_t, t}$.]
  21. Location-Shift Bandit Class

    Ø Definition: location-shift bandit class.
    ■ We call a class $\mathcal{P}$ of bandit models a location-shift bandit class if:
    • For all $\nu \in \mathcal{P}$ and $x \in \mathcal{X}$, the conditional variance of $Y_{a,t}$ is constant across models:
      for all $a \in [K]$ and any $x \in \mathcal{X}$, there exists a constant $\sigma_a^2(x)$ such that $\mathrm{Var}_\nu(Y_{a,t} \mid X_t = x) = \sigma_a^2(x)$ for all $\nu \in \mathcal{P}$.
    • For all $\nu \in \mathcal{P}$, $X$ has the same distribution, whose density we denote by $\zeta(x)$.
    ex. Gaussian distributions with fixed variances.
  22. Class of Strategies

    ■ To derive lower bounds in the case with multiple treatment arms, we further restrict the class of BAI strategies beyond consistent strategies.
    Ø Definition: asymptotically invariant strategy.
    • A strategy is asymptotically invariant for $\mathcal{P}$ if for any $\nu, \upsilon \in \mathcal{P}$, for all $a \in [K]$ and $x \in \mathcal{X}$,
      $\mathbb{E}_\nu\left[\sum_{t=1}^T \mathbb{1}[A_t = a] \,\middle|\, X_t = x\right] = \mathbb{E}_\upsilon\left[\sum_{t=1}^T \mathbb{1}[A_t = a] \,\middle|\, X_t = x\right]$.
    ✓ I conjecture that if the potential outcomes follow a particular distribution, such as Bernoulli distributions, this restriction may not be necessary, and an RCT (uniform sampling) is optimal.
  23. Lower Bound under the Small Gap

    • Consider a location-shift bandit class $\mathcal{P}$ and $\nu \in \mathcal{P}$.
    • Assume that there is a unique best treatment arm $a^*(\nu)$.
    • Small-gap assumption: for all $a \in [K]$, there exists $\Delta > 0$ such that $\mu_{a^*(\nu)}(\nu) - \mu_a(\nu) < \Delta$.

    Theorem (Lower bound)
    • Then, for any $\nu$ in a location-shift class, any consistent and asymptotically invariant strategy satisfies:
      if $K = 2$: $\limsup_{T \to \infty} -\frac{1}{T} \log \mathbb{P}_\nu(\hat{a}_T \ne a^*(\nu)) \le \frac{\Delta^2}{2 \int \left(\sigma_1(x) + \sigma_2(x)\right)^2 \zeta(x)\, dx} + C_1 \Delta^3$;
      if $K \ge 3$: $\limsup_{T \to \infty} -\frac{1}{T} \log \mathbb{P}_\nu(\hat{a}_T \ne a^*(\nu)) \le \frac{\Delta^2}{2 \sum_{a \in [K]} \int \sigma_a^2(x)\, \zeta(x)\, dx} + C_2 \Delta^3$,
      where $C_1, C_2 > 0$ are some constants.
  24. Target Allocation Ratio

    Ø Target allocation ratio $w^*: [K] \times \mathcal{X} \to (0, 1)$ such that $\sum_{a \in [K]} w^*(a \mid x) = 1$ for any $x \in \mathcal{X}$.
      (Target allocation ratio: the ratio of the expected number of arm draws $\frac{1}{T}\mathbb{E}_\nu\left[\sum_{t=1}^T \mathbb{1}[A_t = a] \,\middle|\, X_t = x\right]$.)
    ■ The lower bound suggests drawing arm $a$ with the following probability $w^*(a \mid X_t)$ (see the code sketch after this slide):
    • If $K = 2$: $w^*(1 \mid X_t) = \frac{\sigma_1(X_t)}{\sigma_1(X_t) + \sigma_2(X_t)}$ and $w^*(2 \mid X_t) = \frac{\sigma_2(X_t)}{\sigma_1(X_t) + \sigma_2(X_t)}$.
    • If $K \ge 3$: $w^*(a \mid X_t) = \frac{\sigma_a^2(X_t)}{\sum_{b \in [K]} \sigma_b^2(X_t)}$.
    Ø Beyond the Neyman allocation rule: when $K \ge 3$, draw arms in proportion to the variances.
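
A small sketch of these target allocation ratios as a function; the helper is my own and assumes the conditional standard deviations at a given context are already available.

```python
import numpy as np

def target_allocation(sigmas_at_x):
    """Target allocation ratio w*(. | x) given the conditional standard
    deviations sigma_a(x) of the K arms (slide 24):
    K = 2: proportional to the standard deviations (Neyman allocation);
    K >= 3: proportional to the variances."""
    s = np.asarray(sigmas_at_x, dtype=float)
    weights = s if len(s) == 2 else s**2
    return weights / weights.sum()

print(target_allocation([1.0, 0.5]))        # two arms  -> [2/3, 1/3]
print(target_allocation([1.0, 0.5, 0.5]))   # three arms -> [2/3, 1/6, 1/6]
```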
  25. RS-AIPW Strategy

    ■ Algorithm (strategy): the RS-AIPW strategy. RS: Random Sampling of each treatment arm.
    Ø Procedure of the contextual RS-AIPW strategy (a sketch of the estimator follows this slide):
    1. In each round $t \in [T]$, estimate $\sigma_a(X_t)$.
    2. Using the estimators of $\sigma_a(X_t)$, estimate $w^*$.
    3. Draw a treatment arm with the estimated probability $\hat{w}_t$.
    4. In round $T$, estimate $\mu_a(\nu)$ using the AIPW estimator:
       $\hat{\mu}_{a,T}^{\mathrm{AIPW}} = \frac{1}{T}\sum_{t=1}^T \left[ \frac{\mathbb{1}[A_t = a]\left(Y_{a,t} - \hat{\mu}_{a,t}(X_t)\right)}{\hat{w}_t(a \mid X_t)} + \hat{\mu}_{a,t}(X_t) \right]$,
       where $\hat{\mu}_{a,t}(X_t)$ is an estimator of $\mu_a(\nu)$ using samples up to round $t$. The centered summands form a martingale difference sequence.
    5. Recommend $\hat{a}_T^{\mathrm{AIPW}} = \arg\max_{a \in [K]} \hat{\mu}_{a,T}^{\mathrm{AIPW}}$ as the estimated best treatment arm.
    ■ The RS-AIPW strategy is asymptotically optimal under the small gap, as is the NA-AIPW strategy.
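
A compact sketch of step 4 only (the contextual AIPW estimate), assuming the per-round histories, sampling probabilities, and fitted conditional mean estimates are already available; the names are hypothetical.

```python
import numpy as np

def aipw_estimate(arm, A, Y, X, w_hat, mu_hat):
    """Contextual AIPW estimate of mu_a (step 4 of the RS-AIPW strategy).
    A[t], Y[t], X[t]: drawn arm, observed outcome, context at round t.
    w_hat(t, a, x): sampling probability used at round t.
    mu_hat(t, a, x): conditional mean estimate built from rounds before t."""
    T = len(A)
    terms = [
        (A[t] == arm) * (Y[t] - mu_hat(t, arm, X[t])) / w_hat(t, arm, X[t])
        + mu_hat(t, arm, X[t])
        for t in range(T)
    ]
    return float(np.mean(terms))
```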
  26. Expected Simple Regret

    ■ Simple regret: $r_T(\nu) = \mu_{a^*(\nu)}(\nu) - \mu_{\hat{a}_T}(\nu)$ under a bandit model $\nu$ (random through $\hat{a}_T$).
    ■ Expected simple regret: $\mathbb{E}_\nu[r_T(\nu)] = \mathbb{E}_\nu\left[\mu_{a^*(\nu)}(\nu) - \mu_{\hat{a}_T}(\nu)\right]$.
    • $\mathbb{E}_\nu$ is the expectation over $\hat{a}_T$ under $\nu$.
    • The expected simple regret represents an expected relative welfare loss.
    ■ Let the gap between the expected outcomes of arms $a, b \in [K]$ be $\Delta_{a,b}(\nu) = \mu_a(\nu) - \mu_b(\nu)$.
    ■ Using this gap, the expected simple regret can be decomposed as
      $\mathbb{E}_\nu[r_T(\nu)] = \mathbb{E}_\nu\left[\mu_{a^*(\nu)}(\nu) - \mu_{\hat{a}_T}(\nu)\right] = \sum_{b \notin \mathcal{A}^*(\nu)} \Delta_{a^*(\nu), b}(\nu)\, \mathbb{P}_\nu(\hat{a}_T = b)$,
      where $\mathcal{A}^*(\nu)$ is the set of best treatment arms.
    Ø Consider minimization of the expected simple regret.
  27. Expected Simple Regret

    ■ The rate of the expected simple regret: $\mathbb{E}_\nu[r_T(\nu)] = \sum_{b \notin \mathcal{A}^*(\nu)} \Delta_{a^*(\nu), b}(\nu)\, \mathbb{P}_\nu(\hat{a}_T = b)$.
    ■ For some constant $C > 0$, the rate is given by $\mathbb{E}_\nu[r_T(\nu)] = \sum_{b \notin \mathcal{A}^*(\nu)} \Delta_{a^*(\nu), b}(\nu) \exp\left(-C T \Delta_{a^*(\nu), b}^2(\nu)\right)$.
    The speed at which $\Delta_{a^*(\nu), b}(\nu)$ converges to zero determines the rate of $\mathbb{E}_\nu[r_T(\nu)]$ in $T$:
    1. $\Delta_{a^*(\nu), b}(\nu)$ vanishes more slowly than $1/\sqrt{T}$ → for some increasing function $g(T)$, $\mathbb{E}_\nu[r_T(\nu)] \approx \exp(-g(T))$.
    2. $\Delta_{a^*(\nu), b}(\nu) = C_1/\sqrt{T}$ for some constant $C_1$ → for some constant $C_2 > 0$, $\mathbb{E}_\nu[r_T(\nu)] \approx C_2/\sqrt{T}$.
    3. $\Delta_{a^*(\nu), b}(\nu)$ vanishes faster than $1/\sqrt{T}$ → $\mathbb{E}_\nu[r_T(\nu)] \approx o(1/\sqrt{T})$.
    → In the worst case, $\Delta_{a^*(\nu), b}$ converges to zero at the rate $C_1/\sqrt{T}$ (Bubeck et al., 2011). cf. Hirano and Porter (2008) and the limit-of-experiments framework.
    ✓ When $\Delta_{a,b}(\nu)$ is independent of $T$, evaluating the expected simple regret is equivalent to evaluating the probability of misidentification.
    • $\mathbb{P}_\nu(\hat{a}_T = b)$ converges to zero at an exponential speed if $\Delta_{a,b}(\nu)$ is independent of $T$, so the factor $\Delta_{a^*(\nu), b}$ is ignorable.
    → For some constant $(\star)$, if $\mathbb{P}_\nu(\hat{a}_T = b) \approx \exp(-T(\star))$ for $b \notin \mathcal{A}^*(\nu)$, then $\mathbb{E}_\nu[r_T(\nu)] \approx \exp(-T(\star))$. cf. Kasy and Sautmann (2021).
  28. Optimal Strategy when the Support is Bounded

    Ø The result of Bubeck et al. (2011).
    ■ Definition: the class of bounded bandit models $\mathcal{P}_{[\alpha, \beta]}$: for any $a \in [K]$, $Y_a \in [\alpha, \beta]$.
    ■ Lower bound: for $T \ge K \ge 2$, any strategy satisfies $\sup_{\nu \in \mathcal{P}_{[0,1]}} \mathbb{E}_\nu[r_T(\nu)] \ge \frac{1}{20}\sqrt{\frac{K}{T}}$.
    ■ Upper bound: for uniform sampling (RCT) plus recommending the arm with the highest sample average, $\sup_{\nu \in \mathcal{P}_{[0,1]}} \mathbb{E}_\nu[r_T(\nu)] \le \sqrt{\frac{K \log K}{T}}$.
    Ø Consider the worst-case expected simple regret for bandit models with infinite support (location-shift).
  29. Lower Bound

    ■ Definition: null bounded strategy.
    • A strategy is null bounded if, under a bandit model $\nu$ such that $\mu_1(\nu) = \cdots = \mu_K(\nu)$, the strategy draws each treatment arm as $\mathbb{E}_\nu\left[\frac{1}{T}\sum_{t=1}^T \mathbb{1}[A_t = a] \,\middle|\, X_t = x\right] \ge \epsilon$ for some small constant $\epsilon > 0$.

    Theorem (Lower bound)
    • Consider a location-shift bandit class $\mathcal{P}$.
    • For any null bounded strategy, the expected simple regret is lower bounded as follows:
      if $K = 2$: $\sup_{\nu \in \mathcal{P}} \sqrt{T}\, \mathbb{E}_\nu[r_T(\nu)] \ge C_1 \left(\int \left(\sigma_1(x) + \sigma_2(x)\right)^2 \zeta(x)\, dx\right)^{1/2} + o(1)$;
      if $K \ge 3$: $\sup_{\nu \in \mathcal{P}} \sqrt{T}\, \mathbb{E}_\nu[r_T(\nu)] \ge C_2 \left(\sum_{a \in [K]} \int \sigma_a^2(x)\, \zeta(x)\, dx\right)^{1/2} + o(1)$,
      where $C_1$ and $C_2$ are some constants.
  30. Target Allocation Ratio

    ■ The target allocation ratios for minimizing the expected simple regret are the same as those for minimizing the probability of misidentification.
    • If $K = 2$: $w^*(1 \mid X_t) = \frac{\sigma_1(X_t)}{\sigma_1(X_t) + \sigma_2(X_t)}$ and $w^*(2 \mid X_t) = \frac{\sigma_2(X_t)}{\sigma_1(X_t) + \sigma_2(X_t)}$.
    • If $K \ge 3$: $w^*(a \mid X_t) = \frac{\sigma_a^2(X_t)}{\sum_{b \in [K]} \sigma_b^2(X_t)}$.
    ■ Why? → Both settings consider worst-case scenarios.
    • In the worst case, only the variances affect the target sample allocation.
    Ø We use the RS-AIPW strategy proposed for minimizing the probability of misidentification. This strategy is minimax optimal in the sense that its leading factor matches the lower bound.
  31. Summary

    Ø An asymptotically optimal strategy for two-armed Gaussian BAI with a fixed budget.
    ■ We evaluate the performance of BAI strategies by the probability of misidentification.
    • The Neyman allocation rule is globally optimal when the standard deviations are known:
      the Neyman allocation is known to be asymptotically optimal when the potential outcomes of the two treatment arms follow Gaussian distributions with any mean parameters and fixed variances.
    ■ Result of Kato, Ariu, Imaizumi, Nomura, and Qin (2022).
    • The standard deviations are unknown and estimated during the experiment.
    • Under the NA-AIPW strategy, the probability of misidentification matches the lower bound when the gap between the expected outcomes goes to zero.
    → The strategy based on the Neyman allocation is worst-case optimal (small-gap optimal).
  32. References

    • Kato, M., Ariu, K., Imaizumi, M., Nomura, M., and Qin, C. (2022), "Best Arm Identification with a Fixed Budget under a Small Gap."
    • Audibert, J.-Y., Bubeck, S., and Munos, R. (2010), "Best Arm Identification in Multi-Armed Bandits," in COLT.
    • Bubeck, S., Munos, R., and Stoltz, G. (2011), "Pure Exploration in Finitely-Armed and Continuous-Armed Bandits," Theoretical Computer Science.
    • Carpentier, A. and Locatelli, A. (2016), "Tight (Lower) Bounds for the Fixed Budget Best Arm Identification Bandit Problem," in COLT.
    • Chen, C.-H., Lin, J., Yücesan, E., and Chick, S. E. (2000), "Simulation Budget Allocation for Further Enhancing the Efficiency of Ordinal Optimization," Discrete Event Dynamic Systems.
    • Garivier, A. and Kaufmann, E. (2016), "Optimal Best Arm Identification with Fixed Confidence," in COLT.
    • Glynn, P. and Juneja, S. (2004), "A Large Deviations Perspective on Ordinal Optimization," in Proceedings of the 2004 Winter Simulation Conference, IEEE.
    • Kaufmann, E., Cappé, O., and Garivier, A. (2016), "On the Complexity of Best-Arm Identification in Multi-Armed Bandit Models," JMLR.
    • Lai, T. and Robbins, H. (1985), "Asymptotically Efficient Adaptive Allocation Rules," Advances in Applied Mathematics.
    • Manski, C. F. (2000), "Identification Problems and Decisions under Ambiguity: Empirical Analysis of Treatment Response and Normative Analysis of Treatment Choice," Journal of Econometrics.
    • Manski, C. F. (2002), "Treatment Choice under Ambiguity Induced by Inferential Problems," Journal of Statistical Planning and Inference.
    • Manski, C. F. (2004), "Statistical Treatment Rules for Heterogeneous Populations," Econometrica.
    • Manski, C. F. and Tetenov, A. (2016), "Sufficient Trial Size to Inform Clinical Practice," Proceedings of the National Academy of Sciences.
    • Stoye, J. (2009), "Minimax Regret Treatment Choice with Finite Samples," Journal of Econometrics.