
Counterfactual Machine Learning 入門 / Introduction to Counterfactual ML

This deck was presented at 第28回 Machine Learning 15minutes! (https://machine-learning15minutes.connpass.com/event/97195/).

Kazuki Taniguchi

September 29, 2018

Transcript

1. Introduction
• Job title: Research Scientist
• Research areas
  • Basics of Machine Learning
  • Response Prediction
  • Counterfactual ML
• Past work (other than research)
  • Development of MLaaS
  • Algorithm development for a DSP
2. Supervised Learning
• Feature (context): x_i; prediction: ŷ_i = f(x_i); label: y_i
[Slide figure: example digits 1 9 8 7 3 2 as inputs; the model predicts 1 9 5 7 3 2, so one prediction is a miss and the rest are correct.]
3. Interactive Learning [2]
• Feature (context): x_i; action: a_i = π(x_i); reward: r_i
[Slide figure: an ad placement shows the chosen action to a user, and the user's click-or-not response is observed as the reward.]
4. Comparison with Supervised Learning
[Slide figure: supervised learning observes the label for every example, while interactive learning observes feedback (a click) only for the chosen action.]
• Evaluating the actions that were not selected is counterfactual.
• When evaluating a new policy, its counterfactual actions cannot be evaluated.
5. Comparison with Contextual Bandit
• The problem setting is the same.
• Counterfactual ML mainly deals with offline (batch) learning.
• Unlike the online setting, evaluation is easy to carry out, which is an advantage.
• Contextual bandits update the policy online.
• The idea behind counterfactual ML is the same as evaluating a contextual bandit's policy (offline evaluation) [3].
Evaluation is covered in detail in an AI Lab Research Blog article [3], so it is omitted from this talk.
6. Definitions
• Data: D = ((x_1, y_1, δ_1, p_1), ..., (x_n, y_n, δ_n, p_n)) (see the logging sketch below)
  • x_i: context
  • y_i: labels (multi-label setting)
  • δ_i: reward
  • p_i: propensity score (described later)
• Policy: π (context → action), y_i = π(x_i)
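
The following is a minimal sketch (not from the slides) of how data in this format can be collected: a stochastic logging policy samples an action for each context and records the probability it assigned to that action as the propensity. The softmax policy `pi0_probs`, the toy reward, and all names are illustrative assumptions.

```python
import numpy as np

# Toy softmax logging policy over 4 actions; weights are illustrative assumptions.
def pi0_probs(x, n_actions=4):
    w0 = np.linspace(-1.0, 1.0, n_actions)       # fixed toy parameters of pi_0
    scores = w0 * x.sum()
    e = np.exp(scores - scores.max())
    return e / e.sum()

rng = np.random.default_rng(0)
log = []                                          # D = ((x_i, y_i, delta_i, p_i), ...)
for _ in range(5):
    x = rng.normal(size=3)                        # context x_i
    probs = pi0_probs(x)
    y = rng.choice(len(probs), p=probs)           # logged action/label y_i ~ pi_0(.|x_i)
    delta = float(rng.random() < 0.3)             # observed reward delta_i (toy)
    p = probs[y]                                  # propensity p_i = pi_0(y_i | x_i)
    log.append((x, y, delta, p))
```

The essential point is that p_i is recorded at logging time; it is what makes the importance-sampling estimators on the next slides computable.
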
7. Counterfactual Risk Minimization
• Unbiased estimation via importance sampling:
  R̂(π) = (1/n) Σ_{i=1..n} δ_i · π(y_i|x_i) / π_0(y_i|x_i) = (1/n) Σ_{i=1..n} δ_i · π(y_i|x_i) / p_i
  • δ_i: loss
  • π_0: logging policy (its probability for the logged action is the propensity score p_i)
• Introducing a clipping constant M gives the IPS (Inverse Propensity Score) estimator [4] (a code sketch follows):
  R̂^M(π) = (1/n) Σ_{i=1..n} δ_i · min{ M, π(y_i|x_i) / p_i }
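
As a concrete illustration of the clipped IPS estimator above, here is a short NumPy sketch. The function name and the toy numbers are assumptions made for the example; `pi_new` stands for π(y_i|x_i), the probability the evaluated policy assigns to each logged action.

```python
import numpy as np

def ips_estimate(delta, p, pi_new, M=10.0):
    """Clipped IPS estimate of R(pi) from logged bandit feedback.

    delta  : losses delta_i observed under the logging policy pi_0
    p      : propensities p_i = pi_0(y_i | x_i) recorded at logging time
    pi_new : probabilities pi(y_i | x_i) the evaluated policy assigns to the logged actions
    M      : clipping constant bounding the importance weights
    """
    delta, p, pi_new = map(np.asarray, (delta, p, pi_new))
    weights = np.minimum(M, pi_new / p)           # clipped importance weights
    return float(np.mean(delta * weights))

# Toy usage with three logged interactions.
print(ips_estimate(delta=[1.0, 0.0, 1.0], p=[0.2, 0.5, 0.1], pi_new=[0.4, 0.3, 0.3]))
```
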
8. Counterfactual Risk Minimization
• CRM objective: argmin_h  R̂^M(h) + λ √( Var_h(u) / n )
• CRM (Counterfactual Risk Minimization) minimizes an upper bound on the generalization error (see the paper for details).
• The variance term acts as a data-dependent regularizer.
9. POEM [5]
• Uses the same form of policy as in classification (linear + softmax):
  π_w(y|x) = exp(w·φ(x, y)) / Σ_{y'∈Y} exp(w·φ(x, y'))
• Trained according to the following objective (a code sketch follows):
  w* = argmin_{w∈R^d}  ū_w + λ √( Var_w(u) / n )
  u_i^w ≡ δ_i · min{ M, exp(w·φ(x_i, y_i)) / ( p_i Σ_{y'∈Y} exp(w·φ(x_i, y')) ) }
  ū_w ≡ (1/n) Σ_{i=1..n} u_i^w
  Var_w(u) ≡ (1/(n-1)) Σ_{i=1..n} (u_i^w − ū_w)²
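
A compact sketch of this objective, under the assumption that φ(x, y) is the one-hot joint feature map (i.e., one weight vector per action). The optimizer choice (scipy's L-BFGS-B with numerical gradients), hyperparameters, and toy data are my own assumptions, not the paper's setup.

```python
import numpy as np
from scipy.optimize import minimize

def poem_objective(w, X, Y, delta, p, n_actions, M=10.0, lam=0.1):
    """u_bar_w + lam * sqrt(Var_w(u) / n) for a linear softmax policy pi_w(y|x)."""
    n, d = X.shape
    W = w.reshape(n_actions, d)
    scores = X @ W.T                              # w . phi(x, y) for every action y
    scores -= scores.max(axis=1, keepdims=True)
    probs = np.exp(scores)
    probs /= probs.sum(axis=1, keepdims=True)     # pi_w(y | x_i)
    pi_logged = probs[np.arange(n), Y]            # pi_w(y_i | x_i)
    u = delta * np.minimum(M, pi_logged / p)      # u_i^w
    return u.mean() + lam * np.sqrt(u.var(ddof=1) / n)

# Toy logged data in the (x_i, y_i, delta_i, p_i) format.
rng = np.random.default_rng(0)
n, d, n_actions = 200, 5, 3
X = rng.normal(size=(n, d))
Y = rng.integers(0, n_actions, size=n)
delta = rng.random(n)                             # losses in [0, 1]
p = np.full(n, 1.0 / n_actions)                   # uniform logging policy

res = minimize(poem_objective, np.zeros(n_actions * d),
               args=(X, Y, delta, p, n_actions), method="L-BFGS-B")
w_star = res.x.reshape(n_actions, d)              # learned policy parameters
```
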
10. Experiments
• Dataset (multi-label experiments)
• Supervised-to-Bandit Conversion [6] (a code sketch follows):
  (1) Learn the logging policy π_0 (a CRF) on 5% of the full data.
  (2) Use the learned logging policy to assign labels y to the remaining 95% of the data.
  (3) Compute the feedback δ from y and the true labels y* (Hamming loss).
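
A sketch of this conversion on a single-label toy problem. The slide uses a CRF on multi-label data with Hamming loss; here a scikit-learn logistic regression stands in for the logging policy and 0/1 loss stands in for the Hamming loss, purely for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 10))
y_star = (X[:, 0] + X[:, 1] > 0).astype(int)      # true supervised labels y*

# (1) learn the logging policy pi_0 on 5% of the data
X_log, X_bandit, y_log, y_star_bandit = train_test_split(
    X, y_star, train_size=0.05, random_state=0)
pi0 = LogisticRegression().fit(X_log, y_log)

# (2) pi_0 samples a label y for each example in the remaining 95%
probs = pi0.predict_proba(X_bandit)
y = np.array([rng.choice(probs.shape[1], p=pr) for pr in probs])
p = probs[np.arange(len(y)), y]                   # propensities p_i

# (3) feedback delta is computed from y and y* (0/1 loss instead of Hamming loss)
delta = (y != y_star_bandit).astype(float)
bandit_log = list(zip(X_bandit, y, delta, p))
```
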
11. Note
• Accurate predictions cannot be made for labels that do not appear in the log.
  e.g. when a new ad is added
[Slide figure: ads A, B, C appear in the log; an ad present in the log can be evaluated with counterfactual ML (OK), while a newly added ad absent from the log cannot (NG).]
• The above is an extreme example; there are also methods that make this possible.
12. More
• The research team behind [5] continues to publish work in this area:
  • "The Self-Normalized Estimator for Counterfactual Learning"
  • "Recommendations as Treatments: Debiasing Learning and Evaluation"
  • "Unbiased Learning-to-Rank with Biased Feedback"
  • "Deep Learning with Logged Bandit Feedback"
• Microsoft Research also has many researchers working on this topic.
If you are interested, please look into these!
13. Research is also being strengthened at AI Lab
• Paper related to this talk: Yusuke Narita, Shota Yasui, Kohei Yata, "Efficient Counterfactual Learning from Bandit Feedback", arXiv, 2018 (https://arxiv.org/abs/1809.03084)
• For details, see our website: https://adtech.cyberagent.io/ailab/
14. References
1. SIGIR 2016 Tutorial on Counterfactual Evaluation and Learning (http://www.cs.cornell.edu/~adith/CfactSIGIR2016/)
2. ICML 2017 Tutorial on Real World Interactive Learning (http://hunch.net/~rwil/)
3. Evaluation of Bandit Algorithms and Causal Inference (バンディットアルゴリズムの評価と因果推論), AI Lab Research Blog (https://adtech.cyberagent.io/research/archives/199)
4. Counterfactual Reasoning and Learning Systems, 2017 (https://arxiv.org/abs/1209.2355)
5. Counterfactual Risk Minimization: Learning from Logged Bandit Feedback (https://arxiv.org/abs/1502.02362)