how_to_ab_test_with_confidence_railsconf.pdf
Frederick Cheung
April 13, 2021
Transcript
How to A/B Test with confidence @fglc2
Photo by Ivan Aleksic on Unsplash
The Plan
• Intro: What's an A/B Test?
• Test setup errors
• Errors during the test
• Test analysis errors
• Best practices
Photo by Javier Allegue Barros on Unsplash
What is an A/B test?
"Buy Now" or "Order"
[Figure: two groups of users, one shown a "Buy Now" button and the other an "Order" button]
Buy Now: 49 orders. Order: 56 orders.
Is the difference real?
Not just buttons!
• Layouts / designs / flows
• Algorithms (e.g. recommendation engines)
• Anything where you can measure a difference
Jargon
Significance
• Is the observed difference just noise?
• p value of 0.05 = 5% chance of seeing a difference this big when there is no real difference
• The statistical test depends on the type of metric
• No guarantees on the magnitude of the difference
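As a rough illustration of the significance bullets above, here is a minimal sketch of one common test for conversion-style metrics, a pooled two-proportion z-test. The method name and the group size of 1,000 users are assumptions made for the example (the slides give the order counts, 49 and 56, but not the group sizes), and other metric types would need a different test.

```ruby
# Two-sided p-value for the difference between two conversion rates,
# using a pooled two-proportion z-test (normal approximation).
def two_proportion_p_value(conversions_a, users_a, conversions_b, users_b)
  rate_a = conversions_a.to_f / users_a
  rate_b = conversions_b.to_f / users_b
  pooled = (conversions_a + conversions_b).to_f / (users_a + users_b)
  standard_error = Math.sqrt(pooled * (1 - pooled) * (1.0 / users_a + 1.0 / users_b))
  z = (rate_b - rate_a) / standard_error
  # P(|Z| >= |z|) for a standard normal Z
  Math.erfc(z.abs / Math.sqrt(2))
end

# 49 vs 56 orders, assuming (hypothetically) 1,000 users in each group
puts two_proportion_p_value(49, 1000, 56, 1000)
```

With these assumed group sizes the p-value comes out well above 0.05, so a gap of 49 vs 56 orders could easily be noise.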
Test power
Photo by Michael Longmire on Unsplash
Test power
• How small a change do I want to detect?
• 10% to 20% is much easier to measure than 0.1% to 0.2%
Sample size
• Check this is feasible!
• Ideally you don't look / change anything until sample size reached
• Be wary of very short experiments
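To make the feasibility check concrete, here is a sketch of a standard approximate sample size formula for comparing two conversion rates. The method name and the choice of a two-sided 5% significance level with 80% power are assumptions for the example, not values taken from the talk.

```ruby
# z-values for a two-sided 5% significance level and 80% power (assumed defaults)
Z_ALPHA = 1.959964
Z_POWER = 0.841621

# Approximate number of users needed in EACH group to detect a change
# from base_rate to target_rate.
def sample_size_per_group(base_rate, target_rate)
  variance = base_rate * (1 - base_rate) + target_rate * (1 - target_rate)
  effect = target_rate - base_rate
  ((Z_ALPHA + Z_POWER)**2 * variance / effect**2).ceil
end

puts sample_size_per_group(0.10, 0.20)   # roughly 200 users per group
puts sample_size_per_group(0.001, 0.002) # tens of thousands of users per group
```

Both cases double the conversion rate, but the low base rate needs over a hundred times more users, which is why a test on a rare event may not be feasible on a small site.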
Bayesian A/B testing
• Allows you to model your existing knowledge & uncertainties
• Can be better with low base rates
• The underlying maths are a bit more complicated
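As a sketch of what "a bit more complicated" can look like in practice: if each variant's conversion rate is modelled with a Beta distribution, the probability that B really beats A has a known closed form for integer parameters. The method names, the uniform Beta(1, 1) prior and the group size of 1,000 users are all assumptions made for this example.

```ruby
# log of the Beta function, via the log-gamma function
def log_beta(a, b)
  Math.lgamma(a).first + Math.lgamma(b).first - Math.lgamma(a + b).first
end

# P(rate_b > rate_a) when rate_a ~ Beta(alpha_a, beta_a) and
# rate_b ~ Beta(alpha_b, beta_b); alpha_b must be a positive integer.
def probability_b_beats_a(alpha_a, beta_a, alpha_b, beta_b)
  (0...alpha_b).sum do |i|
    Math.exp(
      log_beta(alpha_a + i, beta_a + beta_b) -
      Math.log(beta_b + i) -
      log_beta(1 + i, beta_b) -
      log_beta(alpha_a, beta_a)
    )
  end
end

# Uniform Beta(1, 1) prior updated with 49/1000 and 56/1000 conversions
# (the group sizes are hypothetical)
puts probability_b_beats_a(1 + 49, 1 + 951, 1 + 56, 1 + 944)
```

The result is a direct statement ("B is better with probability x") rather than a p-value, and prior knowledge can be expressed by starting from something other than Beta(1, 1).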
Test setup errors
Group Randomisation Photo by Macau Photo Agency on Unsplash
class User < ActiveRecord::Base
  # Naive split: the group depends only on whether the id is even or odd
  def ab_group
    if id % 2 == 0
      'experiment'
    else
      'control'
    end
  end
end
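require 'digest'

class User < ActiveRecord::Base
  # Hash the experiment name together with the id, so that each
  # experiment gets its own split of the users
  def ab_group(experiment)
    hash = Digest::SHA1.hexdigest("#{experiment}-#{id}").to_i(16)
    if hash % 2 == 0
      'experiment'
    else
      'control'
    end
  end
end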
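One way to sanity-check the hash-based assignment is to run it over a range of ids and confirm that the split is close to 50/50 and that two experiments do not put the same users together. The sketch below uses a plain function instead of the ActiveRecord model, and the experiment names are made up for the example.

```ruby
require 'digest'

# Same hashing scheme as on the slide, as a plain function of a user id
def ab_group(experiment, user_id)
  hash = Digest::SHA1.hexdigest("#{experiment}-#{user_id}").to_i(16)
  hash.even? ? 'experiment' : 'control'
end

user_ids = (1..10_000)
checkout = user_ids.map { |id| ab_group('checkout-button', id) }
homepage = user_ids.map { |id| ab_group('homepage-layout', id) }

# Group sizes for one experiment: should be close to 5,000 each
puts checkout.tally

# Share of users who land in the same group in both experiments:
# close to 0.5 means the two splits are independent of each other
same = checkout.zip(homepage).count { |a, b| a == b }
puts same.fdiv(user_ids.size)
```

With the `id % 2` version every experiment would produce the identical split, so the second number would be exactly 1.0.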
Non random split
• Older users in one group
• Newer users in the other group
• New users were less loyal!
Starting too early
[Funnel diagram, test started at the home page: each variant goes from Home Page 50,000 Users to 30,000 Users to 15,000 Users to Checkout Page A / Checkout Page B with 5,000 Users each; A ends with 2600 conversions, B with 2500 conversions]
[Funnel diagram, test started at the checkout page: Home Page 100,000 Users to 60,000 Users to 30,000 Users, then Checkout Page A / Checkout Page B with 5,000 Users each; A ends with 2600 conversions, B with 2500 conversions]
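A rough sketch of why the starting point matters, using the numbers from the funnel slides: the same 2600 vs 2500 conversions are compared once against the 50,000 users per variant who saw the home page and once against the 5,000 per variant who reached the checkout page. The z statistic of a pooled two-proportion test and the 1.96 threshold (5% significance, two-sided) are assumptions chosen for the illustration.

```ruby
# z statistic for the difference between two conversion rates (pooled)
def z_statistic(conversions_a, users_a, conversions_b, users_b)
  pooled = (conversions_a + conversions_b).to_f / (users_a + users_b)
  standard_error = Math.sqrt(pooled * (1 - pooled) * (1.0 / users_a + 1.0 / users_b))
  (conversions_a.to_f / users_a - conversions_b.to_f / users_b) / standard_error
end

# Users enter the test at the home page: most never see the change
from_home_page = z_statistic(2600, 50_000, 2500, 50_000)

# Users enter the test only when they reach the checkout page
from_checkout = z_statistic(2600, 5_000, 2500, 5_000)

puts format('home page: z = %.2f, checkout: z = %.2f', from_home_page, from_checkout)
puts "significant from home page? #{from_home_page.abs > 1.96}"
puts "significant from checkout?  #{from_checkout.abs > 1.96}"
```

The users who never reach the checkout page add noise without adding information, so the same result can fall on either side of the threshold depending on where the test starts.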
Not agreeing setup
• Scope of the test (what pages, users, countries ...)
• What is the goal? How do we measure it?
• Agree *one* metric
Errors during the test Photo by Sarah Kilian on Unsplash
A test measures the impact of all differences
Ecommerce Service Recommendation Service
Ecommerce Service Recommendation Service 10x more crashes
Repeated significance testing
• Invalidates significance calculation
• Difficult to resist!
• Stick to your Sample Size
• This is fine with Bayesian A/B testing
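The effect of repeated checking can be seen in a small simulation. The sketch below runs many tests in which both variants have the same true conversion rate, so every "significant" result is a false positive, and compares checking once at the end with stopping at the first significant peek. All names and parameters (10% conversion rate, 1,000 users per group, a peek every 100 users, a 1.96 threshold) are assumptions for the illustration.

```ruby
# Pooled two-proportion z-test at the 5% level, for two groups of n users each
def significant?(conversions_a, conversions_b, n)
  pooled = (conversions_a + conversions_b).fdiv(2 * n)
  return false if pooled.zero? || pooled == 1.0

  standard_error = Math.sqrt(pooled * (1 - pooled) * 2.0 / n)
  ((conversions_a - conversions_b).fdiv(n) / standard_error).abs > 1.96
end

# Simulates tests where both variants share the same true rate
def simulate(runs:, users_per_group:, peek_every:, rate:, rng:)
  final_only = 0
  with_peeking = 0
  runs.times do
    conversions_a = conversions_b = 0
    any_peek = false
    last_check = false
    (1..users_per_group).each do |n|
      conversions_a += 1 if rng.rand < rate
      conversions_b += 1 if rng.rand < rate
      next unless (n % peek_every).zero? || n == users_per_group

      last_check = significant?(conversions_a, conversions_b, n)
      any_peek ||= last_check
    end
    final_only += 1 if last_check
    with_peeking += 1 if any_peek
  end
  { final_only: final_only.fdiv(runs), with_peeking: with_peeking.fdiv(runs) }
end

puts simulate(runs: 500, users_per_group: 1000, peek_every: 100,
              rate: 0.1, rng: Random.new(42))
```

Because the final check is one of the peeks, the peeking false positive rate can never be lower than the check-once rate, and in practice it tends to be noticeably higher.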
Test analysis errors Photo by Isaac Smith on Unsplash
Do the maths
• Use the appropriate statistical test
• Significance on one metric does not imply significance on another
Outliers Photo by Ministerie van Buitenlandse Zaken
Understanding the domain
[Chart: results by week for weeks 1 to 3; y-axis from -4 to 0]
[Chart: results by week for weeks 1 to 7; y-axis from -4 to 8]
Results splitting
We aren't neutral
If the result is 'right'
If the result is 'wrong'
• Start looking at result splits
• Start digging for potential errors
• Hey, what about this other metric?
• A well-documented test can help
Best practices Photo by SpaceX on Unsplash
Don't reinvent the wheel
• Split, Vanity gems do a good job
• Consider platforms (Optimizely, Google Optimize)
• But understand your tool and its drawbacks
Resist the urge to check/tinker
• Repeated significance testing
• Changing the test while it is running (restart the test if necessary)
A/A tests
• Do the full process but with no difference between the variants
• Allows you to practise
Be wary of overtesting
• Let's test everything!
• Can be paralysing/time consuming
• Not a substitute for vision / talking to your users
Document your test
• Metric (inc. outliers etc.)
• Success criteria
• Scope
• Sample size / test power
• Significance calculation/process
• Meaningful variant names
Thank you! @fglc2
Further Reading
• https://www.evanmiller.org/how-not-to-run-an-ab-test.html
• https://making.lyst.com/bayesian-calculator/
• https://www.chrisstucchio.com/blog/2014/bayesian_ab_decision_rule.html
@fglc2