[DevDojo] Experiments at Mercari - 2025

This DevDojo presentation focuses on experimentation, including what it is and why it's run, with a specific look at Mercari's approach. It covers the process of implementing experiments, from design to analysis, emphasizing the importance of establishing causality over mere correlations.
The session also delves into the technical details of Mercari's experimentation platform, such as user assignment, population distribution, and event logging.

mercari

July 03, 2025

Transcript

  1. Agenda: Experimentation • What is experimentation? • Why run experiments? • Implementing Experiments • Discussion Problem Session
  2. Scope of this Session on Experimentation: Design & Planning, Implementation, Execution, Analysis. This is primarily a PM role; however, our long-term goal is that anyone should be able to run an experiment. Metrics analysis is owned primarily by the Feature Flags Evaluation (FFE) and BI teams.
  3. Who is familiar with experiments? (also sometimes called A/B tests) Related terms: Statistical Power, A/B Tests, Metrics, Control & Treatment, Confounding Variables, SUTVA Violation, P-Values, Scientific Method, Statistical Significance, Critical Values, Sample Ratio Mismatch.
  4. What is experimentation? Galileo's Leaning Tower of Pisa experiment: “Galileo discovered through this experiment that the objects fell with the same acceleration, proving his prediction true, while at the same time disproving Aristotle's theory of gravity (which states that objects fall at speed proportional to their mass).” Original image source: Theresa knott at English Wikibooks, Pisa experiment, modified with vector arrows, CC BY-SA 3.0.
  5. Mercari Example: Personal Inventory - Hiding Prices. (1) Control: show both the total inventory price and individual item prices. (2) Treatment: hide the total inventory price. (3) Treatment: hide both prices. (The circled ① and ② markers refer to the annotated screenshots.)
  6. What is experimentation at Mercari? Hypothesis: a proposed explanation of why a particular observation occurs. Experiment: a repeatable method to validate or invalidate a given hypothesis. Analysis: what we can learn from our observations.
  7. Why Run Experiments? Do you think ChatGPT is correct? (yes/no) 👾 ChatGPT (3.5 and 4) ❓ Which answer do you think is the most correct?
    1: “A way to increase GMV and therefore Mercari’s stock price.” (Your DevDojo presenter: @rhomel)
    2: “An empirical method of acquiring knowledge.” (Wikipedia’s description of “Scientific Method”, https://en.wikipedia.org/wiki/Scientific_method)
    3: “Establish a data-driven culture that informs rather than relies on the HiPPO (Highest Paid Person’s Opinion).” (Published book: “Trustworthy Online Controlled Experiments” by Ron Kohavi (Amazon, Microsoft), Diane Tang (Google), and Ya Xu (LinkedIn))
    4: “Compare variations of the same web page to determine which will generate the best outcomes.” (Nielsen Norman User Research Group on A/B testing, https://www.nngroup.com/articles/ab-testing-and-ux-research/)
  8. Why Run Experiments? Which answer do you think is the most correct?
    1: “A way to increase GMV and therefore Mercari’s stock price.” This is the end result that we hope to achieve for our company.
    2: “An empirical method of acquiring knowledge.” This is the purpose of experimentation. It is the best answer because it creates an environment where the other answers become potential outcomes.
    3: “Establish a data-driven culture that informs rather than relies on the HiPPO (Highest Paid Person’s Opinion).” This is one of the effects of understanding the importance of experimentation.
    4: “Compare variations of the same web page to determine which will generate the best outcomes.” This describes how we tend to conduct our experiments.
  9. Why Run Experiments? “An empirical method of acquiring knowledge.” What we can learn from our observations: apply new knowledge about our service and marketplace. But in order to learn anything, we need to establish causality; simply finding correlations in data is not enough.
  10. Correlation vs Causality: Can we learn from correlations? Original chart source, unmodified, from tylervigen.com/spurious-correlations, CC BY 4.0. Data sources: U.S. Department of Agriculture and National Science Foundation.
  11. Correlation vs Causality: Can we learn from correlations? Should we increase app crashes to increase purchases? What might cause this correlation? *Don’t worry, this is made-up data.
  12. Counterfactual: the counterfactual is the result had the treatment not been applied. For example, suppose we give a treatment such as a medicine to relieve a patient’s headache. Since this participant is part of the treatment group, we cannot know what would have happened to them had they received the placebo instead of the actual treatment. We cannot know the counterfactual result of a treatment, and there is additionally some stochastic behavior between participants. But we can use randomized controlled experiments to gather enough statistical evidence that a treatment causes a particular effect within a population. (Diagram: two groups A and B, with treatment T applied to one of them.)
  13. Use randomized controlled experiments to gather enough statistical evidence that a treatment causes a particular effect within a population. (Diagram: a control group C with measured outcome d and a treatment group T with measured outcome e, both drawn from the same population.) If we assume the null hypothesis that d = e, then after applying treatment T to a random sample similar to the control, there should be no measured difference between the two groups.
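
    A minimal sketch of how such statistical evidence is commonly evaluated, using simulated data and a two-sample t-test; this is illustrative only and is not Mercari's actual analysis tooling.

      # Sketch: testing the null hypothesis d = e with a two-sample (Welch's) t-test.
      # The data below is simulated, not real experiment data.
      import numpy as np
      from scipy import stats

      rng = np.random.default_rng(42)

      # Simulated per-user metric (e.g. purchases) for control (outcome "d") and treatment (outcome "e").
      control = rng.normal(loc=1.00, scale=0.5, size=10_000)
      treatment = rng.normal(loc=1.02, scale=0.5, size=10_000)

      # Does the observed difference exceed what chance alone would explain?
      t_stat, p_value = stats.ttest_ind(treatment, control, equal_var=False)
      print(f"t = {t_stat:.3f}, p = {p_value:.4f}")

      # Reject the null hypothesis (d = e) only if p is below the chosen significance level.
      alpha = 0.05
      print("statistically significant difference" if p_value < alpha else "no significant difference")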
  14. Experiment Design: a description of our hypothesis and how we plan on using our experiment to prove or disprove it. PMs will often prepare the Experiment Design Doc. Common components of an Experiment Design Document at Mercari:
    • Hypothesis
    • Variables and their associated treatments
    • Metrics
      ◦ Goal Metrics
      ◦ Guardrail Metrics
    • Analysis
    • Conclusions and Future Actions
    In addition to reviewing the variants and treatments, also carefully look at the suggested guardrail metrics. For example, certain feature interactions within the code base can lead to undesirable service incidents.
  15. A/A Test: before running actual experiments we should always attempt to run an A/A test. An A/A test simply collects metrics from the randomized samples and verifies that the two or more groups are not biased in any way. During this period users experience no actual variation, so they will not notice any changes. Engineers may need to confirm whether an A/A test can be run on the desired experiment design configuration. Common problems A/A tests help with are:
    • Sample Ratio Mismatch (SRM): the A/A test can serve as a preliminary check that group sample sizes are equivalent. Without equally sized samples, statistical comparisons cannot provide meaningful conclusions, because some factor is likely causing the mismatch (see the sketch after this list).
    • Detecting Group Biases: some random samples from the population may be biased due to the nature of our service. For example, some users may be “power users”, such as a seller using multiple devices (and likely multiple employees) to interact with buyers. These power users likely do not represent the average Mercari user, and if they happen to be included in a sample they will likely skew the metrics for their group.
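
    One common way to check for Sample Ratio Mismatch during an A/A test is a chi-squared goodness-of-fit test on the observed group sizes. A minimal sketch with assumed counts (not real data, and not necessarily the platform's built-in check):

      # Sketch: SRM check for an A/A test with an expected 50/50 split.
      from scipy import stats

      observed = [50_421, 49_130]           # users actually assigned to groups A1 and A2 (assumed counts)
      total = sum(observed)
      expected = [total / 2, total / 2]     # what a perfect 50/50 split would produce

      chi2, p_value = stats.chisquare(observed, f_exp=expected)
      print(f"chi2 = {chi2:.2f}, p = {p_value:.6f}")

      # A very small p-value (p < 0.001 is a commonly used SRM threshold) suggests the
      # assignment mechanism is biased and the experiment results should not be trusted.
      if p_value < 0.001:
          print("Possible SRM: investigate the assignment mechanism before trusting results")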
  16. Experiment and Treatment Assignment: clients apply treatments based on the parameter and value configuration. When the user experiences the treatment, the client logs an event containing the treatment and experiment ids.
    { "experimentId": "xxx", "treatmentId": "xxx", "parameter": "first-lister-tutorial", "value": "video" },
    { "experimentId": "xxx", "treatmentId": "xxx", "parameter": "first-lister-tutorial-video-path", "value": "/content/tutorial.mp4" }
    Why don’t we base the treatment off of the treatment id? Because it would lock features to a single experiment. With parameters, many experiments can reuse the same client configuration.
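
    A sketch of how a client might apply treatments from such parameter/value assignments. The JSON shape mirrors the slide; the surrounding code, showTutorialVideo, and logExposure are hypothetical stand-ins, not the real client library API.

      assignments = [
          {"experimentId": "xxx", "treatmentId": "xxx",
           "parameter": "first-lister-tutorial", "value": "video"},
          {"experimentId": "xxx", "treatmentId": "xxx",
           "parameter": "first-lister-tutorial-video-path", "value": "/content/tutorial.mp4"},
      ]

      # Index by parameter: features key off parameter names, never treatment ids,
      # so the same client code can serve many experiments over time.
      by_parameter = {a["parameter"]: a for a in assignments}

      tutorial = by_parameter.get("first-lister-tutorial")
      if tutorial is not None and tutorial["value"] == "video":
          path = by_parameter["first-lister-tutorial-video-path"]["value"]
          # showTutorialVideo(path)                 # hypothetical UI call
          # logExposure(tutorial["experimentId"],   # log the ids only when the user
          #             tutorial["treatmentId"])    # actually experiences the treatment
          print("show video tutorial from", path)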
  17. How Populations Are Distributed to Experiments
    • Traffic/requests are partitioned by a tree of alternating Domains and Layers
    • Each Layer owns a set of Parameters
    • We create Experiments on Layers
    • Experiments are limited to their parent Layer’s parameters
    • At each step (Domain, Layer, Experiment treatments) users are randomly distributed to buckets (a bucketing sketch follows the example subtree below)
    Example subtree:
    • 100% Domain
      ◦ Search Layer
        ▪ 40% Improvements Domain
          • Improvement “A” Layer
            ◦ Experiment E: keyword suggestions
          • Improvement “B” Layer
            ◦ Experiment F: category suggestions
        ▪ 40% Alternate Algorithms Domain
          • Algorithm Trial “X” Layer
            ◦ Experiment 1
          • Algorithm Trial “Y” Layer
            ◦ Experiment 2: index optimization trial
          • Algorithm Trial “Z” Layer
            ◦ Experiment 3: LLM reword trial
        ▪ 20% Holdout Domain
      ◦ …Other Layers…
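
    A common way to implement this kind of deterministic bucketing is to hash the randomization unit together with a per-domain or per-layer salt. This is a hedged sketch of the general layered-experiments approach, not necessarily the platform's exact algorithm; the salts and percentages below are taken from the example subtree.

      import hashlib

      def bucket(unit_id: str, salt: str, buckets: int = 100) -> int:
          # Hash the randomization unit with a salt to get a stable bucket in [0, buckets).
          digest = hashlib.sha256(f"{salt}:{unit_id}".encode()).hexdigest()
          return int(digest, 16) % buckets

      user_id = "user-12345"

      # Domain split inside the Search Layer: 40% / 40% / 20%, as in the example subtree.
      b = bucket(user_id, salt="search-layer")
      if b < 40:
          domain = "improvements"
      elif b < 80:
          domain = "alternate-algorithms"
      else:
          domain = "holdout"

      # Within the chosen domain, a different salt assigns the user to a treatment,
      # keeping assignments in different layers statistically independent.
      treatment = "treatment" if bucket(user_id, salt=f"{domain}:experiment-e") < 50 else "control"
      print(domain, treatment)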
  18. Randomization Unit and Targeting: each user can be randomized either by their User ID or by a random UUID with an undetermined lifespan. UUIDs apply to all users, including logged-out or unregistered users. Experiments must be targeted at specific advertised client properties. Example: platform == android, version >= 9000
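
    A minimal sketch of evaluating a targeting rule like the example above against the client's advertised properties. The rule and property names come from the slide; the evaluator itself is hypothetical.

      def is_targeted(props: dict) -> bool:
          # Example rule from the slide: platform == android, version >= 9000
          return props.get("platform") == "android" and props.get("version", 0) >= 9000

      print(is_targeted({"platform": "android", "version": 9123}))  # True: eligible for assignment
      print(is_targeted({"platform": "ios", "version": 9500}))      # False: keeps the default experience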
  19. User Overrides 🔨 You can use user overrides to:
    • Force a user or UUID to receive a particular treatment’s parameter assignments
    • Force a user out of all randomization. These are called exclusive overrides.
      ◦ In this case you can define the exact set of parameter assignments (if any) the user should receive instead.
    Note that override assignments do not receive an experiment or treatment id, and will therefore likely not appear in event logs.
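
    A rough sketch of how override resolution could sit in front of normal randomization, under the assumed data shapes below (the override map and function names are hypothetical, not the platform's real structures).

      # Hypothetical override map: forced parameter assignments keyed by user/uuid.
      overrides = {
          "user-42": {"exclusive": False, "params": {"first-lister-tutorial": "video"}},
          "user-99": {"exclusive": True,  "params": {}},  # forced out of all randomization
      }

      def randomize(user_id: str) -> dict:
          return {"first-lister-tutorial": "none"}  # placeholder for normal layered randomization

      def resolve_assignments(user_id: str) -> dict:
          override = overrides.get(user_id)
          if override is not None:
              # Overridden assignments carry no experimentId/treatmentId,
              # so they will generally not appear in exposure event logs.
              return override["params"]
          return randomize(user_id)

      print(resolve_assignments("user-42"), resolve_assignments("user-1"))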
  20. Client Libraries: Defaults 📚 For all uses of feature flags and experiments, we must be careful to always have a valid default. In other words, our clients and services should still function even if the experimentation platform API is unresponsive.
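
    A small sketch of the default-fallback pattern described above. fetch_assignments is a hypothetical stand-in for the real client library call; here it simulates an unresponsive API so the fallback path is exercised.

      def fetch_assignments(timeout_seconds: float) -> dict:
          raise TimeoutError("simulating an unresponsive experimentation API")

      def get_parameter(name: str, default: str) -> str:
          try:
              assignments = fetch_assignments(timeout_seconds=1.0)
          except Exception:
              return default  # API down or slow: behave like the default/control experience
          return assignments.get(name, default)

      # Falls back to "none", so the feature simply stays off instead of breaking the client.
      print(get_parameter("first-lister-tutorial", default="none"))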
  21. Client Libraries: Caching 📚 Caching behavior:
    • Web: values will be cached to localStorage. On repeated reloads, cached values will be used if the network request fails.
    • iOS / Android: successfully fetched parameter values will be saved to disk. If the user restarts the app and the network request for updated experiments fails or times out, the cached values are used.
    Performance considerations: to ensure a responsive experience for the user, some trade-offs have been made. One specific trade-off is short timeouts for network requests during app startup. The experimentation platform API requests are part of these startup requests, so request timeouts may occur more often than assumed.
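
    A sketch of the fallback chain this implies: a short-timeout network fetch, then the on-disk cache, then hard-coded defaults. fetch_assignments and the cache location are hypothetical, not the real client library internals.

      import json, os, tempfile

      CACHE_PATH = os.path.join(tempfile.gettempdir(), "experiment_cache.json")

      def fetch_assignments(timeout_seconds: float) -> dict:
          raise TimeoutError("simulating a startup request that timed out")

      def load_parameters() -> dict:
          try:
              params = fetch_assignments(timeout_seconds=0.5)  # deliberately short at startup
              with open(CACHE_PATH, "w") as f:
                  json.dump(params, f)                          # persist for the next launch
              return params
          except Exception:
              if os.path.exists(CACHE_PATH):
                  with open(CACHE_PATH) as f:                   # reuse the last successful fetch
                      return json.load(f)
              return {}                                         # nothing cached yet: defaults apply

      print(load_parameters().get("first-lister-tutorial", "none"))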
  22. Event Logging
    • For Web, iOS, and Android, the client library already provides a logging facility.
    • Even with the libraries, we all need to consider the appropriate logging timing based on the intended experiment configuration.
    • Logging has a cost 💰
      ◦ Carefully consider which logs are necessary.
      ◦ Carefully consider whether there are failure modes that necessitate error logs.
  23. Importance of Logging Timing: the timing of logs plays a very important role in how we can interpret the results of an experiment. We must be careful to log events when the experiment participant actually experiences the treatment. Example: suppose we have an idea for a special feature that is only activated when the concentration of sold-out items displayed in the search results reaches a particular threshold. When this condition is met, the feature performs an additional related-items search query and interleaves the secondary query's results with the original results. The secondary query is performed by the client, so to keep performance equivalent both groups (control and treatment) will perform the query, but we will only display the interleaved results for the treatment group.
  24. Importance of Logging Timing: the client has pseudocode similar to the following:
    fetchSearchResults():
      searchResults = requestSearchAPI()
      performAdditionalQuery = getExperimentConfigParameter(“search-related-items-when-sold-out”)
      if highSoldOutItemConcentration() and (performAdditionalQuery == “query-only” or performAdditionalQuery == “query-and-interleave”)
        fetchRelatedItems()
        if performAdditionalQuery == “query-and-interleave”
          interleaveResults(searchResults) // mutates the searchResults to include the related items
      showResults(searchResults)
    ❓ Question: At what point should the client log that the user was exposed to the treatment?
    1. after getExperimentConfigParameter(“search-related-items-when-sold-out”) returns a value
    2. after fetchRelatedItems succeeds
    3. after interleaveResults
    4. inside showResults
  25. Importance of Logging Timing
    fetchSearchResults():
      searchResults = requestSearchAPI()
      performAdditionalQuery = getExperimentConfigParameter(“search-related-items-when-sold-out”)
      if highSoldOutItemConcentration() and (performAdditionalQuery == “query-only” or performAdditionalQuery == “query-and-interleave”)
        fetchRelatedItems()
        if performAdditionalQuery == “query-and-interleave”
          interleaveResults(searchResults) // mutates the searchResults to include the related items
      showResults(searchResults)
    Question: At what point should the client log that the user was exposed to the treatment?
    1. after getExperimentConfigParameter(“search-related-items-when-sold-out”) returns a value
    2. after fetchRelatedItems succeeds
    3. after interleaveResults
    4. inside showResults
    For this experiment setup we should log when the results are actually visible to the user. If we do not, any operation before the results become visible may cause us to log the treatment prematurely. Suppose interleaveResults has a bug that crashes the app, or fetchRelatedItems fails without interrupting the render call: our logging would then likely be incorrect. Such bugs will affect our metrics collection and likely lead to incorrect decisions later. Note: actual logs must also include the received experiment id and treatment id. Full Spec.
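
    As one possible concrete placement, here is a runnable Python sketch of the same flow with the exposure log emitted only after the results are rendered. The helper names mirror the slide's pseudocode; getExperimentConfig, logExposure, and all stub return values are assumptions, not the real client APIs, and it assumes exposure should be logged for both groups once the triggering condition is met.

      # Stubs standing in for the slide's helpers (assumed behavior, for illustration only).
      def requestSearchAPI(): return ["item-1 (sold out)", "item-2 (sold out)", "item-3"]
      def highSoldOutItemConcentration(): return True
      def fetchRelatedItems(): return ["related-1", "related-2"]
      def interleaveResults(results, related): results.extend(related)  # simplified "interleave"
      def showResults(results): print("showing:", results)
      def logExposure(experiment_id, treatment_id): print("exposure:", experiment_id, treatment_id)
      def getExperimentConfig(parameter):
          return {"value": "query-and-interleave", "experimentId": "exp-1", "treatmentId": "t-2"}

      def fetchSearchResults():
          searchResults = requestSearchAPI()
          config = getExperimentConfig("search-related-items-when-sold-out")
          variant = config["value"]  # "query-only" (control) or "query-and-interleave" (treatment)

          exposed = False
          if highSoldOutItemConcentration() and variant in ("query-only", "query-and-interleave"):
              related = fetchRelatedItems()
              if variant == "query-and-interleave":
                  interleaveResults(searchResults, related)  # mutates searchResults
              exposed = True

          showResults(searchResults)
          # Log only after rendering: a crash in interleaveResults or a failure in
          # fetchRelatedItems can no longer produce a premature exposure log.
          if exposed:
              logExposure(config["experimentId"], config["treatmentId"])

      fetchSearchResults()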
  26. Confirming Client Behavior: other than manually testing client behavior, you may sometimes want to debug issues and confirm whether a particular user is actually receiving the treatment assignment you expect. There are two log sources you can use to examine the applied assignments:
    - Event logs
      - depend on the client implementation correctly sending event logs to Laplace
      - integrations exist for iOS, Android, and Web
      - usually mean the client applied the parameter assignment
    - Experiment API access logs
      - capture each client API request
      - do not mean the client applied the parameter assignment, but do indicate the request and response were sent