Improving Search @scale with efficient query experimentation @BerlinBuzzwords 2024

IMPROVING SEARCH @scale with efficient query experimentation
From relevance guesswork to granular search experimentation

I’m ANDY, Founder & CTO @searchhub.io
Deep passion for improving Product Discovery (Search & Recs) for over 2 decades.

WHAT IS searchhub.io? IT’S THE AUTOPILOT FOR SEARCH
It consistently and incrementally improves your search, driven by data and experimentation, and works with any search engine.
We currently monitor & enhance approximately 60 billion searches, generating around €3.5 billion in annual revenue for our clients.
Visit searchhub.io

WARNING
Today, I’m not going to talk about Gen-AI, Vector-Search or LLMs.

INSTEAD I’M GOING TO TALK ABOUT
01 Improving SEARCH over time: framing the problem
02 SEARCH Experimentation / Testing
03 Metrics to choose
04 Sketching out a potential solution
05 Obstacles along the way
06 Is it worth it?

01 WHAT DOES IT MEAN TO IMPROVE SEARCH @scale?

WHAT PAPERS TELL YOU
By plugging in technology/approach (X) you’ll see an improvement of (~Y)%.
It’s pretty easy to improve search in a snapshot scenario (fixed parameters) for a single point in time.

WHAT REALITY TEACHES YOU
❏ Indexed documents change over time
❏ Incoming queries change over time
❏ Configurations change over time (Synonyms, Boostings, Rewrites)
❏ Features change over time
You face different VERSIONS of your SEARCH over time.
[Diagram: successive versions of CONFIG, INDEX and QUERIES along a TIME axis]

DO CHANGES IMPROVE OR DECREASE SEARCH QUALITY OVER TIME?
[Figure labels: Local Optimum, Global Optimum]

WHEN REALITY HITS YOU
T0: syn: laptop ~ notebook. Query "laptops" returns results of product-type: laptop
T1: new assortment. Query "laptops" returns results of product-type: laptop OR product-type: sketchbook

WHEN REALITY HITS YOU
T0: redirect: sneakers. Query "sneakers" finds …/shoes/
T1: improved retrieval. Query "sneakers" [result shown only as an image on the slide]

WHEN REALITY HITS YOU
T0: keyword search. Query "dark chair with velvet cover without rotation function"
T1: vector search. Query "dark chair with velvet cover without rotation function"

THOSE WHO CAN NOT LEARN FROM HISTORY ARE DOOMED TO
REPEAT IT. ― GEORGE SANTAYANA

02 SEARCH EXPERIMENTATION
Measuring the impact of changes made to the retrieval system over time.

WHAT THE MARKET TELLS YOU
You’ll be able to incrementally improve your search by plugging in an Experimentation / Multivariate Testing Stack (X).
Really? Let’s think about this for a bit.

WHAT REALITY TEACHES YOU: USERS OR SESSIONS
What is our randomization unit?
Almost any experimentation system out there tries to randomize on users (50% / 50% split) with a session fallback. But how can we judge query-based changes on a user or session level?

WHAT REALITY TEACHES YOU: USERS/SESSIONS vs. QUERIES
Are the underlying distributions normally distributed?
Users and Sessions are, if the randomization function is properly designed.
Queries are not, as their likelihood follows a Zipfian distribution.

WHAT'S THE PROBLEM?
If we split our USERS/SESSIONS perfectly into 2 groups, how many queries could we experiment with?
We lose at least 60% of the initial opportunity.

WHAT WE LEARN
Can we guarantee variant equality (no SRM, i.e. no sample ratio mismatch)?
For Users and Sessions: YES
For Queries: NO
For changes with near-global impact, classic User/Session-based splitting works quite well. For query-dependent changes, we CANNOT rely on it.
One way around this is query-based splitting: same randomization method, but a different unit.
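A minimal sketch of "same randomization method, different unit", assuming a hash-based assignment. The function name `assign_variant`, the experiment names, the ids and the 50/50 split are illustrative assumptions and not taken from the talk or from searchhub's implementation.

```python
import hashlib


def assign_variant(unit_id: str, experiment: str,
                   variants=("control", "treatment")) -> str:
    """Deterministically map a randomization unit to a variant.

    unit_id is a user or session id for near-global changes, or the
    normalized query string for query-dependent changes (assumption).
    """
    # Salting with the experiment name decorrelates assignments across experiments.
    digest = hashlib.sha256(f"{experiment}:{unit_id}".encode("utf-8")).hexdigest()
    bucket = int(digest[:8], 16) % len(variants)
    return variants[bucket]


# User-based splitting: all searches of this (hypothetical) user share one variant.
user_variant = assign_variant("user-4711", "exp-global-ranking")

# Query-based splitting: the query itself is the unit, independent of the user.
query_variant = assign_variant("laptop", "exp-syn-laptop-notebook")
```

The only thing that changes between the two calls is what is passed as `unit_id`; the hashing and bucketing stay the same.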

DEPENDING ON THE TYPE OF CHANGE, WE NEED TO CHOOSE
THE RIGHT RANDOMIZATION UNIT

03 METRICS TO USE
Which metrics could significantly improve search over time?

WHAT MANAGERS WANT
"Show me significant uplifts in North Star Metrics like CR AND/OR ARPU."
Problems with North Star Metrics:
1) low sensitivity of the north star metric
2) differences between the short-term and long-term impact on the north star metric
Reference: Pareto Optimal Proxy Metrics

WHAT IS LEAST FRAGILE
Showing significant uplift in positive direct proxy metrics is a lot easier and often less error-prone.
SEARCH-PATH / MICRO-SESSIONS to the rescue: assign events (clicks, carts, ...) to triggers (queries) probabilistically.
KPIs with SEARCH-PATH scope:
CTR ~ Click-through-rate
ATBR ~ Add-to-Basket-rate
ABRPQ ~ Avg-added-Basket-Revenue-per-Query
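As a rough illustration, the three search-path KPIs could be computed as below. This sketch assumes the attribution step has already happened, so each search carries fractional weights for the events assigned to it. The `SearchPath` type, its field names and the per-search normalization are assumptions and not definitions from the talk.

```python
from dataclasses import dataclass


@dataclass
class SearchPath:
    query: str
    clicked: float          # attributed click weight in [0, 1]
    added_to_basket: float  # attributed add-to-basket weight in [0, 1]
    basket_revenue: float   # basket revenue attributed to this search


def search_path_kpis(paths: list[SearchPath]) -> dict[str, float]:
    """Average the attributed events over all searches (search-path scope)."""
    n = len(paths)
    if n == 0:
        raise ValueError("at least one search path is required")
    return {
        "CTR": sum(p.clicked for p in paths) / n,
        "ATBR": sum(p.added_to_basket for p in paths) / n,
        "ABRPQ": sum(p.basket_revenue for p in paths) / n,
    }


# Made-up example data: four searches for the same query.
paths = [
    SearchPath("laptop", 1.0, 1.0, 100.0),
    SearchPath("laptop", 1.0, 0.0, 0.0),
    SearchPath("laptop", 0.0, 0.0, 0.0),
    SearchPath("laptop", 0.5, 0.5, 20.0),  # event shared with another trigger
]
kpis = search_path_kpis(paths)
```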

IT’S BETTER TO HAVE A CLEAN DIRECT PROXY-METRIC THAN A
FUZZY NORTH STAR METRIC

04 SKETCHING OUT A POTENTIAL SOLUTION
Measuring the impact of changes made to the retrieval system over time.

BUILDING-BLOCKS for Search-Experimentation
Identifying Experiments: Which changes and metrics will be selected for an experiment?
Experiment Design: How will the design of the experiment setup look?
Experimentation Platform: How will the experiments be managed and evaluated?

BUILDING-BLOCKS: Identifying Experiments
Which changes and metrics will be selected for an experiment?
When dealing with millions of queries and thousands of alterations, a (semi-)automatic system for identifying experiments is essential.
Identification needs to be supported by observations of how results and user behavior evolve. Additionally, these observations help distinguish between global shifts and query-specific variations.
For efficiency, it's also crucial to filter identified experiments.

BUILDING-BLOCKS: Experiment Design
How will the design of the experiment setup look?
Choose the right type of randomization unit based on the classification of global changes vs. query-dependent changes.
Use expected data and prior data to initialize your experiments.
Reduce external influences and maintain minimal sparseness.
We use direct proxy metrics like CTR, ATBR, ABRPQ.

BUILDING-BLOCKS: Experimentation Platform
How will the experiments be managed and evaluated?
We need a system that automatically stores, starts, evaluates and ends experiments.
Ideally, we would employ various evaluation methods tailored to different types of metric distributions (such as binomial or non-binomial) and generate experiment results accordingly.

So we just have to PUT IT ALL TOGETHER, throw PROD-DATA at it and CELEBRATE.
WELL…

05 OBSTACLES ALONG THE WAY
Prepare for trouble.

SEARCH DATA IS SPARSE
Problem: It should be obvious, but it’s shocking how sparse even direct KPIs are, like Searches, CTR and ATBR.
Solutions:
❏ Aggregate by search-intent
❏ Find methods that provide valid experiment evaluations with less data

IMPROVING SPARSENESS (1)
Solution: Aggregate by search-intent. Depending on the coverage/quality, this can reduce the sparseness by around 10-33%.
[Chart: without aggregation vs. with aggregation; axes: share of total unique queries (%), share of total query frequency (%)]

IMPROVING SPARSENESS (2)
Solution: Find statistical methods that provide valid experiment evaluations with less data (Group Sequential Testing).
We have specifically tuned the Lan-DeMets spending functions with ML to maximize sample-size efficiency, i.e. to stop experiments as early as possible.
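For orientation, here is a sketch of the two textbook Lan-DeMets alpha-spending functions (O'Brien-Fleming type and Pocock type). It only shows how much of the total type-1 error budget may be spent at each interim look. It does not compute stopping boundaries, and it is not the ML-tuned variant mentioned above, whose details the slides do not give. The function names and the look schedule are assumptions.

```python
from math import e, log, sqrt
from statistics import NormalDist

_STD_NORMAL = NormalDist()


def obrien_fleming_spend(t: float, alpha: float = 0.05) -> float:
    """Cumulative alpha spent at information fraction t (O'Brien-Fleming type)."""
    if not 0.0 < t <= 1.0:
        raise ValueError("t must be in (0, 1]")
    z = _STD_NORMAL.inv_cdf(1.0 - alpha / 2.0)
    return 2.0 * (1.0 - _STD_NORMAL.cdf(z / sqrt(t)))


def pocock_spend(t: float, alpha: float = 0.05) -> float:
    """Cumulative alpha spent at information fraction t (Pocock type)."""
    if not 0.0 < t <= 1.0:
        raise ValueError("t must be in (0, 1]")
    return alpha * log(1.0 + (e - 1.0) * t)


# Four equally spaced interim looks (illustrative schedule).
for t in (0.25, 0.5, 0.75, 1.0):
    print(f"t={t:.2f}  OBF={obrien_fleming_spend(t):.5f}  Pocock={pocock_spend(t):.5f}")
```

Both functions spend exactly `alpha` at `t = 1`. The O'Brien-Fleming type is conservative early on, while the Pocock type allows earlier stopping at the cost of a stricter final look.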

BE AWARE OF STATISTICAL POWER
Problem: With small sample sizes, controlling statistical power (TYPE-2 Error) gets more important.
Solution: Do not use post-experiment power. Instead, model the error by simulation and adjust the p-value accordingly.
Reference: A/B Testing Intuition Busters

SEARCH DATA IS UNSTABLE
Problem: Search KPIs are very sensitive to trends and seasonality. This makes the data unstable (high variance).
Solutions: Cap experiment runtime (max 28 days) AND reduce variance via CUPED or other similar methods.
Reference: CUPED, Improving Sensitivity Of Controlled Experiments by Utilizing Pre-Experiment Data
Samples needed to detect a 5% uplift:
KPI     without CUPED    with CUPED    SAVINGS
CTR     ~25k             ~18k          ~30%
ATBR    ~73k             ~23k          ~69%
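The core CUPED adjustment can be sketched in a few lines: each unit's in-experiment metric y is corrected by its pre-experiment value x of the same metric. The function name and the toy data are assumptions. In practice theta would be estimated on the pooled data of all variants.

```python
from statistics import fmean, pvariance


def cuped_adjust(y: list[float], x: list[float]) -> list[float]:
    """Return y - theta * (x - mean(x)) with theta = cov(x, y) / var(x)."""
    if len(y) != len(x) or len(y) < 2:
        raise ValueError("y and x need the same length (>= 2)")
    mean_x, mean_y = fmean(x), fmean(y)
    cov_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / len(x)
    var_x = sum((a - mean_x) ** 2 for a in x) / len(x)
    if var_x == 0.0:
        return list(y)  # a constant covariate carries no information
    theta = cov_xy / var_x
    return [b - theta * (a - mean_x) for a, b in zip(x, y)]


# Toy data: pre-experiment CTR (x) and in-experiment CTR (y) per query.
pre = [0.10, 0.22, 0.31, 0.18, 0.40]
post = [0.12, 0.25, 0.30, 0.21, 0.43]
adjusted = cuped_adjust(post, pre)

# The mean is preserved while the variance shrinks.
variance_ratio = pvariance(adjusted) / pvariance(post)
```

The stronger the correlation between pre-experiment and in-experiment values, the larger the variance reduction and the fewer samples are needed for the same detectable uplift.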

SEARCH DATA IS IMBALANCED
Problem: Sample sizes are often very imbalanced, which increases minimum sample size and variance.
Solutions: Use the whole Search Funnel to guide and influence the user query distribution. We can use Query Suggestions and Query Recommendations to balance sample sizes by actively promoting the variants.
Query       Unique Searches
laptop      2475
notebook    4159
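One way to see why imbalance hurts: for a two-sample comparison with equal variances, the variance of the difference scales with 1/n_a + 1/n_b, so the "effective" per-variant sample size is the harmonic mean of the two counts. The helper name below is an assumption, and the numbers are the two query counts from the slide.

```python
def effective_per_arm_size(n_a: int, n_b: int) -> float:
    """Per-arm size of a balanced test with the same 1/n_a + 1/n_b."""
    if n_a <= 0 or n_b <= 0:
        raise ValueError("sample sizes must be positive")
    return 2.0 * n_a * n_b / (n_a + n_b)


laptop, notebook = 2475, 4159
effective = effective_per_arm_size(laptop, notebook)
balanced = (laptop + notebook) / 2
```

With the counts above, the imbalanced split behaves like a balanced test with about 3,103 searches per variant instead of the 3,317 that the same total traffic would give if it were split evenly.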

SEARCH DATA IS IMBALANCED
Query Suggestions: Promote the variant with too little traffic in your auto-suggestions to reduce the imbalance.
Query Recommendations: Promote the variant with too little traffic in your query recommendations (others searched for) to reduce the imbalance.

NOT EVERY EXPERIMENT MAKES SENSE
Problem: Even if data shows that an experiment has great potential, this could dramatically change over time.
Solution: Implement FAST-EXITS for cases like:
❏ Zero-Result Queries
❏ Marketing Campaigns
❏ Seasonal Changes

IS SIGNIFICANCE ALL WE CARE ABOUT?
Problem: Experimentation also involves managing risk.
Solution: If an observed effect isn't statistically significant, but the likelihood of it being better is substantial and the risk of being much worse is minimal, why not give it a try?
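One possible way to put numbers on "likelihood of being better" and "risk of being worse" is a Bayesian Beta-Binomial comparison with Monte Carlo sampling. This is a swapped-in illustration, not a method stated in the talk. The function name, the uniform Beta(1, 1) prior and the example counts are all assumptions.

```python
import random


def risk_summary(success_c: int, n_c: int, success_t: int, n_t: int,
                 draws: int = 20000, seed: int = 7) -> tuple[float, float]:
    """Return (P(treatment rate > control rate), expected loss of shipping treatment)."""
    rng = random.Random(seed)
    better = 0
    loss = 0.0
    for _ in range(draws):
        # Posterior draws under a uniform Beta(1, 1) prior.
        p_c = rng.betavariate(1 + success_c, 1 + n_c - success_c)
        p_t = rng.betavariate(1 + success_t, 1 + n_t - success_t)
        if p_t > p_c:
            better += 1
        else:
            loss += p_c - p_t  # rate given up if treatment is actually worse
    return better / draws, loss / draws


# Made-up counts: 120/1000 clicks in control vs. 140/1000 in treatment.
prob_better, expected_loss = risk_summary(120, 1000, 140, 1000)
```

A decision rule could then ship the treatment when `prob_better` is high and `expected_loss` stays below a tolerated threshold, even if a classical significance test has not been passed.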

06 IS IT WORTH IT?
Is incremental automated experimentation in SEARCH really worth it?

SOME FACTS from PROD
Last week we finished ~1900 experiments for our customers, with a success-rate of 8.51% and an avg. Treatment Effect of +16.1%.

WHY YOU SHOULD CARE: Leveraging the power of tiny gains
In the five months since its launch, Search Experimentation (Query Testing) has consistently boosted the overall weekly search KPIs by approximately 0.49% compared to the hold-out variant. While this may seem limited, it accumulates to an impressive overall enhancement of 9.87% over the entire period, without any signs of decline (or seasonal effects).

WORK IN PROGRESS: Communicating Experiment Results
Together with our customers, we are still figuring out the best way to communicate experimental results to maximize impact. Unfortunately, the interpretation of these results is not always straightforward:
❏ Contradicting KPIs (CTR improves, ATBR does not)
❏ Effect not significant, but close, with very high potential (Risk Management)


NOW GO OUT AND MAKE SEARCH EXPERIMENTATION WORK FOR YOUR
ORGANIZATION AS WELL

THANK YOU! Questions? Talk to me and stalk me here.
CREDITS: Presentation Template: SlidesMania. Icons: Flaticon.


More Decks by Andreas Wagner
