Improving Search @scale with efficient query experimentation @BerlinBuzzwords 2024

Session Abstract
From relevance guesswork to granular search experimentation at scale. Evaluate modifications to your search, such as incorporating synonyms, adjusting field-boosts, adding vector search, or product assortment modifications, on real data rather than relying solely on intuition.

Session Description
Measuring the effectiveness of changes in search functionality, whether significant or minor, remains a persistent challenge. Even experienced teams often rely on search relevance feedback labels to gauge outcomes. While Web-scale Experimentation, such as A/B or Multivariate Testing, is widely practiced, only a few companies utilize it to enhance their search systems incrementally on a query-level basis. A/B Testing in the realm of search presents unique complexities. It involves various considerations and lacks a universal approach. Challenges include selecting appropriate metrics for evaluation, determining suitable randomization units, and addressing imbalanced and sparse experimental data to establish precise decision boundaries, especially with limited data availability. Over time, we've developed, refined, and operated our "Query Testing" capability. I aim to share insights gleaned from this journey.

Andreas Wagner

June 14, 2024


  1. I’m ANDY. Founder & CTO @searchhub.io. Deep passion for improving Product Discovery (Search & Recs) for over 2 decades.
  2. WHAT IS searchhub.io? IT’S THE AUTOPILOT FOR SEARCH. It consistently and incrementally improves your search, driven by data and experimentation, and works with any search engine. We currently monitor and enhance approximately 60 billion searches, generating around €3.5 billion in annual revenue for our clients. Visit searchhub.io.

  3. 01 […] time: framing the problem. 02 SEARCH Experimentation / Testing. 03 Metrics to choose. 04 Sketching out a potential solution. 05 Obstacles along the way. 06 Is it worth it?
  4. WHAT PAPERS TELL YOU: "By plugging in technology/approach (X) you’ll see an improvement of (~Y)%." It’s pretty easy to improve search in a snapshot scenario (fixed parameters) for a single point in time.
  5. WHAT REALITY TEACHES YOU ❏ Indexed documents change over time ❏ Incoming queries change over time ❏ Configurations change over time (synonyms, boostings, rewrites) ❏ Features change over time. You face different VERSIONS of your SEARCH over time. [Diagram: successive CONFIG / INDEX / QUERIES versions along a TIME axis]
  6. WHEN REALITY HITS YOU. T0 (syn: laptop ~ notebook): results of "laptops" = product-type: laptop. T1 (new assortment): results of "laptops" = product-type: laptop OR product-type: sketchbook.
  7. WHEN REALITY HITS YOU. T0: keyword search. T1: vector search. Same query in both cases: "dark chair with velvet cover without rotation function".

  9. SEARCH EXPERIMENTATION: Measuring the impact of changes made to the retrieval system over time.
  10. WHAT THE MARKET TELLS YOU: "You’ll be able to incrementally improve your search by plugging in an Experimentation / Multivariate Testing Stack (X)." Really? Let’s think about this for a bit.

  11. […] randomization unit? (50% / 50% split.) Almost any experimentation system out there tries to randomize on users, with a session fallback. But how can we judge query-based changes on a user or session level?

  12. […] distributions: normally distributed? Users and sessions are, if the randomization function is properly designed. Queries are not, as their likelihood follows a Zipfian distribution.
  13. WHAT'S THE PROBLEM? If we split our USERS/SESSIONS perfectly into 2 groups, how many queries could we experiment with? We lose at least 60% of the initial opportunity.
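
To get a feel for why a clean 50/50 split still leaves few queries to experiment with, here is a minimal simulation sketch. Everything in it is an assumption for illustration: a Zipf exponent of 1, the vocabulary and traffic sizes, the minimum of 100 searches per arm, and the simplification that every search is assigned to an arm independently. It does not reproduce the 60% figure from the slide.

```python
import random
from collections import Counter

def testable_share(n_queries=10_000, n_searches=100_000, min_per_arm=100, seed=7):
    """Simulate Zipf-like query traffic, split it 50/50 at random, and report
    which share of distinct queries (and of traffic) reaches min_per_arm
    searches in BOTH arms."""
    rng = random.Random(seed)
    ranks = range(1, n_queries + 1)
    weights = [1.0 / r for r in ranks]  # assumed Zipf exponent of 1
    searches = rng.choices(ranks, weights=weights, k=n_searches)

    arm_a, arm_b = Counter(), Counter()
    for query in searches:
        # Simplification: each search is assigned independently.
        (arm_a if rng.random() < 0.5 else arm_b)[query] += 1

    seen = set(arm_a) | set(arm_b)
    testable = [q for q in seen
                if arm_a[q] >= min_per_arm and arm_b[q] >= min_per_arm]
    query_share = len(testable) / len(seen)
    traffic_share = sum(arm_a[q] + arm_b[q] for q in testable) / n_searches
    return query_share, traffic_share

query_share, traffic_share = testable_share()
print(f"testable queries: {query_share:.1%}, traffic covered: {traffic_share:.1%}")
```

Under these assumptions only a small head of the query distribution collects enough data in both arms, while the long tail stays untestable.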
  14. WHAT WE LEARN: Can we guarantee variant equality (no SRM, i.e. no sample ratio mismatch)? For users and sessions: YES. For queries: NO. For changes with near-global impact, classic user/session-based splitting works quite well. For query-dependent changes, we CANNOT guarantee it. One way around this is query-based splitting: same randomization method, but a different unit.
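
"Same randomization method but different unit" can be sketched as a deterministic hash over whatever unit is chosen, plus a chi-square check for sample ratio mismatch. This is a minimal sketch, not the system described in the talk: `assign_variant`, `srm_p_value`, the experiment name and the way a unit key is built are all hypothetical.

```python
import hashlib
import math

def assign_variant(unit_id: str, experiment: str, n_variants: int = 2) -> int:
    """Deterministically map a randomization unit to a variant bucket.
    unit_id may be a user id, a session id, or a query-scoped key."""
    digest = hashlib.sha256(f"{experiment}:{unit_id}".encode("utf-8")).hexdigest()
    return int(digest, 16) % n_variants

def srm_p_value(count_a: int, count_b: int) -> float:
    """Chi-square goodness-of-fit test against an expected 50/50 split.
    For 1 degree of freedom the survival function is erfc(sqrt(chi2 / 2))."""
    expected = (count_a + count_b) / 2
    chi2 = ((count_a - expected) ** 2 + (count_b - expected) ** 2) / expected
    return math.erfc(math.sqrt(chi2 / 2))

# Same function, different unit (ids are made up for illustration).
print(assign_variant("user-4711", "syn-laptop-notebook"))
print(assign_variant("query:laptop", "syn-laptop-notebook"))
print(srm_p_value(5000, 5000))   # perfectly balanced arms
print(srm_p_value(2475, 4159))   # strongly imbalanced arms
```

A very small p-value from `srm_p_value` signals that the observed split is unlikely under the intended 50/50 ratio, so the variants cannot be treated as equal.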
  15. WHAT MANAGERS WANT: "Show me significant uplifts in North Star Metrics like CR (conversion rate) AND/OR ARPU (average revenue per user)." Problems with North Star Metrics: 1) low sensitivity of the north star metric; 2) differences between the short-term and long-term impact on the north star metric. Pareto Optimal Proxy Metrics.
  16. WHAT IS LEAST FRAGILE: Showing significant uplift in positive direct proxy metrics is a lot easier and often less error-prone. SEARCH-PATH / MICRO-SESSIONS to the rescue: assign events (clicks, carts, ...) to triggers (queries) probabilistically. KPIs with SEARCH-PATH scope: CTR (click-through rate), ATBR (add-to-basket rate), ABRPQ (avg. added basket revenue per query).
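
The search-path KPIs can be sketched as follows. Note the deliberate simplification: the talk assigns events to triggers probabilistically, whereas this sketch attributes each click or cart event to the most recent query in the same session. The event format and the function name `search_path_kpis` are assumptions made for illustration.

```python
from collections import defaultdict

def search_path_kpis(events):
    """events: time-ordered (session_id, event_type, payload) tuples where
    event_type is 'query' (payload = query string), 'click' (payload unused)
    or 'cart' (payload = revenue added to the basket)."""
    stats = defaultdict(lambda: {"searches": 0, "clicked": 0, "carted": 0, "revenue": 0.0})
    open_path = {}  # session_id -> state of the currently open search path

    for session_id, event_type, payload in events:
        if event_type == "query":
            stats[payload]["searches"] += 1
            open_path[session_id] = {"query": payload, "clicked": False, "carted": False}
            continue
        path = open_path.get(session_id)
        if path is None:
            continue  # no preceding query: the event is not search-driven
        s = stats[path["query"]]
        if event_type == "click" and not path["clicked"]:
            path["clicked"] = True   # count each search at most once
            s["clicked"] += 1
        elif event_type == "cart":
            if not path["carted"]:
                path["carted"] = True
                s["carted"] += 1
            s["revenue"] += payload

    return {
        query: {
            "CTR": s["clicked"] / s["searches"],
            "ATBR": s["carted"] / s["searches"],
            "ABRPQ": s["revenue"] / s["searches"],
        }
        for query, s in stats.items()
    }

events = [
    ("s1", "query", "laptop"), ("s1", "click", None), ("s1", "cart", 500.0),
    ("s2", "query", "laptop"), ("s2", "click", None), ("s2", "click", None),
    ("s3", "cart", 20.0),      # no query in this session, so it is ignored
    ("s1", "query", "mouse"),
]
print(search_path_kpis(events))
```

Scoping the KPIs to the search path keeps unrelated events, such as the basket add in session `s3`, out of the query-level numbers.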
  17. SKETCHING OUT A POTENTIAL SOLUTION: Measuring the impact of changes made to the retrieval system over time.
  18. BUILDING-BLOCKS for Search-Experimentation. Identifying Experiments: Which changes and metrics will be selected for an experiment? Experiment Design: How will the design of the experiment setup look? Experimentation Platform: How will the experiments be managed and evaluated?
  19. BUILDING-BLOCKS: Identifying Experiments. Which changes and metrics will be selected for an experiment? When dealing with millions of queries and thousands of alterations, a (semi-)automatic system for identifying experiments is essential. Identification needs to be supported by observations of how results and user behavior evolve. Additionally, these observations help distinguish between global shifts and query-specific variations. For efficiency, it's also crucial to filter identified experiments.
  20. BUILDING-BLOCKS: Experiment Design. How will the design of the experiment setup look? Choose the right type of randomization unit based on the classification of global changes vs. query-dependent changes. Use expected data and prior data to initialize your experiments. Reduce external influences and keep sparseness to a minimum. We use direct proxy metrics like CTR, ATBR and ABRPQ.
  21. BUILDING-BLOCKS: Experimentation Platform. How will the experiments be managed and evaluated? We need a system that automatically stores, starts, evaluates and ends experiments. Ideally, we would employ various evaluation methods tailored to different types of metric distributions (such as binomial or non-binomial) and generate experiment results accordingly.
  22. so we just have to PUT IT ALL TOGETHER throw

  23. SEARCH DATA IS SPARSE. Problem: It should be obvious, but it’s shocking how sparse even direct KPIs like searches, CTR and ATBR are. Solutions: ❏ Aggregate by search-intent ❏ Find methods that provide valid experiment evaluations with less data.
  24. IMPROVING SPARSENESS - 1. Solution: Aggregate by search-intent. Depending on the coverage/quality, this can reduce the sparseness by around 10-33%. [Chart: share of total unique queries (%) and share of total query frequency (%); series: without aggregation, with aggregation]
  25. IMPROVING SPARSENESS - 2. Solution: Find statistical methods that provide valid experiment evaluations with less data (Group Sequential Testing). We have specifically tuned the Lan-DeMets (alpha-spending) functions with ML to maximize sample-size efficiency when stopping experiments early.
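
For readers unfamiliar with group sequential testing, the two textbook Lan-DeMets alpha-spending functions are sketched below. These are the standard O'Brien-Fleming-type and Pocock-type forms, not the ML-tuned variants mentioned in the talk, whose details are not given. The look schedule and the overall alpha of 0.05 are assumptions.

```python
import math
from statistics import NormalDist

_STD_NORMAL = NormalDist()

def obf_spending(t: float, alpha: float = 0.05) -> float:
    """O'Brien-Fleming-type spending: 2 - 2 * Phi(z_{alpha/2} / sqrt(t)).
    t is the information fraction in (0, 1]."""
    z = _STD_NORMAL.inv_cdf(1 - alpha / 2)
    return 2 * (1 - _STD_NORMAL.cdf(z / math.sqrt(t)))

def pocock_spending(t: float, alpha: float = 0.05) -> float:
    """Pocock-type spending: alpha * ln(1 + (e - 1) * t)."""
    return alpha * math.log(1 + (math.e - 1) * t)

# Cumulative type-1 error that may be "spent" at each interim look.
for t in (0.25, 0.5, 0.75, 1.0):
    print(f"t={t:.2f}  OBF={obf_spending(t):.5f}  Pocock={pocock_spending(t):.5f}")
```

Both functions spend exactly alpha at t = 1. The O'Brien-Fleming type is very conservative early on, while the Pocock type allows earlier stopping at the cost of a stricter final look. Turning spent alpha into actual stopping boundaries needs the joint distribution of the interim statistics, which this sketch leaves out.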
  26. BE AWARE OF STATISTICAL POWER. Problem: With small sample sizes, controlling statistical power (type-2 error) gets more important. Solution: Do not use post-experiment power; instead, model the error by simulation and adjust the p-value accordingly. Reference: A/B Testing Intuition Busters.
  27. SEARCH DATA IS UNSTABLE. Problem: Search KPIs are very sensitive to trends and seasonality, which makes the data unstable (high variance). Solutions: Cap experiment runtime (max 28 days) AND reduce variance via CUPED or other similar methods. Reference: CUPED, "Improving Sensitivity of Controlled Experiments by Utilizing Pre-Experiment Data".
      Samples needed to detect a 5% uplift:
      KPI    without CUPED   with CUPED   savings
      CTR    ~25k            ~18k         ~30%
      ATBR   ~73k            ~23k         ~69%
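
The core CUPED adjustment is short enough to sketch. Each unit's in-experiment metric y is corrected with its pre-experiment value x, using theta = cov(x, y) / var(x). The synthetic data and all parameter values below are assumptions chosen only to make the effect visible; they are unrelated to the CTR/ATBR figures in the slide.

```python
import random
from statistics import mean, variance

def cuped_adjust(y, x):
    """Return CUPED-adjusted values y_i - theta * (x_i - mean(x)).
    y: in-experiment metric per unit, x: same metric before the experiment."""
    mx, my = mean(x), mean(y)
    cov_xy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / (len(x) - 1)
    theta = cov_xy / variance(x)
    return [yi - theta * (xi - mx) for xi, yi in zip(x, y)]

# Synthetic, illustrative data: in-experiment values correlate with pre-period values.
rng = random.Random(3)
pre = [rng.gauss(0.30, 0.10) for _ in range(2000)]
post = [0.8 * p + rng.gauss(0.06, 0.05) for p in pre]

adjusted = cuped_adjust(post, pre)
reduction = 1 - variance(adjusted) / variance(post)
print(f"mean unchanged: {abs(mean(adjusted) - mean(post)) < 1e-9}")
print(f"variance reduction: {reduction:.0%}")
```

The adjustment leaves the mean untouched and removes the share of variance explained by the pre-experiment data. Since the required sample size scales with the variance, a lower variance means fewer samples are needed to detect the same uplift.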
  28. SEARCH DATA IS IMBALANCED. Problem: Sample sizes are often very imbalanced, which increases minimum sample size and variance. Solutions: Use the whole search funnel to guide and influence the user query distribution. We can use query suggestions and query recommendations to balance sample sizes by actively promoting the variant with too little traffic.
      Query      Unique Searches
      laptop     2475
      notebook   4159
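
A quick way to quantify what imbalance costs: the variance of a difference in means scales with 1/n_a + 1/n_b, so an imbalanced pair behaves like a balanced test with a smaller per-arm size. The sketch below assumes equal per-observation variance in both arms; `effective_per_arm` is a hypothetical helper name.

```python
def effective_per_arm(n_a: int, n_b: int) -> float:
    """Per-arm size of a balanced test with the same variance of the
    difference in means: solves 2 / n_eff = 1 / n_a + 1 / n_b."""
    return 2 * n_a * n_b / (n_a + n_b)

n_laptop, n_notebook = 2475, 4159          # unique searches from the slide
balanced = (n_laptop + n_notebook) / 2     # per-arm size if traffic were even
effective = effective_per_arm(n_laptop, n_notebook)
print(f"balanced per arm: {balanced:.0f}, effective per arm: {effective:.0f}")
print(f"relative loss: {1 - effective / balanced:.1%}")
```

Steering some traffic toward the smaller variant raises the effective sample size without needing more total searches.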
  29. SEARCH DATA IS IMBALANCED. Query Suggestions: Promote the variant with too little traffic in your auto-suggestions to reduce the imbalance. Query Recommendations: Promote the variant with too little traffic in your query recommendations ("others searched for") to reduce the imbalance.
  30. NOT EVERY EXPERIMENT MAKES SENSE. Problem: Even if data shows that an experiment has great potential, this could change dramatically over time. Solution: Implement FAST-EXITS for cases like: ❏ Zero-Result Queries ❏ Marketing Campaigns ❏ Seasonal Changes
  31. IS SIGNIFICANCE ALL WE CARE ABOUT? Problem: Experimentation also involves managing risk. Solution: If an observed effect isn't statistically significant, but the likelihood of it being better is substantial and the risk of being much worse is minimal, why not give it a try?
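
One common way to put numbers on "likelihood of being better" and "risk of being much worse" is a Bayesian comparison of two rates. The talk does not say which method is used, so the following is only a sketch under assumptions: independent Beta(1, 1) priors, Monte Carlo sampling, and made-up click counts.

```python
import random

def risk_summary(conv_a, n_a, conv_b, n_b, draws=20_000, seed=11):
    """Return (P(B better than A), expected loss when choosing B).
    Uses Beta posteriors under uniform Beta(1, 1) priors (an assumption)."""
    rng = random.Random(seed)
    wins, loss = 0, 0.0
    for _ in range(draws):
        rate_a = rng.betavariate(1 + conv_a, 1 + n_a - conv_a)
        rate_b = rng.betavariate(1 + conv_b, 1 + n_b - conv_b)
        wins += rate_b > rate_a
        loss += max(rate_a - rate_b, 0.0)  # shortfall in draws where B is worse
    return wins / draws, loss / draws

# Illustrative counts: control 300 clicks in 1000 searches, treatment 330 in 1000.
p_better, expected_loss = risk_summary(300, 1000, 330, 1000)
print(f"P(treatment better) = {p_better:.1%}, expected loss = {expected_loss:.4f}")
```

With these counts a classic two-sided test at the 5% level would not call the difference significant, yet the probability that the treatment is better is high and the expected loss from shipping it is small.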
  32. SOME FACTS from PROD: We finished ~1900 experiments last week for our customers, with a success rate of 8.51% and an avg. treatment effect of +16.1%.
  33. WHY YOU SHOULD CARE: Leveraging the power of tiny gains. In the five months since its launch, Search Experimentation (Query Testing) has consistently boosted the overall weekly search KPIs by approximately 0.49% compared to the hold-out variant. While this may seem limited, it accumulates to an impressive overall enhancement of 9.87% over the entire period, without any signs of decline or seasonal effects.
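
The "tiny gains" arithmetic can be checked with a short compounding sketch. The exact number of weeks and the unrounded weekly uplift are not given in the slide, so the week counts below are assumptions and the result only lands near the reported 9.87%.

```python
def compounded_uplift(weekly_uplift: float, weeks: int) -> float:
    """Total relative uplift after compounding a constant weekly uplift."""
    return (1 + weekly_uplift) ** weeks - 1

# Five months is roughly 19 to 22 weeks (assumed range).
for weeks in (19, 20, 21, 22):
    print(f"{weeks} weeks at 0.49%/week -> {compounded_uplift(0.0049, weeks):.2%}")
```

Small weekly improvements compound: a gain of about half a percent per week adds up to roughly ten percent over such a period.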
  34. WORK IN PROGRESS: Communicating Experiment Results. Together with our customers, we are still figuring out the best way to communicate experimental results to maximize impact. Unfortunately, the interpretation of these results is not always straightforward: ❏ Contradicting KPIs (CTR improves, ATBR does not) ❏ Effect not significant, but close, with very high potential (risk management)
  35. THANK YOU! Questions? Talk to me and stalk me here. CREDITS: Presentation Template: SlidesMania; Icons: Flaticon.