Session Abstract
From relevance guesswork to granular search experimentation at scale. Evaluate modifications to your search, such as incorporating synonyms, adjusting field boosts, adding vector search, or changing the product assortment, on real data rather than relying solely on intuition.
Session Description
Measuring the effect of changes to search functionality, whether major or minor, remains a persistent challenge. Even experienced teams often rely on search relevance feedback labels alone to gauge outcomes. And while web-scale experimentation such as A/B or multivariate testing is widely practiced, only a few companies use it to improve their search systems incrementally at the query level.

A/B testing for search comes with its own complexities, and there is no one-size-fits-all approach. Challenges include selecting appropriate evaluation metrics, choosing a suitable randomization unit, and drawing precise decision boundaries from imbalanced and sparse experimental data, especially when few observations are available.

Over time, we have developed, refined, and operated our "Query Testing" capability, and I aim to share the insights gleaned from this journey.
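To make the notion of a query-level randomization unit concrete, here is a minimal sketch (the names assign_variant, experiment_id, and the variant labels are illustrative assumptions, not the setup described in this session): the normalized query string is hashed deterministically into a bucket, so every user issuing the same query sees the same variant and the query itself becomes the unit of randomization.

```python
import hashlib

def assign_variant(query: str, experiment_id: str,
                   variants=("control", "treatment")) -> str:
    """Deterministically map a normalized query to an experiment variant.

    Using the query as the randomization unit keeps the treatment
    consistent for a given query across all users, at the cost of
    correlated observations within that query -- one of the trade-offs
    when choosing a randomization unit.
    """
    key = f"{experiment_id}:{query.strip().lower()}"
    bucket = int(hashlib.sha256(key.encode("utf-8")).hexdigest(), 16) % len(variants)
    return variants[bucket]

# The same query always lands in the same bucket for a given experiment:
print(assign_variant("running shoes", "synonym-test"))
print(assign_variant("running shoes", "synonym-test"))  # identical result
```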