unsecured loans?”
  ◦ Doc: “Auto loans are secured against the car. "Signature" loans, from a bank that knows and trusts you, are typically unsecured. unsecured loans other than informal ones or these are fairly rare. Most lenders don't want to take the additional risk, or balance that risk with a high enough interest rate to make the unsecured loan unattractive.”
• Product Search
  ◦ Query: “headphones noise-cancelling”
  ◦ Doc: “bluetooth noise-cancelling headphones 60H playtime High Res Audio”

How (Lexical) (Product) Search Works
System design varies depending on the domain
Each query term acts as a filter condition
Retrieve the doc that satisfies the central aspect
Users expect the product to satisfy all the requirements
pertains to the topic of the question
  ◦ Merely relies on term overlap; it is not necessary for all the query terms to appear in the doc
• Product Search
  ◦ How well the item satisfies the given specifications
  ◦ Query terms are used to narrow down the search space
Relevance Definition
Substitute (A ∨ B)
Irrelevant ¬(A ∨ B)
Note: Each filter condition is not necessarily a single term
• “nike shoes” 👉 Exact
• “adidas shoes” 👉 Substitute
• “hair dryer” 👉 Irrelevant
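The labeling above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the "Exact" rule is not shown in this excerpt, so the sketch assumes it means that every filter condition holds, and it reads "Substitute" as "at least one condition holds, but not all". The `judge` function, the `brand` and `category` fields, and the parse of "nike shoes" into two conditions are all hypothetical.

```python
from typing import Callable, Dict, List

Product = Dict[str, str]
Condition = Callable[[Product], bool]


def judge(conditions: List[Condition], product: Product) -> str:
    """Label a product by how many filter conditions it satisfies."""
    satisfied = sum(1 for cond in conditions if cond(product))
    if satisfied == len(conditions):
        return "Exact"        # assumption: all conditions hold
    if satisfied > 0:
        return "Substitute"   # some, but not all, conditions hold
    return "Irrelevant"       # no condition holds


# Hypothetical parse of the query "nike shoes" into two filter conditions.
nike_shoes: List[Condition] = [
    lambda p: p.get("brand") == "nike",
    lambda p: p.get("category") == "shoes",
]

labels = [
    judge(nike_shoes, {"brand": "nike", "category": "shoes"}),
    judge(nike_shoes, {"brand": "adidas", "category": "shoes"}),
    judge(nike_shoes, {"brand": "acme", "category": "hair dryer"}),
]
```

Modeling each condition as a predicate, not a single token, follows the note that a filter condition is not necessarily a single term.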
woman vs. women, shoe vs. shoes
“headphones not wierless”: negation + spelling error
“christmas pjs for toddlers”: pjs = pajamas, toddlers = kids, children
“affordable laptops with good battery life”: subjective terms

Fragility to spelling variants
Linguistic Ambiguity
Lack of understanding of semantic relationships
Well-known typical issues; solutions do exist
for cats who may have difficulty climbing or jumping →
• Low Platforms
• Gradual Steps or Ramps
• Soft Surface
Hidden requirements don’t appear in the query text
is not differentiable
• Add more subcomponents
  ◦ How to measure the effectiveness of the subcomponents?
  ◦ Are their offline metrics correlated well with online metrics?
Why does Lexical Search become complex?
[Diagram: Objective, Synonym Dictionary, Lexical Search, ?]
The number of synonyms ∝ Revenue?
Search vs. Semantic Search?
• Q. How can we incorporate semantic matching signals?
  ◦ 👉 Integration of Semantic Search
• Q. Is Semantic Search cost-effective?
  ◦ 👉 Development of Semantic Search
Questions to explore
• Public benchmark results may not be entirely reliable
• Their experiments are not designed for your domain
• You need to verify the effectiveness yourself
Implementation?
Great             High   High   Low
3. Acceptable     Low    High   High
4. Unacceptable   High   Low    High
-                 Low    Low    Low

NDCG: High, Prec: Low → Unacceptable

We found that NDCG has a positive correlation with revenue, while Precision has a negative correlation with the number of complaints from users
at the lowest price
• Users complain when they see irrelevant items
Evaluating Product Search
• MRR is not appropriate
• Small k is insufficient (such as metric@10)
• We can’t rely solely on NDCG; Precision matters a lot
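To make the "NDCG high, Precision low" case concrete, here is a minimal sketch of both metrics. It is not the deck's evaluation code. It assumes graded relevance labels where a label of 1 or more counts as relevant, linear gain, an ideal ranking computed from the judged list itself, and Precision divided by the number of results actually returned. All function names are illustrative.

```python
import math
from typing import Sequence


def dcg_at_k(gains: Sequence[float], k: int) -> float:
    """Discounted cumulative gain with linear gain and a log2 discount."""
    return sum(g / math.log2(i + 2) for i, g in enumerate(gains[:k]))


def ndcg_at_k(gains: Sequence[float], k: int) -> float:
    """DCG normalized by the DCG of the same labels sorted best-first."""
    ideal = dcg_at_k(sorted(gains, reverse=True), k)
    return dcg_at_k(gains, k) / ideal if ideal > 0 else 0.0


def precision_at_k(gains: Sequence[float], k: int, threshold: float = 1) -> float:
    """Share of the top-k results whose label reaches the threshold."""
    top = gains[:k]
    if not top:
        return 0.0
    return sum(1 for g in top if g >= threshold) / len(top)


# One highly relevant item at rank 1, followed by nine irrelevant items.
labels = [3, 0, 0, 0, 0, 0, 0, 0, 0, 0]
ndcg = ndcg_at_k(labels, 10)
prec = precision_at_k(labels, 10)
```

With these labels the ranking is already ideal, so NDCG@10 is 1.0 while Precision@10 is only 0.1: nine of the ten results shown to the user are irrelevant, which is the situation the slide flags as unacceptable.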
[Diagram labels: Text Encoder (LM), Text Encoder (LM), Features, Features, Product]
How to sample pairs?
Which field to encode?
Which loss to use?
Thresholding?
Which pre-trained model to use?
Heuristic post-filters?
• Incorporating Additional Features: https://arxiv.org/abs/2306.04833
• Multi-Stage Training: https://www.amazon.science/publications/web-scale-semantic-product-search-with-large-language-models
  ◦ Users copy & paste product names
• User Survey:
  ◦ Q. When do you use the platform?
  ◦ A. When searching for a specific item to buy (66.4%)
This result is not surprising because
Query: “product attribute attribute …”
[Figure: items retrieved at k, labeled relevant, irrelevant, or unseen]
• Found in lexical search: {1,2,3,5,8} → 5 items
• Found in semantic search: {3,6,1,9,2,4,7} → 7 items
• Found in both: {1,3} → 2 items

NDCG is almost the same, but what is retrieved?

Method                            NDCG@100
Lexical (Combined + Relaxation)   0.5394
Semantic (CL)                     0.5328
Search@100
Products Retrieved from Semantic Search@100
Found in Both
• In terms of NDCG, they are “the same” but they retrieve different products
• How do they differ?
girls                                                                       0.0      0.89
accidental love by gary soto                                                0.0      1.0
baby boy first birthday cookie montser decorations                          0.0      0.8936
I want a long jacket that comes to mid leg in a dark colour and very warm   0.0      0.1673
(Lexical < Semantic)

Query                                                    NDCG (Lexical)   NDCG (Semantic)
“the first 90 days” (book title)                         0.6166           0.1934
“summit 470” (model name)                                0.8981           0.0
dell 40wh standard charger type m5y1k 14.8v (attributes) 1.0              0.2105
(Lexical > Semantic)
                                 Semantic Search   Number of test queries
Short Query             0.5403   0.4707            5878
Long Query              0.3262   0.5580            452
Contains Non-Alphabet   0.4267   0.4961            4096
Negation                0.4322   0.5195            1624
Parse Pattern           0.6184   0.6285            2250

Parse Pattern: Queries with some linguistic complexity, extracted using regular expressions
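A breakdown like this needs each test query tagged with one or more query types. The sketch below shows one way to do that with regular expressions. The deck does not give its definitions, so everything here is an assumption: the token thresholds for short and long queries, the negation word list, and the "non-alphabet" rule are hypothetical. The "Parse Pattern" expressions are not given in the source and are left out.

```python
import re
from typing import Set

# Hypothetical patterns; the deck does not list the ones it used.
NEGATION = re.compile(r"\b(?:not|no|without|non)\b", re.IGNORECASE)
NON_ALPHABET = re.compile(r"[^A-Za-z\s]")


def query_types(query: str, short_max_tokens: int = 3, long_min_tokens: int = 10) -> Set[str]:
    """Return every query type that applies; types are not mutually exclusive."""
    tokens = query.split()
    types: Set[str] = set()
    if len(tokens) <= short_max_tokens:
        types.add("Short Query")
    if len(tokens) >= long_min_tokens:
        types.add("Long Query")
    if NON_ALPHABET.search(query):
        types.add("Contains Non-Alphabet")
    if NEGATION.search(query):
        types.add("Negation")
    return types


model_name = query_types("summit 470")
negated = query_types("headphones not wierless")
long_query = query_types(
    "I want a long jacket that comes to mid leg in a dark colour and very warm"
)
```

Returning a set per query lets a single query count toward several types at once.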
  ◦ Multi-Modality
  ◦ Session information
  ◦ Buyer preferences
  ◦ ☝ There is no dataset with these features, so we can’t compare
• Two systems have different pros and cons
  ◦ Use them together?
Is Semantic Search useless in product search?
sections like YouTube
• Add diversity to SERPs but the mainstream search results don’t change
• Low risk, limited gain
• Having many options increases cognitive load
Separated UI Components
[Figure: results from Source 1 and Source 2, labeled Relevant, Relevant, Less Relevant]
• Cherry-picking the best results from multiple sources (Skimming Effect)
• It requires low cognitive load
• Recall issue is addressed
• RR (Reciprocal Rank):
  ◦ 1 / (k + rank), where
    ▪ rank is the item's position in the result list
    ▪ k is a constant (often 60) that determines the degree of top-heaviness
• TMM (Theoretical Min-Max):
  ◦ (score - score_min) / (score_max - score_min), where
    ▪ score_max = the score of the top-ranked result
    ▪ score_min = 0
• Borda Count:
  ◦ (N + 1 - rank), where
    ▪ N = the number of results in the list
1. Normalize (Transform) scores
TMM is said to be better in theory, but in practice, there is no significant difference (IMO)
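The three transforms can be sketched as follows. This is a minimal illustration, not the deck's implementation. It assumes each result list is already sorted best-first as `(doc_id, score)` pairs, and it takes the input to RR and Borda Count to be the item's 1-based rank. The function names are illustrative.

```python
from typing import Dict, List, Tuple

Results = List[Tuple[str, float]]  # (doc_id, raw score), sorted best-first


def reciprocal_rank(results: Results, k: int = 60) -> Dict[str, float]:
    """1 / (k + rank), with rank starting at 1."""
    return {doc: 1.0 / (k + rank) for rank, (doc, _) in enumerate(results, start=1)}


def theoretical_min_max(results: Results, score_min: float = 0.0) -> Dict[str, float]:
    """(score - score_min) / (score_max - score_min), score_max from the top result."""
    if not results:
        return {}
    span = results[0][1] - score_min
    if span <= 0:
        return {doc: 0.0 for doc, _ in results}
    return {doc: (score - score_min) / span for doc, score in results}


def borda_count(results: Results) -> Dict[str, float]:
    """N + 1 - rank, where N is the number of results in the list."""
    n = len(results)
    return {doc: float(n + 1 - rank) for rank, (doc, _) in enumerate(results, start=1)}


lexical = [("a", 12.0), ("b", 6.0), ("c", 3.0)]
rr = reciprocal_rank(lexical)
tmm = theoretical_min_max(lexical)
borda = borda_count(lexical)
```

RR and Borda Count discard the raw scores and keep only the ordering, while TMM preserves the relative gaps between scores.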
= 0.48
How NDCG changes with different α
When using RRF and there is no overlap between systems
↓
RR depends only on the rank, so items at the same rank get the same score
everyone says “Yes”, it should be more relevant (Chorus Effect)
  ◦ Max: If an expert says “Yes”, it should be more relevant (Dark Horse Effect)
• References
  ◦ An Analysis of Fusion Functions for Hybrid Retrieval (2023)
  ◦ Who's #1?: The Science of Rating and Ranking
3. Combine results
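The difference between the two combination rules is easy to see in code. The sketch below assumes the scores from each system have already been normalized to a comparable range and that an item missing from one system contributes 0 from that system. The `fuse` function and its arguments are hypothetical names, and no weighting between systems is applied.

```python
from typing import Dict, List, Tuple


def fuse(lexical: Dict[str, float], semantic: Dict[str, float],
         method: str = "sum") -> List[Tuple[str, float]]:
    """Combine two normalized score maps and return items sorted best-first."""
    if method not in ("sum", "max"):
        raise ValueError(f"unknown method: {method}")
    fused: Dict[str, float] = {}
    for doc in set(lexical) | set(semantic):
        lex = lexical.get(doc, 0.0)   # missing from a system: contributes 0
        sem = semantic.get(doc, 0.0)
        fused[doc] = lex + sem if method == "sum" else max(lex, sem)
    # Sort by score descending, then by id so ties are deterministic.
    return sorted(fused.items(), key=lambda item: (-item[1], item[0]))


lexical_scores = {"a": 0.51}
semantic_scores = {"a": 0.51, "b": 1.0}

by_sum = fuse(lexical_scores, semantic_scores, method="sum")
by_max = fuse(lexical_scores, semantic_scores, method="max")
```

With these inputs, sum ranks "a" first because both systems agree on it moderately (Chorus Effect), while max ranks "b" first because one system is very confident about it (Dark Horse Effect).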
long jacket that comes to …” may not be found in Lexical Search
SUM selects items from the top right
lexical score: 0.51 + semantic score: 0.51 > semantic score: 1.0
Relevance Judgements
Random Items
Search Logs
Evaluation using search logs makes the comparison unfair for Semantic Search
Emphasizes lexical matching signals
Rerankers can be trained using search logs but…
whether an item is relevant or not
    ▪ Relevance judgement is subjective
  ◦ Google has 170 pages of guidelines for annotators
• LLM as a Judge
  ◦ “LLM labellers can do better on this task than human labellers for a fraction of the cost” (link)
• Debiasing the dataset is still important
  ◦ Recall the importance of the proportion
significantly high
• Semantic Search may not be effective for all queries
Potential Gains vs. Costs (ROI)
Lexical Inverted Index
Semantic ANN Index
Can we simplify Lexical Search?
Is it possible to reduce the costs?
  ◦ Rank Fusion
Potential Gains vs. Costs (ROI)
• Costs
  ◦ Latency
  ◦ Throughput
  ◦ Cost / 1M queries
  ◦ GPU cost / month
  ◦ Cost for dataset creation
  ◦ Engineering cost
Optimal model might vary depending on the use case
Which search engine should we adopt?
What is the best way to show semantically matching items?
The cost of obtaining labels remains high
Latency is increasing
Do we need another search team dedicated to semantic search?
GPU shortage
Utilizing visual features is great but significantly slows down iteration
What’s the point if users don’t want semantic search?
How can we measure the opportunity size?
Bias in dataset
on our platform, they switch to Google.
It highlights the importance of investing in new technology to stay competitive.
“Keyword search is something that old people do”
Query: “product attribute attribute…” → “What I want is …”
Practical Approach
  ◦ Understand the user behavior
  ◦ Choose the right metrics
  ◦ Optimize the system while considering costs
• Basics are still important in the era of AI
  ◦ Just as Lexical Search can’t be replaced by Semantic Search, search engineers can’t be replaced by AI engineers (for now)
rejasupotaro
Any thoughts? 👉