The Bag-of-Documents Model for Query Understanding and Retrieval

This talk, presented as part of a series for Doug Turnbull and Trey Grainger's class on AI-powered search, explains how to use the bag-of-documents model to align query and document representations and thus improve query understanding and retrieval. It also links to GitHub and Hugging Face resources that you can use to explore and apply the approach yourself.

Daniel Tunkelang

May 19, 2026

Transcript

  1. Overview
     • What is the bag-of-documents model?
       ◦ A way to align query and document representations.
     • How do you implement it?
       ◦ Aggregate queries and relevance judgments into bags.
       ◦ Use bags to fine-tune a base model.
     • What are its pros and cons?
       ◦ Often beats alternatives, but can be tripped up.
  2. Useful Resources
     • GitHub: https://github.com/dtunkelang/bag-of-documents
     • Hugging Face:
       ◦ https://huggingface.co/datasets/dtunkelang/bag-of-documents
       ◦ https://huggingface.co/spaces/dtunkelang/bag-of-documents-demo
       ◦ https://huggingface.co/spaces/dtunkelang/bag-of-documents-bestbuy-demo
     • Semantic Equivalence of e-Commerce Queries (KDD ’23)
  3. What is the bag-of-documents model?
     • A way to align query and document representations by modeling queries as distributions of document vectors.
     • For queries where we have a distribution of relevant documents, we just aggregate their vectors into “bags”.
     • To generalize, we fine-tune the document base model, using the query bags as targets.
  4. Query Understanding
     • To use dense embedding-based retrieval, it is essential to have robust query vectors.
     [Example vector shown: [−0.9704, 0.2045, 0.1281, …]]
  5. Aggregating Relevant Document Vectors
     [Diagram labels: Query ("mens black tshirts"), Relevant Documents, base model, Bag. Example vectors shown: [0.13, 0.81, …], [0.09, 0.75, …], [0.98, 0.77, …], …, [0.11, 0.79, …]]
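
The aggregation step on this slide can be sketched in a few lines. This is a minimal illustration, assuming the bag is summarized by the normalized mean (centroid) of the relevant document vectors; the slide does not say which aggregation function is used, and `l2_normalize` and `bag_centroid` are hypothetical helper names. The toy 2-d vectors stand in for real encoder output.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (a zero vector is returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        return list(vec)
    return [x / norm for x in vec]


def bag_centroid(doc_vectors):
    """Summarize a bag as the normalized mean of its unit-length document vectors.

    Assumption: mean pooling is used here for illustration only.
    """
    if not doc_vectors:
        raise ValueError("need at least one relevant document vector")
    unit_vectors = [l2_normalize(v) for v in doc_vectors]
    dim = len(unit_vectors[0])
    mean = [sum(v[i] for v in unit_vectors) / len(unit_vectors) for i in range(dim)]
    return l2_normalize(mean)


# Toy 2-d vectors for documents judged relevant to "mens black tshirts".
relevant_docs = [[0.13, 0.81], [0.09, 0.75], [0.11, 0.79]]
query_vector = bag_centroid(relevant_docs)
print(query_vector)
```

Normalizing before averaging keeps long vectors from dominating the centroid, which matters when retrieval uses cosine similarity.
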
  6. Sources for Relevance Judgments
     • Implicit judgments from engagement: clicks, etc.
     • Explicit judgments (qrels) from human raters.
     • Automated judgments from LLMs, cross-encoders.
     • Can mix and match any or all of the above!
  7. Ranking ≠ Relevance!
     • Explicit and automated judgments focus on relevance.
     • Engagement-based judgments conflate relevance with desirability.
     • LLMs and cross-encoders vary widely in quality and cost.
     • Judgments are the foundation, so invest effort here!
  8. Fine-Tuning a Retrieval Model using Bags
     • Start with the same encoder used for the index, e.g., MiniLM.
     • Turn the bags into training data.
     • Fine-tune the base model using this training data.
     • Key decision is the choice of loss function.
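
To make "turn the bags into training data" concrete, here is a rough sketch under stated assumptions: each training example pairs a query string with a target vector (here the mean of its relevant document vectors), and the loss compares the encoder's query vector with that target. The slide does not name the loss functions under consideration; cosine distance and mean squared error are shown only as two common candidates. All function names and the toy data are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def build_training_pairs(judgments, doc_vectors):
    """Turn {query: [doc_id, ...]} judgments into (query, target_vector) pairs."""
    pairs = []
    for query, doc_ids in judgments.items():
        vectors = [doc_vectors[d] for d in doc_ids if d in doc_vectors]
        if not vectors:
            continue  # no usable positives for this query
        dim = len(vectors[0])
        target = [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]
        pairs.append((query, target))
    return pairs


def cosine_loss(predicted, target):
    """Candidate loss 1: penalizes direction only."""
    return 1.0 - cosine_similarity(predicted, target)


def mse_loss(predicted, target):
    """Candidate loss 2: penalizes direction and magnitude."""
    return sum((p - t) ** 2 for p, t in zip(predicted, target)) / len(target)


# Toy judgments and document vectors (illustrative values only).
judgments = {"mens black tshirts": ["d1", "d2"], "usb c cable": ["d3"]}
doc_vectors = {"d1": [1.0, 0.0], "d2": [0.0, 1.0], "d3": [0.5, 0.5]}
training_pairs = build_training_pairs(judgments, doc_vectors)
print(training_pairs)
```

In an actual fine-tuning run, the predicted vector would come from the encoder being trained, and the chosen loss would be minimized over these pairs by a training framework.
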
  9. Using the Bag-of-Documents Model for Reranking
     • Retrieve candidates using lexical or dense retrieval.
     • Score each candidate with a BoD-trained encoder.
     • Use one – or multiple – BoD-trained rerankers.
     • Reranking sometimes outperforms BoD-based retrieval!
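
The reranking flow above can be sketched as follows. This assumes candidates arrive from a first-stage retriever as (document id, vector) pairs and that scoring is cosine similarity against the query vector; in a real setup the query vector would come from the BoD-trained encoder, whereas here both inputs are toy values. `rerank` is a hypothetical helper, not a function from the linked repository.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rerank(query_vector, candidates):
    """Sort first-stage candidates by similarity to the query vector, best first.

    candidates: list of (doc_id, doc_vector) pairs.
    Returns a list of (doc_id, score) pairs.
    """
    scored = [(doc_id, cosine_similarity(query_vector, vec)) for doc_id, vec in candidates]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy inputs: a query vector and three candidates from a first-stage retriever.
query_vector = [0.1, 0.9]
candidates = [("doc_a", [0.9, 0.1]), ("doc_b", [0.2, 0.8]), ("doc_c", [0.5, 0.5])]
ranked = rerank(query_vector, candidates)
print([doc_id for doc_id, _ in ranked])
```
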
  10. Try It Yourself!
      Locally:
      - git clone https://github.com/dtunkelang/bag-of-documents
      - cd bag-of-documents
      - pip install -r requirements.txt
      - bash scripts/run_esci_us_demo.sh
      - python demo.py
      Live Demos on Hugging Face:
      - https://huggingface.co/spaces/dtunkelang/bag-of-documents-demo
      - https://huggingface.co/spaces/dtunkelang/bag-of-documents-bestbuy-demo
  11. To Bag or Not To Bag?
      • Cluster hypothesis must hold: documents relevant to the same query must cluster under the base encoder.
      • Base model needs to have headroom for improvement.
      • There have to be multiple relevant positives per query.
      • The generalization lift must exceed the specialization tax.
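
One rough way to probe the cluster-hypothesis condition is to compare how similar documents are within a bag against how similar they are across bags. The sketch below is an assumed diagnostic, not a procedure described in the talk; the function names and the toy bags are hypothetical, and real vectors would come from the base encoder.

```python
import math
from itertools import combinations, product


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def mean_intra_bag_similarity(bags):
    """Average similarity over all document pairs that share a query."""
    sims = [
        cosine_similarity(a, b)
        for vectors in bags.values()
        for a, b in combinations(vectors, 2)
    ]
    return sum(sims) / len(sims) if sims else float("nan")


def mean_cross_bag_similarity(bags):
    """Average similarity over all document pairs drawn from different queries."""
    sims = [
        cosine_similarity(a, b)
        for bag_a, bag_b in combinations(bags.values(), 2)
        for a, b in product(bag_a, bag_b)
    ]
    return sum(sims) / len(sims) if sims else float("nan")


# Toy bags: query -> vectors of its relevant documents under the base encoder.
bags = {
    "mens black tshirts": [[0.9, 0.1], [0.8, 0.2], [0.85, 0.15]],
    "usb c cable": [[0.1, 0.9], [0.2, 0.8]],
}
intra = mean_intra_bag_similarity(bags)
cross = mean_cross_bag_similarity(bags)
print(f"intra-bag: {intra:.3f}, cross-bag: {cross:.3f}")
```

If the intra-bag average is not clearly higher than the cross-bag average, the relevant documents do not cluster under the base encoder, and bags built from them may be poor training targets.
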
  12. Take-Aways
      • Align query and document representations.
      • Aggregate queries and relevance judgments into bags.
        ◦ Judgments are the foundation, so invest effort here!
      • Use bags to fine-tune a base model.
      • Use model for retrieval or reranking.
      • Grab code and models from GitHub and Hugging Face!
  13. Thank You!
      Daniel Tunkelang
      [email protected]
      https://www.linkedin.com/in/dtunkelang/
      https://dtunkelang.medium.com/
      https://queryunderstanding.com/
      http://contentunderstanding.com/
      GitHub:
      https://github.com/dtunkelang/bag-of-documents
      Hugging Face:
      https://huggingface.co/datasets/dtunkelang/bag-of-documents
      https://huggingface.co/spaces/dtunkelang/bag-of-documents-demo
      https://huggingface.co/spaces/dtunkelang/bag-of-documents-bestbuy-demo