Everything Old is New Again: Why Information Retrieval Still Powers AI Search

As generative AI reshapes how users discover information, many assume traditional search principles are obsolete. This talk argues the opposite. From ranking and relevance to retrieval pipelines and evaluation, modern AI Search systems are built on decades of information retrieval research. Attendees will learn how foundational IR concepts quietly underpin LLM-powered search experiences—and why understanding them is essential for building, optimising, and auditing AI-driven discovery systems.

Dawn Anderson

March 25, 2026

Transcript

  1. Who is Dawn Anderson?
     • SEO practitioner for almost 20 years
     • International SEO conference speaker since 2017
     • Boutique agency owner (Bertey)
     • SEO consultant
     • Information retrieval & AI search world interloper
     • Currently commissioned to write a book on AI SEO
  2. The 'R' in RAG Does the Heavy Lifting
     The generation is only as good as the context you provide it (retrieval).
  3. What is Information Retrieval (IR)?
     • The computer science field behind search as we know it
     • Many nuanced specialisms within the field
     • Close relatives, offspring and siblings in:
       o Natural language processing
       o Recommender systems
       o AI search / generative information retrieval
       o Knowledge graphs and structured data specialisms
  4. Chunking Strategies
     • There are many different types of chunking; the core split is fixed-size vs. semantic
     • Fixed-size chunking (by word, sentence or paragraph count) is very rigid
     • Semantic chunking takes meaning and context into account
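The fixed-size approach described above can be sketched in a few lines. This is a minimal illustration, not a production chunker; the function name, chunk size, and sample text are my own choices, and a semantic chunker would instead split on meaning boundaries (e.g. embedding-similarity drops between sentences).

```python
def fixed_size_chunks(text, size=5):
    """Split text into rigid chunks of `size` words each.

    A semantic chunker would split where the topic shifts instead
    of at an arbitrary word count -- this rigidity is the trade-off.
    """
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]

text = "Vector search retrieves passages. Chunk size shapes what context the model sees."
chunks = fixed_size_chunks(text, size=5)
# Note how the first chunk cuts mid-sentence: "Vector search retrieves passages. Chunk"
```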
  5. Metadata: The Unsung Hero
     • Tag chunks with metadata (date, author, source)
     • Allows for search filtering later
     • Classic IR approaches adapted for AI search
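Metadata filtering can be as simple as matching key-value pairs before (or after) vector retrieval. The chunk records and field names below are hypothetical examples, assuming chunks are stored as dictionaries alongside their text.

```python
chunks = [
    {"text": "Q3 revenue grew 12%.", "date": "2024-10-01", "source": "earnings"},
    {"text": "Q3 revenue grew 8%.",  "date": "2023-10-01", "source": "earnings"},
    {"text": "Office dog policy.",   "date": "2024-06-01", "source": "handbook"},
]

def filter_chunks(chunks, **criteria):
    """Keep only chunks whose metadata matches every criterion."""
    return [c for c in chunks if all(c.get(k) == v for k, v in criteria.items())]

# Restrict retrieval to the current earnings report only
recent_earnings = filter_chunks(chunks, source="earnings", date="2024-10-01")
```

Without the date filter, both earnings chunks would compete and the stale figure could win on similarity alone -- exactly the classic IR filtering problem the slide points to.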
  6. Precision vs. Recall - The Classic IR Trade-off
     • Precision: of all the documents we retrieved, how many were actually useful?
     • Recall: of all the useful documents in the database, how many did we manage to find?
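The two definitions above translate directly into set arithmetic. The document IDs here are made up for illustration:

```python
retrieved = {"d1", "d2", "d3", "d4"}        # documents the system returned
relevant  = {"d2", "d4", "d5", "d6", "d7"}  # documents that are actually useful

true_positives = retrieved & relevant        # {"d2", "d4"}

precision = len(true_positives) / len(retrieved)  # 2 of 4 retrieved were useful
recall    = len(true_positives) / len(relevant)   # 2 of 5 useful docs were found
```

Retrieving more documents tends to raise recall while dropping precision, which is why this is called a trade-off.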
  7. Semantic Similarity (K-Nearest Neighbors)
     • How vector search ranks results
     • Distance between the query vector and document vectors
     • The query acts as the centroid
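A bare-bones sketch of how k-nearest-neighbour ranking works: embed the query, score every document by cosine similarity, keep the top k. The three-dimensional vectors are toy values standing in for real embeddings.

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

query = [1.0, 0.2, 0.0]          # the query vector is the "centroid" we rank around
docs = {
    "doc_a": [0.9, 0.3, 0.1],
    "doc_b": [0.1, 0.8, 0.9],
    "doc_c": [1.0, 0.1, 0.0],
}

# k-nearest neighbours: sort all documents by similarity to the query, keep top k
k = 2
top_k = sorted(docs, key=lambda d: cosine(query, docs[d]), reverse=True)[:k]
```

Real systems use approximate nearest-neighbour indexes rather than this exhaustive scan, but the ranking principle is the same.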
  8. The Flaws of Vector Search - Why vector search isn't a silver bullet
     • "Cat chasing dog" vs. "dog chasing cat": in vector search these look mostly the same, but obviously they are NOT
     • Vector search struggles with exact phrasing, part numbers, and negations
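The cat/dog point can be made starkly with a word-count representation. Dense embeddings are not literally bags of words, but they inherit some of the same order-insensitivity; in the extreme bag-of-words case, the two opposite sentences collapse to the *identical* vector:

```python
from collections import Counter

def bag_of_words(text):
    """Order-blind word-count vector for a sentence."""
    return Counter(text.lower().split())

a = bag_of_words("cat chasing dog")
b = bag_of_words("dog chasing cat")
# Same words, same counts -> identical vectors, even though who is
# chasing whom has been reversed. Word order is simply discarded.
```

This is why hybrid setups keep a lexical/exact-match component alongside the vector index for phrases, part numbers, and negations.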
  9. Learning to Rank (LTR)
     • Using machine learning to optimise ranking
     • Training a model to weigh different signals
     • What matters most in different contexts for different queries?
     • Tune from learnings
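At its simplest, "weighing different signals" means a scoring function over per-document features. The signals, documents, and weights below are hand-set for illustration only; in real LTR (e.g. pointwise regression or LambdaMART-style models) the weights are *learned* from labelled relevance judgements or click data.

```python
# Hypothetical per-document ranking signals (all values invented).
signals = {
    "doc_a": {"bm25": 0.8, "semantic": 0.4, "freshness": 0.9},
    "doc_b": {"bm25": 0.5, "semantic": 0.9, "freshness": 0.2},
}

# In a trained LTR model these weights come from the training loop,
# not from a human; different query types can get different weightings.
weights = {"bm25": 0.5, "semantic": 0.4, "freshness": 0.1}

def score(doc):
    """Linear combination of weighted signals for one document."""
    return sum(weights[s] * v for s, v in signals[doc].items())

ranked = sorted(signals, key=score, reverse=True)
```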
  10. Context Injection in LLM Models – Advanced RAG Technique
      Source: https://apxml.com/courses/getting-started-rag/chapter-4-rag-generation-augmentation/context-injection-methods
  11. Fallback Strategies
      • What to do when no relevant data is found
      • Good LLM systems should return "I don't know", but often they just guess the most probable answer
      • The LLM must be explicitly instructed to answer "I don't know"
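One common way to enforce this is a retrieval-confidence gate in front of generation. The function, threshold value, and result format below are illustrative assumptions; in practice the "I don't know" instruction also belongs in the system prompt itself.

```python
def answer_with_fallback(results, threshold=0.75):
    """Refuse to answer when retrieval confidence is too low.

    `results` is a list of (passage, similarity_score) pairs;
    the 0.75 threshold is an arbitrary example value to tune.
    """
    if not results or max(score for _, score in results) < threshold:
        return "I don't know"
    best_passage, _ = max(results, key=lambda r: r[1])
    return f"Based on the retrieved context: {best_passage}"
```

With no results, or only weak matches, the system declines instead of letting the model guess.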
  12. Classic IR Metrics
      1. MRR (Mean Reciprocal Rank)
      2. NDCG (Normalised Discounted Cumulative Gain)
      3. MAP (Mean Average Precision)
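Of the three, MRR is the simplest to compute: for each query, take the reciprocal of the rank of the first relevant result, then average across queries. The rankings below are invented; NDCG and MAP follow the same per-query-then-average pattern with more nuanced per-query scores.

```python
def mrr(rankings):
    """Mean Reciprocal Rank over a list of per-query result lists.

    Each ranking is a list of booleans: True where the result at
    that position was relevant, in rank order.
    """
    total = 0.0
    for ranking in rankings:
        for rank, is_relevant in enumerate(ranking, start=1):
            if is_relevant:
                total += 1.0 / rank
                break  # only the FIRST relevant hit counts for MRR
    return total / len(rankings)

# Two example queries: first relevant hit at rank 1 and rank 3
queries = [
    [True,  False, False],
    [False, False, True],
]
# MRR = (1/1 + 1/3) / 2 = 2/3
```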
  13. RAG Evaluation
      • Was good context retrieved?
      • Did the LLM stick to the context?
      • Did it actually answer the user's question?
  14. The Future - Beyond RAG: Agentic Search
      The shift from single queries to multi-step reasoning.
  15. Respect the Foundations
      Generative AI may be the shiny user interface, but IR is the reliable pipeline.
  16. Thank you
      • X – dawnieando
      • LinkedIn – MsDawnAnderson
      • Threads – dawnieando
      • Bluesky – dawnieando
      • Bertey.com