
How Search Engines Really Work in 2023

Michael King

June 28, 2023


  1. 2

  2. 3

  3. 5 Disclaimer: Just because something is in a patent or a whitepaper does not mean that Google uses it… but it probably does.

  4. 7 So, I Have a Book Coming Soon: The Science of SEO: Decoding Search Engine Algorithms. This is the cover that my publisher sent me. I’m not a fan, but we’ll see what happens. Anyway, you can preorder it wherever books are sold. Here’s the Amazon link. https://amzn.to/3T9qkYN
  5. 8 Thesis: Search engines are not magic. You can deeply understand them if you learn more about Information Retrieval and pay close attention to engineering research.

  6. 9 If you pay enough attention, they are telling you everything you want to know about how Google works.
  7. 11 Meet Mortimer Taube. Taube invented the “Uniterm Indexing System” in 1951 because he felt the Dewey Decimal System could not keep up with the pace of information after the war. This is the basis of what is called an Inverted Index, the data structure behind what we think of as an “index” in search engines.

  8. 12 Meet Hans Peter Luhn. This guy invented the concept of Term Frequency in his paper “The Automatic Creation of Literature Abstracts.” He also invented hashmaps, but we’ll talk about that later.

  9. 13 Meet Karen Spärck Jones. Spärck Jones is considered the godmother of IR. She contributed to information retrieval in many ways, but she is best known for inventing Inverse Document Frequency in the 1970s.

  10. 15 Meet Gerard Salton. Gerard Salton is the godfather of Information Retrieval. Much of how search engines of all kinds work is based on methods that he and his team invented.
  11. 16 Gerard Salton invented the Vector Space Model. In the vector space model, documents and queries are converted to vector representations and plotted in multi-dimensional space. The query and document vectors are then compared based on cosine similarity, and the ones closest to the query are the most relevant. The main takeaway here is that relevance is a quantitative value. This is perhaps the most important concept to understand about how search works.
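The point that relevance is a quantitative value can be shown in a few lines. A minimal sketch of the cosine-similarity comparison over toy three-dimensional vectors (real systems use far higher-dimensional representations):

```python
import math

def cosine_similarity(a, b):
    # Cosine similarity: dot product divided by the product of vector magnitudes.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy 3-dimensional vectors standing in for a query and two documents.
query = [1.0, 2.0, 0.0]
doc_a = [1.0, 1.9, 0.1]   # nearly parallel to the query, so highly "relevant"
doc_b = [2.0, 0.0, 1.0]   # points elsewhere, so less relevant

print(cosine_similarity(query, doc_a) > cosine_similarity(query, doc_b))  # True
```

The "most relevant" document is simply the one whose vector points in nearly the same direction as the query vector.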
  12. 20 Amit Singhal Rewrote Google Search in 2001. And he was nice enough to tell us exactly how he did it. http://singhal.info/ieee2001.pdf

  13. 22 Meet Brian Pinkerton. Brian built the first commercially available web-scale search engine, based on a crawler called WebCrawler, in 1994. He wrote about it extensively for his PhD thesis. http://www.thinkpink.com/bp/Thesis/Thesis.pdf

  14. 23 Every Search Engine is Based on WebCrawler. Every search engine can trace its roots back to WebCrawler to some degree. In fact, Lycos, AltaVista, and Google all reference it in their early papers and patents. You know why page titles have been so important for so long? Early search engines only indexed page titles.
  15. 27 Meet Jeff Dean. This is Jeff Dean. He’s had an engineering hand in many of Google’s most important innovations ever.

  16. 28 Jeff Talks About How He Ended Up at Google from AltaVista: https://www.quora.com/What-was-it-like-to-work-on-the-AltaVista-team-in-the-90s?ch=10&oid=960520&share=21e5a871&srid=uHsr&target_type=question

  17. 30 Fun Fact: I’m the only SEO that Jeff follows, so by the principles of PageRank, I’m the greatest SEO of all time.

  18. 31 Tweets is Watching. The SVP that runs Search at Google is also following me, so if anything happens to me…
  19. 32

  20. 33 Google Operates as a Shared Environment. All the software across the ecosystem can be installed on any machine, and any process can be run on any machine. For example, a crawler could also run on a machine that is managing rendering, processing, or anything else.

  21. 34 Fun Fact: Penguin was built on top of Panda. Panda was a group-specific modification factor computed as a function of the number of independent links divided by the number of reference queries. Penguin built on top of that quality score and applied it to links.

  22. 36 At a Base Level, This is What All Search Engines Do. Fundamentally, this is the basis of how search engines function. Google has developed many layers on top of this, but this is the core of what they all do.
  23. 38 We know this, but there is a single set of innovations that sped Google past the SEO community.

  24. 39 Lexical Search vs. Semantic Search are the Two Primary Search Models. What we as the SEO community do not have a strong enough handle on is that most of what Google’s doing is on the semantic side, and that has all improved dramatically over the last 10 years based on machine learning.

  25. 40 Vector Space Model Again. Let’s go back to the vector space model again. This model is a lot stronger in the neural network environment because Google can capture more meaning in the vector representations.

  26. 42 This Allows for Mathematical Operations. Comparisons of content and keywords become linear algebraic operations.

  27. 43 Relevance is a Function of Cosine Similarity. When we talk about relevance, it is determined by how similar the vectors for documents and queries are. This is a quantitative measure, not the qualitative idea of how we typically think of relevance.
  28. 44 TF-IDF Vectors. The vectors in the vector space model were built from TF-IDF. These were simplistic, based on the Bag-of-Words model, and they did not do much to encapsulate meaning.
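A minimal sketch of how such TF-IDF bag-of-words vectors are computed. The toy corpus, whitespace tokenization, and plain log IDF are illustrative simplifications of the classic weighting scheme:

```python
import math
from collections import Counter

docs = [
    "the rock is an actor",
    "the rock is a movie",
    "embeddings capture meaning",
]

def tf_idf_vector(doc, corpus):
    # Bag-of-words: term frequency weighted by inverse document frequency.
    terms = doc.split()
    tf = Counter(terms)
    n_docs = len(corpus)
    vector = {}
    for term, count in tf.items():
        df = sum(1 for d in corpus if term in d.split())  # document frequency
        idf = math.log(n_docs / df)
        vector[term] = (count / len(terms)) * idf
    return vector

vec = tf_idf_vector(docs[0], docs)
# "the" appears in two of three docs while "actor" appears in one,
# so "actor" gets more weight even though both occur once in this doc.
print(vec["actor"] > vec["the"])  # True
```

Notice the weakness the slide describes: the vector knows nothing about what "actor" means, only how rare the string is.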
  29. 45 Word2Vec Gave Us Embeddings. Word2Vec was an innovation led by Tomas Mikolov and Jeff Dean that yielded an improvement in natural language understanding by using neural networks to compute word vectors. These were better at capturing meaning. Many follow-on innovations like Sentence2Vec and Doc2Vec would follow.
  30. 51 Dense Retrieval. You remember “passage ranking?” This is built on the concept of dense retrieval, wherein there are more embeddings representing more of the query and the document to uncover deeper meaning.

  31. 53 Introducing Google’s Version of Dense Retrieval. Google introduces the idea of “aspect embeddings,” which is a series of embeddings that represent the full elements of both the query and the document and give stronger access to deeper information.

  32. 54 Dense Representations for Entities. Google has improved its entity resolution using embeddings, giving them stronger access to information in documents.

  33. 55 Embeddings keep getting better at capturing meaning while SEO tools still operate on the Lexical Search model.
  34. 64 How Google Crawls the Web. Most of the magic happens in the URL manager. The crawler simply accesses a page and extracts it. The processing pipeline handles most of the actual parsing. Source: Distributed Crawling of Hyperlinked Documents https://patents.google.com/patent/US8812478B1/en

  35. 66 Crawling is Stateless. Googlebot does not hold a “state.” Although it has the capabilities to, it does not maintain cookies, fill out forms, or make POST requests. Every page it looks at is as though it logged on to the web for the first time.

  36. 67 Google Crawls with a Very Tall Viewport. As we know, Googlebot is crawling mobile-first primarily, but there are limits to what it will see with infinite scroll.
  37. 68 Crawl Models. Typical IR crawl models are breadth-first (the whole level is reviewed before moving down) or depth-first (each path is followed to its last node before moving on). Google uses a “best-first” model following PageRank.
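A sketch of the best-first idea using a priority queue. The `scores` dict here stands in for a PageRank-like priority and the whole scheduler is an assumption for illustration, not Google's actual system:

```python
import heapq

def best_first_crawl(seed, links, scores, limit=10):
    # Best-first crawl: always fetch the highest-scored known URL next,
    # rather than strictly by level (breadth-first) or path (depth-first).
    # heapq is a min-heap, so push negative scores to pop the best URL first.
    frontier = [(-scores.get(seed, 0.0), seed)]
    seen = {seed}
    order = []
    while frontier and len(order) < limit:
        _, url = heapq.heappop(frontier)
        order.append(url)
        for nxt in links.get(url, []):  # discover outlinks of the fetched page
            if nxt not in seen:
                seen.add(nxt)
                heapq.heappush(frontier, (-scores.get(nxt, 0.0), nxt))
    return order

# Hypothetical link graph and priority scores.
links = {"home": ["about", "blog"], "blog": ["post1"]}
scores = {"home": 1.0, "about": 0.2, "blog": 0.8, "post1": 0.5}
print(best_first_crawl("home", links, scores))  # ['home', 'blog', 'post1', 'about']
```

Note that `post1` jumps the queue ahead of `about` purely because of its score; that is the behavior that distinguishes best-first from the two textbook orders.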
  38. 69 Where Does “Search Engines Only Crawl 5 Levels Deep” Come From? A paper by IR legend and Yahoo researcher Ricardo Baeza-Yates entitled “Crawling the Infinite Web” identified that crawling only five levels deep is enough to get the most valuable content on the web. https://chato.cl/papers/baeza04_crawling_infinite_web.pdf

  39. 70 Crawl Frequency Estimation. Google would love to use your dates from Schema and your lastmod from your sitemap, but they can’t trust them. So, they keep every version of your content that they crawl, and they make determinations on how frequently pages change to decide how often to crawl the page.

  40. 71 They May Stop Crawling Based on URL Patterns. If Google believes that a URL pattern is going to yield less value when they crawl the page, they will stop crawling all URLs that fit that pattern.

  41. 72 How XML Sitemaps Come Into Play. Google downloads XML sitemaps regularly with a separate crawler to update their “per site” database. That database informs the list of URLs that go to the scheduler, and it treats “differential sitemaps” with higher priority. There’s also a secondary crawler system for URLs in XML Sitemaps.
  42. 74 The Generative AI Hack to Increase Crawl for a Page. A good way to improve crawl is by updating your pages regularly. An automated way to make a page change is by putting an NLG summary at the top of the page and updating it frequently.

  43. 76 Crawl Rate Limiting. How often Google crawls is a function of how much load a host can handle. Increase the capacity and they will crawl more.

  44. 77 Crawl Demand. This is an area where social signals used to play a heavy role, but crawl demand is mostly a function of PageRank.
  45. 78 On Crawl Budget (via @JoriFord):
     Crawl Budget = Server Response Time x Time / Error Rate
                  = TTFB x Duration / % Server Error
                  = (Avg. TTFB x Duration / % Server Error) x (CTR x Average Time between page updates)
                  = (Avg. # of Crawled URLs x Frequency) / Time
  46. 79 How Google Handles Pages that Don’t Change. Pages that have either explicitly or implicitly indicated that they don’t change (304 response code) are basically put on a timeout for a while, and Google will reuse what it has in the index. That cache expiry refreshes on a set interval.

  47. 80 What About IndexNow? I don’t see Google joining this initiative because of the cross-search-engine URL submission requirement. I could imagine them coming up with their own version of the spec, though.

  48. 81 The Best Things to Do to Get More Crawl Activity:
     • Load balance: route Googlebot to its own autoscaling instances by IP
     • Submit differential sitemaps
     • Update your pages regularly
     • Align lastmod with the structured data date and the on-page date
     • Make sure your robots.txt never returns a 500
     • Track your crawl budget metrics

  49. 84 Back in the day, Google only indexed the first 100KB of the page. Now they do 15MB.
  50. 86 Documents are Parsed and Stored in an Inverted Index. An inverted index is like an index in a book, where each word is mapped to the documents it appears in.
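The book-index analogy maps directly to code. A minimal inverted index over a toy corpus, with query answering as an intersection of posting lists (real engines store far more per posting, such as positions and frequencies):

```python
from collections import defaultdict

def build_inverted_index(docs):
    # Map each term to the set of document IDs it appears in,
    # like a book index mapping words to page numbers.
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

docs = {
    1: "the rock is an actor",
    2: "the rock is a movie",
    3: "embeddings capture meaning",
}
index = build_inverted_index(docs)

# Answering a multi-term query is an intersection of posting lists.
print(index["rock"] & index["movie"])  # {2}
```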
  51. 87 Phrase-Based Indexing was a Key Google Innovation. Before Anna Patterson led the phrase-based indexing initiative, search engines built inverted indexes on single phrases and then built posting lists at the intersections of phrases in queries. Phrase-based indexing upended this and introduced phrase co-occurrence and predictive modeling based on those phrases.

  52. 88 Google Saves Versions of Your Pages Forever in the Document Server. There are a variety of operations that Google does based on your content over time. So they have cached versions from the first time a page appeared.

  53. 90 Crawl Tiering Based on Update. The index is stored in multiple tiers across many machines, split into three tiers based on how important the page is. Super important and regularly accessed pages are stored in memory. Pages of medium importance are stored on solid state drives for fast reads. Pages that are not so important are stored on standard HDDs, since they are cheap and don’t need to be fast. Source: Distributed Crawling of Hyperlinked Documents https://patents.google.com/patent/US8812478B1/en

  54. 91 Deduplication and Canonicalization. Deduplication and canonicalization are handled through a series of fingerprints and comparisons. There are many signals that inform this process, such as links, redirects, alternates, etc. Google uses a machine learning classifier to make the final canonical determination.

  55. 92 The Best Things to Improve Indexing:
     • Limit duplication with more unique content per page
     • Limit your cannibalization through your anchor text
     • Update your pages regularly
  56. 96 The Web Rendering System Closes the Gap. The Web Rendering System uses a modified version of headless Chromium to render pages. It has different behaviors than a user’s browser, like how it handles randomness, dates, and service workers. It doesn’t paint pixels because there’s no reason to, but it will stop executing if a process takes up too much CPU.

  57. 97 Rendering is Separate Because It’s Computationally Expensive. The WRS is not going to render every page unless it believes it’s worthwhile.

  58. 98 Websites Have Many Options for Rendering These Days. Google handles SSR the best, obviously, but they can access your content with any of these models.

  59. 99 The Best Things to Improve Rendering:
     • SSR, if you can
     • Make your pages worth rendering if you can’t
     • Monitor your crawl volume vs. CPU usage
  60. 106 Websites as Vectors. Just as there are representations of pages as embeddings, there are vectors representing websites and authors.

  61. 107 Author Vectors. Similarly, Google has Author Vectors, wherein they are able to identify an author and the subject matter that they discuss. This allows them to fingerprint an author and their expertise.

  62. 108 Build Your Links Contextually. If you’re still building links, it’s very likely that they have ramped up the capabilities around relevance between pages for links. They are likely discounting pages that are not close relevance matches anymore.
  63. 110 Google Is Now a Series of Over 200 Microservices Running in Parallel.

  64. 114 Expansions Are Scored. The different versions of the query are scored; they may be run in parallel, the results are scored, and the best set is returned.

  65. 117 Here’s an example: [The Rock]. It’s also relevant to a movie called “The Rock.”

  66. 118 [The Rock imdb]. Google’s not sure what you mean here, so it’s showing both.

  67. 119 [The Rock] is expanded to [The Rock actor] in the background.

  68. 122 Neural Matching to Determine the Meaning of the Query. Again, with the embeddings!
  69. 124 Document Scoring Simplified: Content Factor + Content Factor + Speed Factor + Link Factor + Link Factor = Document Score

  70. 125 Each Component of the Equation Has a Weight: a·Content Factor + b·Content Factor + c·Speed Factor + d·Link Factor + e·Link Factor = Document Score

  71. 126 The Weights May Look Like This: 3·Content Factor + 6·Content Factor + 1·Speed Factor + 2·Link Factor + 2·Link Factor = Document Score

  72. 127 This is What Marketers Do: with factor values of 5, 2, 4, 95, and 74, the weighted sum is 3·5 + 6·2 + 1·4 + 2·95 + 2·74 = 369.

  73. 128 So, Then, Google Turns Down the Weights on Links: with the same factor values but weights of 3, 6, 1, 0.25, and 0.01, the score drops to 55.49.
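The arithmetic on the last two slides can be checked directly. This weighted sum is the deck's deliberately simplified model of document scoring, not an actual Google formula:

```python
def document_score(factors, weights):
    # Document score as a weighted sum of ranking factors.
    return sum(f * w for f, w in zip(factors, weights))

# Factor values from the slides: two content factors, speed, and two link factors.
factors = [5, 2, 4, 95, 74]

print(document_score(factors, [3, 6, 1, 2, 2]))                 # 369
print(round(document_score(factors, [3, 6, 1, 0.25, 0.01]), 2))  # 55.49
```

The marketer who inflated the link factors to 95 and 74 loses most of that score the moment the last two weights are turned down.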
  74. 129 Google Understands Queries by Breaking Them Into Entities. Leveraging entities allows queries to be expanded and related entities and attributes to be discovered.

  75. 130 Google’s Scoring Functions. There’s more than one scoring function. Google scores content and links in a variety of different ways and then chooses the best results. There is not just one “algorithm.” This is why different queries seem to value signals differently.

  76. 131 Post-retrieval Adjustments. In addition to there being multiple scoring functions with different results to choose from, Google may make further re-ranking adjustments based on any number of features and factors. So, really, anything could happen in the SERPs.

  77. 132 When Amit Singhal ran Google Search, he was famously against using machine learning in rankings.

  78. 136 John Giannandrea from Google Brain took over and certainly did not have that bias.
  79. 137 Learning to Rank. Learning to Rank is using supervised machine learning for information retrieval systems.

  80. 138 Learning to Rank Requires One of Two Things:
     • Human-reviewed quality scores
     • Implicit user feedback

  81. 139 Google Has the Quality Rater Program. The quality ratings are not just for evaluation. They act as the feature-engineered data that trains the learning-to-rank models.
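A toy sketch of the pairwise flavor of learning to rank: given pairs where raters preferred one document over another, learn feature weights so the preferred document scores higher. This perceptron-style update and the two-feature documents are illustrative assumptions; production systems use gradient-boosted trees or neural models:

```python
def train_pairwise(pairs, lr=0.1, epochs=50):
    # Pairwise learning to rank: for every labeled pair (better, worse),
    # nudge the weight vector until the preferred document scores higher.
    dim = len(pairs[0][0])
    w = [0.0] * dim
    for _ in range(epochs):
        for better, worse in pairs:
            margin = sum(wi * (b - wr) for wi, b, wr in zip(w, better, worse))
            if margin <= 0:  # pair is ranked incorrectly: update the weights
                w = [wi + lr * (b - wr) for wi, b, wr in zip(w, better, worse)]
    return w

# Features: [content score, link score]; raters preferred the first doc of each pair.
pairs = [([0.9, 0.2], [0.4, 0.8]), ([0.8, 0.1], [0.3, 0.6])]
w = train_pairwise(pairs)
score = lambda doc: sum(wi * x for wi, x in zip(w, doc))
print(score([0.9, 0.2]) > score([0.4, 0.8]))  # True
```

This is the sense in which rater labels "train" ranking: they supply the preference pairs the model learns from.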
  82. 147 The inputs that we control have not changed, but our understanding of what Google is doing with them needs to.

  83. 148 What about the new Search Generative Experience (SGE)?

  84. 149 At I/O, Google Announced a Dramatic Change to Search. The experimental “Search Generative Experience” brings generative AI to the SERPs and significantly changes Google’s UX.
  85. 150 Queries are Longer and the Featured Snippet is Bigger.
     1. The query is more natural language and no longer Orwellian Newspeak. It can be much longer than the 32 words it has been limited to historically.
     2. The Featured Snippet has become the “AI snapshot,” which takes 3 results and builds a summary.
     3. Users can also ask follow-up questions in conversational mode.

  86. 151 This is Called “Retrieval Augmented Generation.” Neeva, Bing, and now Google’s Search Generative Experience all pull documents based on search queries and feed them to a language model to generate a response.

  87. 152 Google’s Version of this is called Retrieval-Augmented Language Model Pre-Training (REALM).

  88. 153 SGE is built from REALM + PaLM 2 and MUM. MUM is the Multitask Unified Model that Google announced in 2021 as a way to do retrieval augmented generation. PaLM 2 is their latest state-of-the-art large language model.
  89. 154 If You Want More Technical Detail, Check Out This Paper: https://arxiv.org/pdf/2002.08909.pdf

  90. 155 It’s Experimental Because It’s Error-prone. Bing and ChatGPT lit a competitive fire under Google, but they have been working on these technologies for years. They were slow to release because of the various ways that LLMs are likely to return disinformation.

  91. 156 The Experience May Also Pollute Search Quality. The experience of a response from Google suggests that there is a person giving the response. The generative text may also conflict with other aspects returned in search.
  92. 158 The Search Demand Curve Will Shift. With the change in the level of natural language query that Google can support, we’re going to see a lot fewer head terms and a lot more long-tail terms.

  93. 159 The CTR Model Will Change. With the search results being pushed down by the AI snapshot experience, what is considered #1 will change. We should also expect that any organic result will be clicked less and that standard organic CTR will drop dramatically. However, this will likely yield query displacement.

  94. 160 Rank Tracking Will Be More Complex. As an industry, we’ll need to decide what is considered the #1 result. Based on this screenshot, positions 1-3 are now the citations for the AI snapshot and #4 is below it. However, the AI snapshot loads on the client side, so rank tracking tools will need to change their approach.

  95. 161 None of this changes what we do tactically, but it may change what we do strategically.

  96. 162 The Future of Content and Links. Or: how is generative AI going to change all of this for us?
  97. 165 One of Singhal’s Early Innovations was Doc Length Normalization. Google has always had the idea of making sure content length isn’t an overpowering factor. Amit Singhal recognized that longer documents inherently outperform shorter ones in retrieval tasks, so it’s always been a fundamental thing that Google looked at.

  98. 166 Marketers Are Just Copying… People are skipping the step in the Skyscraper Technique wherein they are supposed to create “better” content.

  99. 167 This is What a Lot of

  100. 170 We need to evolve beyond what is basically complex “keyword density.”

  101. 171 Soon, Everyone will be able to Generate Perfectly Optimized Content.

  102. 172 In Fact, Kristin @ Fractl Built It. Kristin built a tool that allows someone to put in a keyword or a topic, and it will generate robust content based on what is currently ranking. https://www.frac.tl/interactives/long-form-article-generator/
  103. 174 This is Where Information Gain Comes Into Play. Conceptually, as it relates to search engines, Information Gain is the measure of how much unique information a given document adds to the ranking set of documents. In other words, what are you talking about that your competitors are not?
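A crude term-level sketch of that idea. Google's patent describes model-based scoring, so treating information gain as a set difference of terms is only an illustration of the concept, with a made-up corpus:

```python
def information_gain_terms(candidate, ranking_set):
    # Rough proxy for information gain: the terms a candidate document
    # covers that no document already in the ranking set mentions.
    covered = set()
    for doc in ranking_set:
        covered |= set(doc.lower().split())
    return set(candidate.lower().split()) - covered

ranking = [
    "espresso uses finely ground coffee",
    "espresso is brewed under pressure",
]
candidate = "espresso extraction time affects bitterness"

print(information_gain_terms(candidate, ranking))
# the terms not already covered: extraction, time, affects, bitterness
```

A candidate that only restates the ranking set returns an empty set, i.e., it adds nothing.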
  104. 175 Google’s Information Gain Patent. Google’s patent indicates that they are specifically scoring for documents that feature net-new information over other documents on the same topic.

  105. 176 Information Gain is Best Driven by Looking Across the Entity Graph. Thus far, there is a very limited set of tools in the SEO space that are specifically looking at entities and their relationships. A non-SEO tool called EntiTree visualizes related entities from Wikidata: https://www.entitree.com/ Using this will give you insights into what entities are being considered for your target entity.

  106. 178 It’s Not Exactly Clear What SEO Tools Are Looking At. These seem to be topics, but are they entities?

  107. 179 Some Tools Are Looking Vertically at a SERP for Term Co-occurrence. While it’s possible that it may yield the same or similar results, tools like this are not looking across relationships of entities.

  108. 180 Other Tools are Mapping Topical Clusters. While this approach captures more breadth as it relates to the topic, it is not the same as reviewing entities.

  109. 182 Review the Features of the Entity and Talk About It In Your Content. Ultimately, the process is the same. Work the discussed entities, their attributes, and related entities into all the relevant places in your content.

  110. 183 If it’s not an entity that Google recognizes, it’s not worth optimizing for.
  111. 184 Get Entity SEO Tools Into the

  112. 185 Quick Tool: Reviewing Entity Salience. I whipped up a quick tool in Colab where you can see how entities are appearing in your own content. You can put in text, upload a file, or select a URL, and compare the usage of entities in your content with your competitors. https://colab.research.google.com/drive/18QXrdAPoKhUl76gGzuxk_vDiUqMeRyqx?usp=sharing
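Not the Colab tool itself, but a crude frequency-based stand-in for entity salience. Real salience models use far richer signals than mention counts, and the entity list and text here are made up:

```python
from collections import Counter

def entity_salience(text, entities):
    # Crude salience proxy: how often each known entity string is mentioned,
    # normalized by total entity mentions. Substring matching is naive on
    # purpose; a real pipeline would do entity recognition first.
    lower = text.lower()
    counts = Counter({e: lower.count(e.lower()) for e in entities})
    total = sum(counts.values())
    return {e: c / total for e, c in counts.items() if c}

text = "Espresso is coffee brewed under pressure. Espresso machines control pressure."
result = entity_salience(text, ["espresso", "coffee", "milk"])
print(result)  # espresso dominates; milk is absent so it is dropped
```

Comparing these distributions for your page and a competitor's page gives a rough view of which entities each one emphasizes.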
  113. 189 If you’re using ChatGPT, you need AIPRM for prompt management. https://www.aiprm.com

  114. 191 If your prompt is just one sentence, don’t be surprised when you get garbage back.

  115. 192 Remember, You Need to Build Around a Content Strategy. Read more about this approach at https://ipullrank.com/generative-ai-content-strategy

  116. 195 We are firmly in a semantic search environment. We need to stop operating from the lexical model.

  117. 196 The Things You Should Do:
     • Don’t use qualitative measures in the places where Google is using quantitative measures
     • Use tools that calculate embeddings
     • Improve the management of your XML sitemaps
     • Leverage generative AI to scale content optimization
     • Build links contextually
     • Start actually using entities
  118. Mike King, Founder / CEO @iPullRank. Thank You | Q&A. [email protected] Download the AI Guide: https://ipullrank.com/ai-seo-guide Use Orbitwise: https://ipullrank.com/tools/orbitwise