
Managing Googlebot's Greed

Slides from my May 2025 talk at Hive MCR in Manchester. I spoke about the technical SEO foundations of Googlebot crawling, some theories on different levels of crawl queues, and tips on maximising crawl efficiency.


Barry Adams

May 19, 2025


Transcript

  1. Barry Adams
     ➢ Active in SEO since 1998
     ➢ Specialist in SEO for News Publishers
     ➢ Newsletter: SEOforGoogleNews.com
     ➢ Co-founder of the News & Editorial SEO Summit
  2. Three Layers of Index Storage
     (Diagram of Google's indexing pipeline, with the stages labelled Crawl Queue, Crawling, Processing, Render Queue, Rendering, Index, and Index Serving)
  3. Three Layers of Index Storage
     1. RAM storage; ➢ Pages that need to be served quickly and frequently ➢ Includes news articles, popular content, high-traffic URLs
     2. SSD storage; ➢ Pages that are regularly served in SERPs but aren’t super popular
     3. HDD storage; ➢ Pages that are rarely (if ever) served in SERPs
  4. Three ‘layers’ of Googlebot?
     (The same pipeline diagram as slide 2, with the Crawl Queue and Crawling stages shown three times)
  5. Three Indices… Three Crawl Queues?
     (Diagram pairing the Priority crawl queue with RAM storage, the Regular crawl queue with SSD storage, and the Legacy crawl queue with HDD storage)
  6. Priority Crawl Queue
     • Crawls VIPs; ➢ Very Important Pages: webpages that have a high change frequency and/or are seen as highly authoritative
       - News website homepages & key section pages
       - Highly volatile classified portals (jobs, properties)
       - Large-volume ecommerce (Amazon, eBay, Etsy)
     • Main purpose = discovery of valuable new content; ➢ i.e., news articles, new product pages, new classified listings
     • Rarely re-crawls newly discovered URLs; ➢ New URLs can become VIPs over time
  7. Regular Crawl Queue
     • Google’s main crawling; ➢ Does most of the heavy lifting
     • Less frantic; ➢ More time for crawl selection, de-duplication, and sanitisation of the crawl queue
     • Recrawls URLs first crawled by the Priority crawl queue; ➢ Checks for changes ➢ Updates relevant signals for next crawl prioritisation
  8. Legacy Crawl Queue
     • Crawls VUPs; ➢ Very Unimportant Pages: URLs that have very little link value and/or are very rarely updated
     • Recrawls URLs that serve 4XX errors; - Likely also occasionally checks old redirects
  9. It’s probably more complicated
     (Diagram: the Priority, Regular, and Legacy crawl queues alongside RAM, SSD, and HDD storage)
  10. Crawl Sources
     • Site crawl
     • Feeds & XML sitemaps
     • Inbound links
     • DNS records
     • Domain registrations
     • Browsing data?
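
     As an illustration of the ‘Feeds & XML sitemaps’ crawl source above, a minimal sitemap might look like the sketch below; the URL and date are placeholders, not from the talk.

         <?xml version="1.0" encoding="UTF-8"?>
         <urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
           <url>
             <!-- One entry per canonical URL you want Googlebot to discover -->
             <loc>https://www.website.com/example-article/</loc>
             <lastmod>2025-05-19</lastmod>
           </url>
         </urlset>
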
  11. URLs are Sacred
     • Search engines don’t crawl, index, or rank pages or content…
     • They crawl, index, and rank URLs.
     • One piece of content = one URL
  12. Don't use robots.txt to temporarily reallocate crawl budget for other pages; use robots.txt to block pages or resources that you don't want Google to crawl at all. Google won't shift this newly available crawl budget to other pages unless Google is already hitting your site's serving limit.
     https://developers.google.com/search/docs/crawling-indexing/large-site-managing-crawl-budget
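
     A minimal sketch of that advice, assuming a hypothetical site with faceted-search paths it never wants crawled (the paths are illustrative, not from the talk):

         # robots.txt on www.website.com (hypothetical example)
         User-agent: *
         # Block URLs you never want Google to crawl at all:
         Disallow: /search/
         Disallow: /filters/
         # Per Google's documentation, the freed-up budget is not reallocated
         # to other pages unless Google is already hitting the serving limit.
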
  13. Robots.txt prevents crawling… but not indexing!
     • Links to blocked URLs are still crawled
     • Their anchor texts carry relevancy for indexing
  14. Crawl Management vs Index Management
     • Canonicals & noindex are NOT crawl management; ➢ Google needs to see meta tags before it can act on them ➢ That means Googlebot still crawls those URLs
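
     To illustrate the distinction, these two standard tags are index-management signals, not crawl-management signals; Googlebot must still fetch the page to see them. The values are illustrative placeholders.

         <!-- In the <head> of a page that may be crawled but should not be indexed: -->
         <meta name="robots" content="noindex">

         <!-- In the <head> of a duplicate page, pointing at the preferred URL: -->
         <link rel="canonical" href="https://www.website.com/preferred-url/">
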
  15. What about ‘rel=nofollow’?
     All the link attributes—sponsored, ugc, and nofollow—are treated as hints about which links to consider or exclude within Search.
     https://developers.google.com/search/blog/2019/09/evolving-nofollow-new-ways-to-identify
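
     For reference, the three link attributes mentioned in that quote look like this in HTML; the URLs are placeholders.

         <!-- Paid or advertising link -->
         <a href="https://example.com/ad" rel="sponsored">Paid placement</a>
         <!-- Link added by users, e.g. in comments or forum posts -->
         <a href="https://example.com/comment-link" rel="ugc">User-generated link</a>
         <!-- Link you do not want to endorse -->
         <a href="https://example.com/untrusted" rel="nofollow">Unendorsed link</a>
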
  16. Crawl Budget Per Hostname
     • Every hostname has its own allocation of crawl budget; ➢ ‘www.website.com’ is crawled independently of ‘cdn.website.com’
     • Offload page resources to a subdomain; ➢ Frees up crawl budget on your main domain
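
     A minimal sketch of the offloading idea, using the hostnames from the slide: static resources are referenced from the subdomain, so fetching them counts against that hostname's crawl budget rather than the main domain's.

         <!-- Page served from www.website.com; resources served from cdn.website.com -->
         <link rel="stylesheet" href="https://cdn.website.com/assets/site.css">
         <script src="https://cdn.website.com/assets/site.js"></script>
         <img src="https://cdn.website.com/images/hero.jpg" alt="Hero image">
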
  17. Sitewide Changes
     • Googlebot detects large-scale changes on a website; ➢ Crawl rate will temporarily be increased ➢ Ensures the changes are rapidly reflected in the index
     • Ensure your hosting can handle the increased crawl rate; ➢ Temporarily increase server capacity after a sitewide change
     • Sitewide changes include: - Redesigns - Site migrations - Large numbers of new URLs
  18. Manage Googlebot’s Greed
     • ALL web resources are crawled by Googlebot; ➢ Not just HTML pages ➢ Reduce the number of HTTP requests per page
     • Link equity (PageRank) impacts crawling; ➢ More link value = more crawling ➢ Elevate key pages to VIPs
     • Each hostname has its own crawl budget; ➢ Offload resources to subdomains
     • AdsBot can use up crawl requests; ➢ Double-check your Google Ads campaigns
     • Serve correct HTTP status codes; ➢ Googlebot will adapt accordingly
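
     As a closing illustration of the last bullet, a small sketch in Python (assuming the third-party requests library) that audits which status codes a list of URLs actually serves; the URLs are placeholders, not from the talk.

         import requests

         # Hypothetical URLs to audit; replace with your own.
         urls = [
             "https://www.website.com/",
             "https://www.website.com/old-page/",
         ]

         for url in urls:
             # HEAD request without following redirects, so we see the raw status
             # code (e.g. 200, 301, 404, 410) that a crawler would receive first.
             response = requests.head(url, allow_redirects=False, timeout=10)
             print(url, response.status_code, response.headers.get("Location", ""))
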