
Managing Googlebot's Greed

Slides from my May 2025 talk at Hive MCR in Manchester. I spoke about the technical SEO foundations of Googlebot crawling, some theories on different levels of crawl queues, and tips on maximising crawl efficiency.


Barry Adams

May 19, 2025


Transcript

  1. Barry Adams
     ➢ Active in SEO since 1998
     ➢ Specialist in SEO for News Publishers
     ➢ Newsletter: SEOforGoogleNews.com
     ➢ Co-founder of the News & Editorial SEO Summit
  2. Three Layers of Index Storage
     (Diagram of Google's indexing pipeline, with the stages labelled Crawl Queue, Crawling, Processing, Render Queue, Rendering, Index, and Index Serving)
  3. Three Layers of Index Storage
     1. RAM storage; ➢ Pages that need to be served quickly and frequently ➢ Includes news articles, popular content, high-traffic URLs
     2. SSD storage; ➢ Pages that are regularly served in SERPs but aren’t super popular
     3. HDD storage; ➢ Pages that are rarely (if ever) served in SERPs
  4. Three ‘layers’ of Googlebot?
     (The same pipeline diagram as slide 2, with the Crawl Queue and Crawling stages shown three times)
  5. Three Indices… Three Crawl Queues?
     (Diagram pairing the Priority crawl queue with RAM storage, the Regular crawl queue with SSD storage, and the Legacy crawl queue with HDD storage)
  6. Priority Crawl Queue
     • Crawls VIPs; ➢ Very Important Pages: webpages that have a high change frequency and/or are seen as highly authoritative
       - News website homepages & key section pages
       - Highly volatile classified portals (jobs, properties)
       - Large-volume ecommerce (Amazon, eBay, Etsy)
     • Main purpose = discovery of valuable new content; ➢ i.e., news articles, new product pages, new classified listings
     • Rarely re-crawls newly discovered URLs; ➢ New URLs can become VIPs over time
  7. Regular Crawl Queue
     • Google’s main crawling; ➢ Does most of the heavy lifting
     • Less frantic; ➢ More time for crawl selection, de-duplication, and sanitisation of the crawl queue
     • Recrawls URLs first crawled by the Priority crawl queue; ➢ Checks for changes ➢ Updates relevant signals for next crawl prioritisation
  8. Legacy Crawl Queue
     • Crawls VUPs; ➢ Very Unimportant Pages: URLs that have very little link value and/or are very rarely updated
     • Recrawls URLs that serve 4XX errors; - Likely also occasionally checks old redirects
  9. It’s probably more complicated
     (Diagram: the Priority, Regular, and Legacy crawl queues alongside RAM, SSD, and HDD storage)
  10. Crawl Sources
     • Site crawl
     • Feeds & XML sitemaps
     • Inbound links
     • DNS records
     • Domain registrations
     • Browsing data?
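
     As an illustration of the ‘Feeds & XML sitemaps’ crawl source above, a minimal sitemap might look like the sketch below; the URL and date are placeholders, not from the talk.

         <?xml version="1.0" encoding="UTF-8"?>
         <urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
           <url>
             <!-- One entry per canonical URL you want Googlebot to discover -->
             <loc>https://www.website.com/example-article/</loc>
             <lastmod>2025-05-19</lastmod>
           </url>
         </urlset>
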
  11. URLs are Sacred
     • Search engines don’t crawl, index, or rank pages or content…
     • They crawl, index, and rank URLs.
     • One piece of content = one URL
  12. Don't use robots.txt to temporarily reallocate crawl budget for other pages; use robots.txt to block pages or resources that you don't want Google to crawl at all. Google won't shift this newly available crawl budget to other pages unless Google is already hitting your site's serving limit.
     https://developers.google.com/search/docs/crawling-indexing/large-site-managing-crawl-budget
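
     A minimal sketch of that advice, assuming a hypothetical site with faceted-search paths it never wants crawled (the paths are illustrative, not from the talk):

         # robots.txt on www.website.com (hypothetical example)
         User-agent: *
         # Block URLs you never want Google to crawl at all:
         Disallow: /search/
         Disallow: /filters/
         # Per Google's documentation, the freed-up budget is not reallocated
         # to other pages unless Google is already hitting the serving limit.
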
  13. Robots.txt prevents crawling… but not indexing!
     • Links to blocked URLs are still crawled
     • Their anchor texts carry relevancy for indexing
  14. Crawl Management vs Index Management
     • Canonicals & noindex are NOT crawl management; ➢ Google needs to see meta tags before it can act on them ➢ That means Googlebot still crawls those URLs
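
     To illustrate the distinction, these two standard tags are index-management signals, not crawl-management signals; Googlebot must still fetch the page to see them. The values are illustrative placeholders.

         <!-- In the <head> of a page that may be crawled but should not be indexed: -->
         <meta name="robots" content="noindex">

         <!-- In the <head> of a duplicate page, pointing at the preferred URL: -->
         <link rel="canonical" href="https://www.website.com/preferred-url/">
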
  15. What about ‘rel=nofollow’?
     All the link attributes—sponsored, ugc, and nofollow—are treated as hints about which links to consider or exclude within Search.
     https://developers.google.com/search/blog/2019/09/evolving-nofollow-new-ways-to-identify
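
     For reference, the three link attributes mentioned in that quote look like this in HTML; the URLs are placeholders.

         <!-- Paid or advertising link -->
         <a href="https://example.com/ad" rel="sponsored">Paid placement</a>
         <!-- Link added by users, e.g. in comments or forum posts -->
         <a href="https://example.com/comment-link" rel="ugc">User-generated link</a>
         <!-- Link you do not want to endorse -->
         <a href="https://example.com/untrusted" rel="nofollow">Unendorsed link</a>
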
  16. Crawl Budget Per Hostname
     • Every hostname has its own allocation of crawl budget; ➢ ‘www.website.com’ is crawled independently of ‘cdn.website.com’
     • Offload page resources to a subdomain; ➢ Frees up crawl budget on your main domain
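
     A minimal sketch of the offloading idea, using the hostnames from the slide: static resources are referenced from the subdomain, so fetching them counts against that hostname's crawl budget rather than the main domain's.

         <!-- Page served from www.website.com; resources served from cdn.website.com -->
         <link rel="stylesheet" href="https://cdn.website.com/assets/site.css">
         <script src="https://cdn.website.com/assets/site.js"></script>
         <img src="https://cdn.website.com/images/hero.jpg" alt="Hero image">
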
  17. Sitewide Changes
     • Googlebot detects large-scale changes on a website; ➢ Crawl rate will temporarily be increased ➢ Ensures the changes are rapidly reflected in the index
     • Ensure your hosting can handle the increased crawl rate; ➢ Temporarily increase server capacity after a sitewide change
     • Sitewide changes include: - Redesigns - Site migrations - Large numbers of new URLs
  18. Manage Googlebot’s Greed
     • ALL web resources are crawled by Googlebot; ➢ Not just HTML pages ➢ Reduce the number of HTTP requests per page
     • Link equity (PageRank) impacts crawling; ➢ More link value = more crawling ➢ Elevate key pages to VIPs
     • Each hostname has its own crawl budget; ➢ Offload resources to subdomains
     • AdsBot can use up crawl requests; ➢ Double-check your Google Ads campaigns
     • Serve correct HTTP status codes; ➢ Googlebot will adapt accordingly
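
     As a closing illustration of the last bullet, a small sketch in Python (assuming the third-party requests library) that audits which status codes a list of URLs actually serves; the URLs are placeholders, not from the talk.

         import requests

         # Hypothetical URLs to audit; replace with your own.
         urls = [
             "https://www.website.com/",
             "https://www.website.com/old-page/",
         ]

         for url in urls:
             # HEAD request without following redirects, so we see the raw status
             # code (e.g. 200, 301, 404, 410) that a crawler would receive first.
             response = requests.head(url, allow_redirects=False, timeout=10)
             print(url, response.status_code, response.headers.get("Location", ""))
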