➢ XML sitemaps
➢ Other sources?
• Crawl queue management;
➢ De-duplication based on URL patterns
➢ Crawl prioritisation & scheduling
• Crawling;
➢ Fetching raw HTML
➢ Crawl ‘politeness’
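A minimal sketch of how a crawl queue might combine the steps above: de-duplication on normalised URL patterns, priority ordering, and a per-host politeness delay. The class, field names and delay value are illustrative assumptions, not Google's actual implementation.

```typescript
// Illustrative crawl queue: de-duplicates by normalised URL pattern,
// orders by priority, and enforces a per-host politeness delay.
interface QueuedUrl {
  url: string;
  priority: number; // higher = fetched sooner (assumed scoring)
}

class CrawlQueue {
  private seenPatterns = new Set<string>();
  private queue: QueuedUrl[] = [];
  private lastFetchPerHost = new Map<string, number>();
  private politenessMs = 1000; // illustrative per-host delay

  // Strip query strings and trailing slashes so near-duplicate
  // URL patterns are only queued once.
  private normalise(url: string): string {
    const u = new URL(url);
    return `${u.hostname}${u.pathname.replace(/\/+$/, '')}`;
  }

  enqueue(url: string, priority: number): void {
    const pattern = this.normalise(url);
    if (this.seenPatterns.has(pattern)) return; // de-duplication
    this.seenPatterns.add(pattern);
    this.queue.push({ url, priority });
    this.queue.sort((a, b) => b.priority - a.priority); // prioritisation
  }

  // Return the next URL whose host is past its politeness window.
  next(now: number = Date.now()): string | undefined {
    for (let i = 0; i < this.queue.length; i++) {
      const host = new URL(this.queue[i].url).hostname;
      const last = this.lastFetchPerHost.get(host) ?? 0;
      if (now - last >= this.politenessMs) {
        this.lastFetchPerHost.set(host, now);
        return this.queue.splice(i, 1)[0].url;
      }
    }
    return undefined; // nothing fetchable yet without being impolite
  }
}
```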
Use robots.txt to block pages or resources that you don't want Google to crawl at all. Google won't shift this newly available crawl budget to other pages unless it is already hitting your site's serving limit.
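For example, a robots.txt rule like the following stops Googlebot from spending fetches on a section you never want crawled; the path here is purely illustrative.

```
User-agent: Googlebot
Disallow: /internal-search/
```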
• Crawl budget covers all fetched resources, not just HTML pages;
➢ Reduce HTTP requests per page
• AdsBot can consume crawl budget;
➢ Double-check your Google Ads campaigns
• Link equity (PageRank) impacts crawl budget;
➢ More link equity = more crawl budget
• Index selection;
➢ De-duping prior to indexing
• Indexing;
➢ First-pass based on HTML
➢ Potential rendering (not guaranteed)
• Index integrity;
➢ Canonicalisation & de-duplication
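A rough illustration of what canonicalisation-style de-duplication involves: collapsing URL variants that point at the same content onto one representative URL. The normalisation rules below are assumptions for illustration, not Google's actual ones.

```typescript
// Illustrative canonicalisation: map URL variants (protocol, case,
// tracking parameters, trailing slash) onto one representative URL,
// so duplicates can be collapsed before indexing.
const TRACKING_PARAMS = ['utm_source', 'utm_medium', 'utm_campaign', 'gclid'];

function canonicalise(rawUrl: string): string {
  const u = new URL(rawUrl);
  u.protocol = 'https:';                 // prefer https
  u.hostname = u.hostname.toLowerCase(); // hosts are case-insensitive
  u.hash = '';                           // fragments aren't indexed separately
  for (const p of TRACKING_PARAMS) u.searchParams.delete(p);
  u.pathname = u.pathname.replace(/\/+$/, '') || '/';
  return u.toString();
}

// Both variants collapse to the same canonical URL:
canonicalise('http://Example.com/page/?utm_source=newsletter');
canonicalise('https://example.com/page');
```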
➢ Global coverage with edge nodes worldwide
➢ Usually also results in faster crawling and better CWV
• You manipulate your CDN-cached pages;
➢ Cloudflare Workers enable a range of functionality
• Googlebot crawls the changed CDN-cached pages;
➢ Your ‘original’ website remains unchanged
➢ Google only sees the changed CDN webpages
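A minimal sketch of the edge-worker idea, assuming Cloudflare Workers and its HTMLRewriter API: the worker fetches the page from the origin/CDN cache, rewrites an element on the fly, and serves the modified HTML from the edge, so Googlebot sees the change while the origin stays untouched. The selector and replacement text are placeholders.

```typescript
// Cloudflare Worker sketch (module syntax): rewrite the page at the edge
// so crawlers see modified HTML while the origin website is unchanged.
export default {
  async fetch(request: Request): Promise<Response> {
    const originResponse = await fetch(request); // pass through to origin/CDN cache

    // Only transform HTML responses; leave other assets untouched.
    const contentType = originResponse.headers.get('content-type') ?? '';
    if (!contentType.includes('text/html')) return originResponse;

    return new HTMLRewriter()
      .on('title', {
        element(el) {
          el.setInnerContent('Optimised title served from the edge'); // illustrative change
        },
      })
      .transform(originResponse);
  },
};
```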
• No waiting in lengthy queues;
➢ ‘Ask forgiveness, not permission’
• No CMS constraints;
➢ Change pages directly regardless of your CMS capabilities
• Testing;
➢ Perform narrow tests on specific site sections
➢ A/B testing for SEO
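SEO A/B tests are typically split by URL rather than by visitor, so each page consistently serves either the variant or the control, including to Googlebot. A rough sketch of deterministic URL bucketing, with the section prefix and split ratio as illustrative assumptions:

```typescript
// Deterministically assign URLs within a test section to variant or control,
// so every request for the same URL (including Googlebot's) gets the same version.
const TEST_SECTION = '/blog/'; // illustrative: limit the test to one site section
const VARIANT_SHARE = 0.5;     // illustrative 50/50 split

function hashPath(path: string): number {
  let h = 0;
  for (const ch of path) {
    h = (h * 31 + ch.charCodeAt(0)) >>> 0; // simple 32-bit rolling hash
  }
  return h;
}

function isInVariant(url: string): boolean {
  const path = new URL(url).pathname;
  if (!path.startsWith(TEST_SECTION)) return false; // keep the test narrow
  return (hashPath(path) % 1000) / 1000 < VARIANT_SHARE;
}

// Example: the same URL always lands in the same bucket.
isInVariant('https://example.com/blog/some-post');
```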