
Technical SEO for Publishing Sites in 2023

Slides from my talk at the 2023 News and Editorial SEO Summit, where I looked at the current state of technical SEO for news and media websites.

Barry Adams

February 21, 2024


Transcript

  1. #NESS23 Advance Warning: “The whole problem with the world is that fools and fanatics are always so certain of themselves and wiser people so full of doubts.” - Bertrand Russell
  2. #NESS23 Three ‘layers’ of Googlebot?
    [Diagram labels: Crawl Queue, Crawling, Processing, Index, Render Queue, Rendering; the Crawl Queue and Crawling stages are each shown three times]
  3. #NESS23 Realtime Crawler
    • Crawls VIPs
      ➢ Very Important Pages: webpages that have a high change frequency and/or are seen as highly authoritative
        - News website homepages & key section pages
    • Main purpose = discovery of valuable new content
      ➢ i.e., news articles
    • Rarely re-crawls newly discovered URLs
      ➢ New URLs can become VIPs over time
  4. #NESS23 Regular Crawler
    • Google’s main crawler
      ➢ Does most of the hard work
      ➢ Probably the crawler that also fetches page resources
  5. #NESS23 Legacy Content Crawler
    • Crawls VUPs
      ➢ Very Unimportant Pages: URLs that have very little link value and/or are very rarely updated
      ➢ Recrawls URLs that serve 4XX errors
        - Likely also occasionally checks old redirects
  6. #NESS23 Key Take-Away:
    • Realtime Crawler crawls your article once
      ➢ It is then passed on to Regular Crawler
      ➢ Regular Crawler will visit the URL several hours later
      ➢ Any changes made after the first crawl are unlikely to be seen until then
      ➢ By then the story is not news anymore – the news cycle has moved on
    • Consequence:
      ➢ You usually get one chance to rank in Google’s news ecosystem
      ➢ Get your SEO right before you click ‘Publish’
    • Possible Exception: LiveBlogPosting articles
  7. #NESS23 Indexing and Rendering
    [Diagram labels: Crawl Queue, Crawling, Processing, Index, Render Queue, Rendering; URL, HTML, Rendered HTML]
  8. #NESS23 Indexing and Rendering
    Rendering takes time, and news doesn’t have time. Indexing is initially with raw HTML only.
    [Diagram labels: Crawl Queue, Crawling, Processing, Index, Render Queue, Rendering; URL, HTML]
  9. #NESS23 Rendering isn’t the only shortcut…
    Google wants publishers to noindex syndicated content. Because Google sucks at identifying duplicate content. At least, it can’t de-duplicate quickly.
  10. #NESS23 Indexing is a multi-layered set of processes
    [Diagram labels: Crawl Queue, Crawling, Processing, Index, Render Queue, Rendering; the Processing stage is shown several times]
  11. #NESS23 Known Indexing Processes
    • HTML Lexer
      ➢ Tokenises HTML
    • Parser
      ➢ Extracts content from HTML for indexing
    • Canonicaliser
      ➢ Determines a URL’s canonical version
    • Deduplicator
      ➢ Reduces the amount of identical content in the index
    • Pageranker
      ➢ Calculates link value (FMA PageRank) for each URL
    • Many, many more…
  12. #NESS23 What about the Index itself?
    [Diagram labels: Crawl Queue, Crawling, Processing, Index, Render Queue, Rendering; the Index stage is shown three times]
  13. #NESS23 Three Crawlers… Three Indices?
    [Diagram labels: Realtime crawler, Regular crawler, Legacy content crawler; RAM storage, SSD storage, HDD storage]
  14. #NESS23 Three Layers of Index Storage
    1. RAM storage
      ➢ Pages that need to be served quickly and frequently
        - Includes news articles but also popular content
    2. SSD storage
      ➢ Pages that are regularly served in SERPs but aren’t super popular
    3. HDD storage
      ➢ Pages that are rarely (if ever) served in SERPs
  15. #NESS23 It’s probably more complicated
    [Diagram labels: Realtime crawler, Regular crawler, Legacy content crawler; RAM storage, SSD storage, HDD storage]
  16. #NESS23 Key Take-Aways:
    1. Make indexing easy for Googlebot
      - Put all your critical content in the HTML source
      - Don’t rely on rendering to load valuable content
    2. There’s no such thing as a duplicate content penalty
      - However, duplicate content on a single site means the site is competing with itself… and that’s stupid.
  17. #NESS23 The Basics - Crawling
    1. Efficient Crawling
      ➢ Server Response Time
        - Aim for 600ms or faster
  18. #NESS23 The Basics - Crawling
    1. Efficient Crawling
      ➢ Server Response Time
      ➢ Clean URLs
        - Never use tracking parameters on internal links
        - Bad: https://www.website.com/news/article-123?recommended=1
        - Good: https://www.website.com/news/article-123
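The clean-URL advice on this slide can be sketched in Python. This is a minimal illustration, not something shown in the talk: the function name `strip_tracking` and the contents of `TRACKING_PARAMS` are assumptions (only `recommended` comes from the slide's example URL), and a real site would maintain its own list.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Hypothetical list of tracking parameters; only "recommended" appears on the slide.
TRACKING_PARAMS = {"recommended", "utm_source", "utm_medium", "utm_campaign"}


def strip_tracking(url: str) -> str:
    """Return the URL with known tracking parameters removed from its query string."""
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key.lower() not in TRACKING_PARAMS
    ]
    # urlunsplit omits the "?" when the remaining query string is empty.
    return urlunsplit(
        (parts.scheme, parts.netloc, parts.path, urlencode(kept), parts.fragment)
    )


print(strip_tracking("https://www.website.com/news/article-123?recommended=1"))
```

Running a helper like this over internal links at template or build time keeps non-tracking parameters (for example `page=2`) intact while dropping the ones that create duplicate crawlable URLs.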
  19. #NESS23 The Basics - Crawling
    1. Efficient Crawling
      ➢ Server Response Time
      ➢ Clean URLs
      ➢ Lightweight pages
        - Page resources consume crawl budget
  20. #NESS23 The Basics - Crawling
    1. Efficient Crawling
      ➢ Server Response Time
      ➢ Clean URLs
      ➢ Lightweight pages
      ➢ Pagination
        - Balance between paginated URLs and crawl waste
  21. #NESS23 The Basics - Crawling
    1. Efficient Crawling
      ➢ Server Response Time
      ➢ Clean URLs
      ➢ Lightweight pages
      ➢ Pagination
      ➢ Correct HTTP status codes
        - https://developers.google.com/search/docs/crawling-indexing/http-network-errors
  22. #NESS23 The Basics - Crawling
    1. Efficient Crawling
      ➢ Server Response Time
      ➢ Clean URLs
      ➢ Lightweight pages
      ➢ Pagination
      ➢ Correct HTTP status codes
      ➢ AdsBot can be unruly
  23. #NESS23 The Basics - Indexing
    2. Effortless indexing
      ➢ Semantic HTML
      ➢ <h1> headlines
        <div class="headline">This is a bad way to code an article headline</div>
        <h1>This is a properly coded article headline</h1>
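One way to audit article templates for the headline markup contrasted on this slide is to parse the raw HTML and collect `<h1>` text. The sketch below uses Python's standard `html.parser`; the class and function names (`HeadlineFinder`, `find_h1_headlines`) are hypothetical, and it only inspects the HTML source as delivered, without rendering.

```python
from html.parser import HTMLParser


class HeadlineFinder(HTMLParser):
    """Collects the text content of every <h1> element in an HTML document."""

    def __init__(self):
        super().__init__()
        self.headlines = []
        self._in_h1 = False
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag == "h1":
            self._in_h1 = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_h1:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "h1" and self._in_h1:
            self.headlines.append("".join(self._buffer).strip())
            self._in_h1 = False


def find_h1_headlines(html: str) -> list:
    """Return the text of all <h1> headlines found in the raw HTML."""
    finder = HeadlineFinder()
    finder.feed(html)
    finder.close()
    return finder.headlines


bad = '<div class="headline">This is a bad way to code an article headline</div>'
good = "<h1>This is a properly coded article headline</h1>"
print(find_h1_headlines(bad))
print(find_h1_headlines(good))
```

An empty result for an article page suggests the headline is marked up with a generic element, as in the slide's `<div class="headline">` example.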
  24. #NESS23 The Basics - Indexing
    2. Effortless indexing
      ➢ Semantic HTML
      ➢ <h1> headlines
      ➢ Clean HTML in <head>
  25. #NESS23 The Basics - Indexing
    2. Effortless indexing
      ➢ Semantic HTML
      ➢ <h1> headlines
      ➢ Clean HTML in <head>
      ➢ Uninterrupted HTML in article <body>
  26. #NESS23 The Basics - Indexing
    2. Effortless indexing
      ➢ Semantic HTML
      ➢ <h1> headlines
      ➢ Clean HTML in <head>
      ➢ Uninterrupted HTML in article <body>
      ➢ Good structured data
        - NewsArticle for articles
        - Person for author pages
        - Keep it lean, don’t over-annotate
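As an illustration of lean structured data, the sketch below builds a small schema.org `NewsArticle` JSON-LD snippet with a nested `Person` author. The choice of properties, the function name `news_article_jsonld`, and the example values are assumptions for illustration; the slide does not prescribe a specific property set.

```python
import json


def news_article_jsonld(headline, url, date_published, author_name, author_url):
    """Build a minimal NewsArticle JSON-LD <script> tag (illustrative property set)."""
    data = {
        "@context": "https://schema.org",
        "@type": "NewsArticle",
        "headline": headline,
        "mainEntityOfPage": url,
        "datePublished": date_published,
        "author": {"@type": "Person", "name": author_name, "url": author_url},
    }
    # Escape "</" so a headline containing "</script>" cannot close the tag early.
    payload = json.dumps(data).replace("</", "<\\/")
    return '<script type="application/ld+json">' + payload + "</script>"


print(
    news_article_jsonld(
        "Example headline",
        "https://www.website.com/news/article-123",
        "2023-10-01T09:00:00+00:00",
        "Jane Doe",
        "https://www.website.com/authors/jane-doe",
    )
)
```

Generating the snippet from the same fields the CMS already stores helps keep the markup consistent with the visible article and avoids over-annotation.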
  27. #NESS23 Google used to be Deterministic
    • Action A leads to ranking B
      ➢ Relatively simple crawling, indexing, and ranking systems
      ➢ Few websites, low competition
      ➢ Fairly predictable
  28. #NESS23 Google today is Probabilistic
    • Action A increases the probability of ranking B
      ➢ Massively complicated systems
      ➢ Intensely competitive web
      ➢ All SEO is geared towards maximising probabilities
        - But… 99% probability still means 1% chance of it not happening
  29. #NESS23 Blocking LLMs
    • Robots.txt Disallow Rules:

      User-agent: CCBot
      Disallow: /

      User-agent: GPTBot
      Disallow: /

      User-agent: ChatGPT-User
      Disallow: /

      User-agent: Google-Extended
      Disallow: /
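Rules like these can be sanity-checked before deployment with Python's standard `urllib.robotparser`. The sketch below parses the slide's rules from a string and checks which user agents may fetch an article URL; the helper name `is_allowed` and the test URL are assumptions, and the standard-library parser only approximates how each real crawler interprets robots.txt.

```python
from urllib.robotparser import RobotFileParser

# The disallow rules from the slide, one group per LLM-related user agent.
ROBOTS_TXT = """\
User-agent: CCBot
Disallow: /

User-agent: GPTBot
Disallow: /

User-agent: ChatGPT-User
Disallow: /

User-agent: Google-Extended
Disallow: /
"""


def is_allowed(user_agent: str, url: str) -> bool:
    """Return True if the given user agent may fetch the URL under ROBOTS_TXT."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(user_agent, url)


article = "https://www.website.com/news/article-123"
for agent in ("CCBot", "GPTBot", "ChatGPT-User", "Google-Extended", "Googlebot"):
    print(agent, is_allowed(agent, article))
```

Because there is no `User-agent: *` group, agents that are not listed (such as Googlebot) remain free to crawl, which is the intent of blocking only the LLM-related crawlers.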
  30. #NESS23 Content Pruning v Topic Authority
    • Your volume of (good) articles on a topic determines your topic authority
    • Topic authority = good visibility for your stories
    • Deleting old content could undermine your topic authority
    • Only delete bad content
      ➢ Age and low traffic are not enough