Michael King - Accounting for Gaps in SEO Software

2 Y’all know how I do. Download the slides: https://speakerdeck.com/ipullrank/

3 Salutations! I’m Mike King (@iPullRank)

5 SEO Editor Tools Don’t Make Sense. …and other thoughts

6 6 If You Use a Tool That Does This,
It Does Not Make Sense

7 7 Searchmetrics Content Experience was the only one to
get it right.

8 I made the mistake of giving Marcus Tober a
compliment the other day…

9 9 It Was So Much Better Than Everything Other
SEO Editor

10 10 Phrase Based Indexing + Query Expansion Means Google
Is Considering Content Across the Topic Cluster for Posting Lists

11 This is what (was) happening with AI Overviews AI
Overviews also look at related queries, not just your primary search query. h/t https://richsanger.com/google-ai-overvi ews-do-ranking-studies-tell-the-whole-st ory/

12 12 All the Other Content Editor Tools Look Vertically
at a SERP for Just the Target Keyword and Basically just Do TF-IDF

13 Here’s the Right Way 1. Build a graph of
keywords based on the target keyword 2. Crawl top 10 rankings for all keywords across the graph 3. Extract features for entities and co-occurring terms 4. Compare against your page 5. Optimize Colab: https://colab.research.google.com/drive /1s5bB0vVsTFFTVfJz1ZUiwcKpyK7vsm y8#scrollTo=mDJvmCtTxucv

14 Is Anyone Looking at This? Google is telling you
why results rank based on speciﬁc lexical, semantic, and links.

15 15 Is Anyone Looking at This? Lexical Query expansion
or semantic

16 16 Is there a rank tracking tool that gives
us that data?

18 18 The Fraggles Show What AIO Used for the
AI Snapshot

19 19 Scroll to Text

20 20 Fraggle Relevance Relevance against the chunks to keyword:
Relevance against AI Snapshot:

21 21 Is there a rank tracking tool that gives
us that data?

22 22 The reality is the existence of “Python SEO”
is a function of the failings of SEO software.

23 Let’s break it down to build it back up

24 24 Real quick. Let’s talk about how search engines
work.

25 25 Search Engines Work based on the Vector Space
Model Documents and queries are plotted in multidimensional vector space. The closer a document vector is to a query vector, the more relevant it is.

26 26 TF-IDF Vectors The vectors in the vector space
model were built from TF-IDF. These were simplistic based on the Bag-of-Words model and they did not do much to encapsulate meaning.

27 27 Relevance is a Function of Cosine Similarity When
we talk about relevance, it’s the question of similar is determined by how similar the vectors are between documents and queries. This is a quantitative measure, not the qualitative idea of how we typically think of relevance.

28 It Does Not Have to be Cosine Similarity There
are a lot of ways to compute nearest neighbor distance. Cosine Similarity is the most popular with Euclidean distance being the second, but there are many distance functions to be considered.

29 29 In SEO we still think Google is Here

30 30 The lexical model counts the presence and distribution
of words. Whereas the semantic model captures meaning. This was the huge quantum leap behind Google’s Hummingbird update and most SEO software has been behind for over a decade. Google Shifted from just Lexical to include Semantic a Decade Ago

31 Word2Vec Gave Us Embeddings Word2Vec was an innovation led
by Tomas Mikolov and Jeff Dean that yielded an improvement in natural language understanding by using neural networks to compute word vectors. These were better at capturing meaning. Many follow-on innovations like Sentence2Vec and Doc2Vec would follow.

32 32 Vector Embeddings = Words Converted to Multi-dimensional Coordinates
in Vector Space

33 33 We Went from Sparse Representations to Dense Representations

34 34 This Allows for Mathematical Operations Comparisons of content
and keywords become linear algebraic operations.

35 35 Word2Vec Gave Us Hummingbird

36 36 Google Has Been More Like This Since Hummingbird

37 37 This is a huge problem because most SEO
software still operates on the lexical model.

Check out MarketBrew’s Free Tool to Help

39 39 8 Google Employees Are Responsible for Generative AI
https://www.wired.com/story/eight-google-employees-invented-modern-ai-transformers-paper/

40 40 The Transformer The transformer is a deep learning
model used in natural language processing (NLP) that relies on self-attention mechanisms to process sequences of data simultaneously, improving eﬃciency and understanding in tasks like translation and text generation. Its architecture enables it to capture complex relationships within the text, making it a foundational model for many state-of-the-art NLP applications.

41 41 The Transformer Gave us BERT

42 42 Word2Vec Captured Relationship, but Not Context – BERT
Captures Context

43 43 Since the Introduction of BERT, Google Has Done
Hybrid Retrieval

44 44 Google Has Been Using Public About Hybrid Models
Since 2020 This is why some of the search results feel so weird. A re-ranking of documents with a mix of lexical and semantic. https://arxiv.org/pdf/2010.01195.pdf

45 45 Hybrid retrieval is a big part of why
you often can’t tell why something ranks.

46 46 Semantic Retrieval Has Higher Recall than Lexical, Combining
them has Higher Recall than Both

47 Dense Retrieval You remember “passage ranking?” This is built
on the concept of dense retrieval wherein there are more embeddings representing more of the query and the document to uncover deeper meaning.

48 48 Dense Retrieval is Scoring down to the Sentence
Level

49 49 Introducing Google’s Version of Dense Retrieval Google introduces
the idea of “aspect embeddings” which is series of embeddings that represent the full elements of both the query and the document and give stronger access to deeper information.

50 50 Dense Representations for Entities Google has improved its
entity resolution using embeddings giving them stronger access to information in documents.

51 51 Embeddings = Google really understands content relevance now.

52 Website Representation Vectors Just as there are representations of
pages as embeddings, there are vectors representing websites and Google has recently made improvements in understanding when content is not relevant to a given site.

53 Author Vectors Similarly, Google has Author Vectors wherein they
are able to identify an author and the subject matter that they discuss. This allows them to ﬁngerprint an author and their expertise.

54 54 So, really E-E-A-T is a function of information
associated with vector representations of websites and authors.

55 55 Under the SGE/AIO Model, Google is Structured Liked
This

56 56 Embeddings are one of the most fascinating things
we can leverage as SEOs to catch up to what Google is doing. SEO tools don’t use them.

57 So, It Turns Out Google is All About the
Clicks

58 58 Google’s Algorithms Inner Workings Have Been Put on
Full Display Lately Through a combination of what’s come out of Google’s DOJ antitrust trial and the Google API documentation leak, we have a much clearer picture of how Google actually functions.

59 59 I was the First to Publish on the
Google Leak, but it was a Team Effort

60 60 We Now Have a Much Stronger Understanding of
the Architecture https://searchengineland.com/how-google-search-ranking-works-445141

61 61 These speciﬁcs help us zero in on what
really matters.

62 The Primary Takeaway is the Value of User Behavior
in Organic Search Google’s Navboost system keeps track of user click behavior and uses that to inform what should rank in various contexts.

63 63 Google Has Denied this Many Times

64 64 Many times…

65 65 But it’s Been Conﬁrmed in Pandu Nayak’s Testimony

66 66 And in Google’s documentation for their Cloud Search
services

67 It Looks Like Google is Using Clicks After All…

68 68 I remain adamant that both Google and the
SEO community owes @randﬁsh an apology.

70 70 That’s Also Why Google Was So Mad when
@DejanSEO Did This…

71 71 User Click Data is What Makes Google More
Powerful Than Any Other Search Engine The court opinion in the DoJ Antitrust trial, Google’s leaked documents, and Google’s own internal documentation all support the fact that click behavior is what makes Google perform the way that it does.

72 72 There are only Two Tools that we have
with Clickstream Data Needed Here

73 73 You Can Buy the Data that Powers .Trends

74 74 There are more leaks coming very soon.

75 Hanns Kronenberg Found IR Metrics in Google Hanns found,
tracked, and optimized for: • IRScore - The ﬁnal result of Google’s scoring functions • nav_fraction - Expected CTR • irscore_pre_twiddle - Initial Ascorer value The leak has since been plugged by Google.

77 77 Hanns Ran Various Experiments To Learn How to
Manipulate the Scores

78 78 He Found that Nav Fraction Metric Updates Slowly
in Alignment with What We Know about Navboost and Glue

79 79 Re-Ranking

81 81 Re-Ranking via Learning to Rank

83 83 How Users are Quantiﬁed in Click Models

84 84 User Based Modeling Looks at the Attractiveness of
Results In Addition to the Clicks

85 TF-Ranking One of Googles Learning to Rank Mechanisms Google
has expectations of performance for every position in the SERP. The user behavior signals collected reinforce what should rank and demote what doesn’t perform just like a social media channel. The best way to scale this is by generating highly-relevant content with a strong user experience.

86 86 13 Months of Google Data = 17 Years
of Bing Data (Sorry, Fabrice)

87 87 You Need to Keep People on your Website
to Win at Organic Search

88 Modern SEO Needs UX Baked-in Google has expectations of
performance for every position in the SERP. The user behavior signals collected reinforce what should rank and demote what doesn’t perform just like a social media channel. The best way to scale this is by generating highly-relevant content with a strong user experience.

89 All These Leaks Tell Us That We Need New
Metrics

90 90 Measuring that telemetry is exactly what I want
us to do more of.

91 91 We Should agree the 200 Ranking Factors THing
is Dead Coincidentally, Semrush asked me to help them update this article. It’s so wrong that I told them to start from scratch.

92 Site Embeddings Are Used to Measure How On Topic
a Page is Google is speciﬁcally vectorizing pages and sites and comparing the page embeddings to the site embeddings to see how off-topic the page is. Learn more about embeddings: https://ipullrank.com/content-relevance

93 93 Google’s Vertex AI Has Different Embeddings for Documents
and Queries

94 MixedBread’s Open Source Embeddings are Highly Performant Last week
@dejanseo shared his research on how MixedBread’s embedding models perform better than anything else for his SEO use cases. He also talked about lowering the dimensionality and converting them to binary representations to save space.

95 95 You Can Compute This Too I’m using the
embeddings with cosine similarity and clustering to examine two ways that how pages relate or don’t relate to the site average of the embeddings. Notice how my recent posts on AI related topics for SEO all have high PageSiteSimilarity whereas my post about MozCon from 2011 does not.

96 Check out the Colab This uses Requests, Traﬁlatura, Transformers,
PyTorch, Scikit Learn, and Sentence Transfomrers to compute SiteScore and a dataframe of cosine similarities and cluster based scores for all URLs crawled. https://colab.research.google.com/drive /19PJiFmv8oyjhB-jwzEK9TPlbfK-xB573 You can remove the outliers to improve your site focus score. Add this to your content audits.

97 97 When I Run it on my whole Sitemap,
My Site is Not Very Focused

98 98 Hanns Discoveries and the Leak Have Given Us
a Lot of Data to Consider

99 99 Mark Has a Lot More Data to Share;
He’ll be Sharing it at SERPConf Later This Month

10 0 Mark Has a SHIT LOAD of data Data
from 90 million SERPs to be exact.

10 1 10 1 He Hasn’t Shared Much with Me,
But There are Some Interesting Ones

10 2 10 2 Here Are Some Data Points He
Has for PAAs

10 3 103 What I love most is that these
data points can give us more clarity.

10 4 10 4 MFs Love to Throw Around “Information
Gain” Conceptually, as it relates to search engines, Information Gain is the measure of how much unique information a given document adds to the ranking set of documents. In other words, what are you talking about that your competitors are not?

10 5 10 5 Most SEOs Are Referencing This Patent
This patent actually talks more about re-ranking based on results a user has previously clicked on.

10 6 10 6 Information Gain was Discussed as Early
as Phrase-based Indexing as a function of co-occurrence used for prediction of relevance

10 7 10 7 Information Gain is Described Here a
Probability of Relationship

10 8 10 8 This Could Also be Represented in
the Concept of Consensus

10 9 Here’s How I Calculate Information Gain I calculate
it as a function of embeddings comparisons across the SERP and identifying the unique entities on each page to help inform what to do. Note: This is an old version of the code, we use Mixed Bread embeddings for this now.

110 11 0 This Allows you to Quantify which Pages
Have Less Mutual Information

111 11 1 The Source Type Metric A metric called
sourceType that shows a loose relationship between the where a page is indexed and how valuable it is. For quick background, Google’s index is stratified into tiers where the most important, regularly updated, and accessed content is stored in flash memory. Less important content is stored on solid state drives, and irregularly updated content is stored on standard hard drives. The higher the tier, the more valuable the link. Pages that are considered “fresh” are also considered high quality. Suffice it to say, you want your links to come from pages that either fresh or are otherwise featured in the top tier. Get links from pages that live in the higher tier by modeling a composite score based on data that is available.

11 2 Source Type is a Proxy Metric Using weighed
rankings, traﬃc, URL Rating, and Domain Rating, I build composite metric to estimate where in Google’s tiered index the page may live.

113 11 3 There are Gold Standard Documents There is
no indication of what this means, but the description mentions “human-labeled documents” versus “automatically labeled annotations.” I wonder if this is a function of quality ratings, but Google says quality ratings don’t impact rankings. So, we may never know. 🤔

114 Measure Your Content Against the Quality Rater Guidelines Elias
Dabbas created a python script and tool that uses the Helpful Content Recommendations to show a proof of concept way to analyze your articles. We’d use the Search Quality Rater Guidelines which serve as the Golden Document standard. I’ll be turning this into a golden document metric soon. Code: https://blog.adver.tools/posts/llm-content-evaluation/ Tools: https://adver.tools/llm-content-evaluation/

115 What I’m building is a Python library for computing
as many of the meaningful metrics that Google that is deriving. This will help us expand our understanding of why things rank the way they do. https://github.com/ipullrank/search-telemetry The Search Telemetry Project

116 11 6 Here’s a bonus metric.

11 7 11 7 Pruning and Optimization Work Quite Well
Together

11 8 11 8 Content Decay The web is a
rapidly changing organism. Google always wants the most relevant content, with the best user experience, and most authority. Unless you stay on top of these measures, you will see traﬃc fall off over time. Measuring this content decay is as simple comparing page performance period over period in analytics or GSC. Just knowing content has decayed is not enough to be strategic.

11 9 11 9 It’s not enough to know that
the page has lost traﬃc.

12 0 12 0

12 1 12 1 The Content Potential Rating (CPR).

12 2 12 2 Content Potential Score

12 3 12 3 Interpreting the Content Potential Rating 80
- 100: High Priority for Optimization 60 - 79: Moderate Priority for Optimization 40 - 59: Selective Optimization 20 - 39: Low Priority for Optimization 0 - 19: Minimal Benefit from Optimization If you want quick and dirty, you can prune everything below a 40 that is not driving significant traffic.

12 4 12 4 Combining CPR with pages that lost
traﬃc helps you understand if it’s worth it to optimize.

12 5 12 5 Step 1. Pull the Rankings Data
from Semrush Organic Research > Positions > Export

12 6 12 6 Step 2: Pull the Decaying Content
from GSC Google Search Console is a great source to spot Content Decay by comparing the last three months year over year. Filter for those pages where the Click Difference is negative (smaller than 0) then export.

12 7 12 7 Step 3: Drop them in the
Spreadsheet and Press the Magic Button

12 8 The Output is List of URLs Prioritized by
Action Each URL is marked as Keep, Revise, Kill or Review based on the keyword opportunities available and the effort required to capitalize on them. Sorting the URLs marked as “Revise” by Aggregated SV and CPR will give you the best opportunities ﬁrst.

12 9 12 9 Get your copy of the Content
Pruning Workbook : https://ipullrank.com/cpr-sheet

13 0 13 0 Add this data to your content
audits to make data-driven decisions of what to cut.

13 1 SEO Needs Technical Standards

13 2 13 2

13 3 13 3

13 4 13 4 In SEO Our Outputs are ALL
OVER THE PLACE!!! One provider oddly returns a semi-colon separated CSV via its API. While another provides JSON. The data is the same, but formatted dramatically differently. WHY?!

13 5 135 You can’t easily switch platforms because our
data is not portable

13 6 13 6 All the Link Indices Crawl a
Different Subset of the Web, but there’s no real way to Consolidate or Compare the Data The metrics are also so wildly different and much of what Google is looking at is not accounted for.

13 7 13 7 https://searchengineland.com/seo-tools-industry-devel op-technical-standards-264983

13 8 13 8 The Gateway Speciﬁcation This is a
draft of standards for data portability and open link metrics. https://github.com/ipullrank/gateway Contribute!

13 9 139 Everyone is welcome to contribute. Let’s get
together and make things better.

14 0 Our Data Should be Free and It Should
be Better

14 1 141 I’ve always felt rankings and link data
should be free, but compute and storage cost money.

14 2 14 2

14 3 14 3

14 4 14 4

14 5 Nodes are Hosted on Trusted Machines Each node
is a simple TSR that downloads lists of URLs, crawls and extracts information and phones it home to Majestic.

14 6 14 6 I decompiled it to see how
hard it might be to build. This is the crawl function.

14 7 14 7 This is the phone home function.

14 8 148 What if we replicated that for rankings,
links, and an embeddings index, but used the nodes for storage too?

14 9 We Could Mirror the Spanner Architecture Google uses
many distributed machines as a single machine using the Spanner architecture. We could mirror this idea by building a network of trusted SEOs who run redundant nodes on their machines for crawling and storage.

15 0 150 The OpenSearch Initiative.

15 1 Coming Soon… Although I have been working furiously
on this in Cursor, I decided to drink with y’all instead of ﬁnishing it in time for this.

15 2 152 NYC SEO Week - 4/14 - 4/18
www.seoweek.org

Contact me if you want to get better results from
your SEO: [email protected] Thank You | Q&A Award Winning, #GirlDad Featured by Download the Slides: https://speakerdeck.com/ipullrank Mike King Chief Executive Oﬃcer @iPullRank

15 4 Let’s Talk About RAG

15 5 15 5 The Three Laws of Generative AI
content 1. Generative AI is not the end-all-be-all solution. It is not the replacement for a content strategy or your content team. 2. Generative AI for content creation should be a force multiplier to be utilized to improve workﬂow and augment strategy. 3. You should consider generative AI content for awareness efforts, but continue to leverage subject matter experts for lower funnel content.

15 6 15 6 Retrieval Augmented Generation (Thank you Facebook)

15 7 15 7 After Data Curation, your Chunking Strategy
is the Most Important Part

15 8 15 8

15 9 15 9 Llama Index - RAG - https://www.llamaindex.ai/

16 0 16 0 It’s Not Diﬃcult to Build with
Llama Index sitemap_url = "[SITEMAP URL]" sitemap = adv.sitemap_to_df(sitemap_url) urls_to_crawl = sitemap['loc'].tolist() ... # Make an index from your documents index = VectorStoreIndex.from_documents(documents) # Setup your index for citations query_engine = CitationQueryEngine.from_args( index, # indicate how many document chunks it should return similarity_top_k=5, # here we can control how granular citation sources are, the default is 512 citation_chunk_size=155, ) response = query_engine.query("YOUR PROMPT HERE")

16 2 16 2

16 3 163 Generate content on the component level, not
just full ad hoc documents.

16 4 16 4

16 5 16 5

16 6 16 6 LangChain - Build Agents - https://www.langchain.com/

16 7 16 7 LangFuse - Prompt Management & Observability
tool - https://langfuse.com/

16 8 16 8 Literal AI

16 9 169 You don’t need to code for any
of this.

17 0 17 0

17 1 17 1

17 2 17 2

17 3 173

17 4 174

17 5 175

17 6 17 6 Bubble - No Code Apps -
https://bubble.io/

17 7 Integrate Promptitude with Zapier or Make

17 8 Did you really think I wasn’t going to
talk about AI?

17 9 With AI, I’m giving y’all legos. What you
build is up to you, but I’m going to show things to consider.

18 0 Setting Up LLMs Locally You don’t need ChatGPT
anymore

18 1 18 1

18 2 LLaMa 3.2 is SOTA on Several Benchmarks Facebook’s
open source model is outperforming the best closed-source models on a variety of different evaluation metrics. New open source models pop up weekly that continue to shift the state of the art.

18 3 18 3

18 4 18 4

18 5 18 5

18 6 18 6

18 7 18 7

18 8 188 You can now unlock state of the
art generative AI use cases from your laptop for free.

18 9 Make Sure You Hook It Up To Your
GPU On a Windows machine you’ll need to go to the NVIDIA Control Panel and add the Ollama server application under Manage 3D Settings.

19 0 19 0 Octoparse - Combine a scraper with
Generative AI - https://www.octoparse.ai/

19 1 19 1 Respona - AI-enabed Link Building -
https://www.respona.com

Contact me if you want to get better results from
your SEO: [email protected] Thank You | Q&A Award Winning, #GirlDad Featured by Download the Slides: https://speakerdeck.com/ipullrank Mike King Chief Executive Oﬃcer @iPullRank

Michael King - Accounting for Gaps in SEO Software

Michael King - Accounting for Gaps in SEO Software

More Decks by Tech SEO Connect

Featured

Transcript