
AI-Powered Search from SMX Advanced Berlin

Michael King

September 10, 2024

Transcript

  1. 1

  2. 4

  3. 7 Search Engines Work based on the Vector

    Model Documents and queries are plotted in multidimensional vector space. The closer a document vector is to a query vector, the more relevant it is.
  4. 8 TF-IDF Vectors The vectors in the vector space

    model were built from TF-IDF. These were simplistic, based on the bag-of-words model, and did not do much to encapsulate meaning.
  5. 9 Relevance is a Function of Cosine Similarity When

    we talk about relevance, it is determined by how similar the vectors are between documents and queries. This is a quantitative measure, not the qualitative idea of relevance we typically think of.
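To ground the math, here is a minimal sketch of cosine-similarity relevance scoring. The toy 3-dimensional vectors are invented for illustration; real systems compare TF-IDF or embedding vectors with hundreds or thousands of dimensions.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 = same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy vectors standing in for a query and two documents
query = [1.0, 2.0, 0.0]
doc_on_topic = [2.0, 4.0, 0.0]   # same direction as the query
doc_off_topic = [0.0, 0.0, 3.0]  # orthogonal to the query

print(cosine_similarity(query, doc_on_topic))   # ≈ 1.0 (highly relevant)
print(cosine_similarity(query, doc_off_topic))  # 0.0 (not relevant)
```

The closer the score is to 1.0, the closer the document vector is to the query vector, which is exactly the relevance notion described above.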
  6. 11 The lexical model counts the presence and distribution

    of words, whereas the semantic model captures meaning. This was the quantum leap behind Google’s Hummingbird update, and most SEO software has been behind for over a decade. Google Shifted from Lexical to Semantic a Decade Ago
  7. 12 Word2Vec Gave Us Embeddings Word2Vec was an innovation led

    by Tomas Mikolov and Jeff Dean that yielded an improvement in natural language understanding by using neural networks to compute word vectors. These were better at capturing meaning. Many follow-on innovations, like Sentence2Vec and Doc2Vec, came later.
  8. 14 Tomas Mikolov Led the Word2Vec Research Tomas is

    a Czech computer scientist behind many of these natural language understanding innovations.
  9. 15 He was accompanied by the Chuck Norris of Computer

    Science, Jeff Dean. Jeff Dean has been a part of nearly every major innovation that has powered Google Search.
  10. 17 This Allows for Mathematical Operations Comparisons of content

    and keywords become linear algebraic operations.
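A toy illustration of what those linear algebraic operations look like. The 3-dimensional "embeddings" and their values are made up purely to show the arithmetic; real Word2Vec vectors have hundreds of learned dimensions.

```python
import numpy as np

# Toy embeddings; values are invented for illustration only
emb = {
    "king":  np.array([0.9, 0.8, 0.1]),
    "man":   np.array([0.5, 0.1, 0.1]),
    "woman": np.array([0.5, 0.1, 0.9]),
    "queen": np.array([0.9, 0.8, 0.9]),
}

# The classic word-vector analogy: king - man + woman ≈ queen
result = emb["king"] - emb["man"] + emb["woman"]

def cos(a, b):
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Which vocabulary word is closest to the computed vector?
closest = max(emb, key=lambda w: cos(emb[w], result))
print(closest)  # queen
```

Comparing keywords to content works the same way: embed both, then rank by cosine similarity.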
  11. 21 This is a huge problem because most SEO

    software still operates on the lexical model.
  12. 23 8 Google Employees Are Responsible for Generative AI

    https://www.wired.com/story/eight-google-employees-invented-modern-ai-transformers-paper/
  13. 24 The Transformer The transformer is a deep learning

    model used in natural language processing (NLP) that relies on self-attention mechanisms to process sequences of data simultaneously, improving efficiency and understanding in tasks like translation and text generation. Its architecture enables it to capture complex relationships within the text, making it a foundational model for many state-of-the-art NLP applications.
  14. 30 Embeddings are one of the most fascinating things

    we can leverage as SEOs to catch up to what Google is doing.
  15. 31 Site Embeddings Are Used to Measure How On Topic

    a Page Is. Google is specifically vectorizing pages and sites and comparing the page embeddings to the site embeddings to see how off-topic the page is. Learn more about embeddings: https://ipullrank.com/content-relevance
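A rough sketch of the idea: treat the centroid of a site's page embeddings as the "site embedding" and score each page against it. The URLs and vectors below are hypothetical; in practice the embeddings would come from an embedding model or a crawl export.

```python
import numpy as np

# Hypothetical page embeddings for a site (values invented for illustration)
pages = {
    "/guide-to-seo":     np.array([0.90, 0.10, 0.00]),
    "/technical-seo":    np.array([0.80, 0.20, 0.00]),
    "/link-building":    np.array([0.85, 0.15, 0.00]),
    "/celebrity-gossip": np.array([0.00, 0.10, 0.95]),  # off-topic page
}

# A simple stand-in for the "site embedding": the centroid of all pages
site_embedding = np.mean(list(pages.values()), axis=0)

def cos(a, b):
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Score every page against the site embedding
for url, vec in pages.items():
    print(url, round(cos(vec, site_embedding), 3))
# The off-topic page scores markedly lower than the on-topic ones
```

Pages that score far from the site centroid are candidates for the "off-topic" treatment described above.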
  16. 32 Content needs to be more focused We’ve learned definitively

    that Google uses vector embeddings to determine how far off a given page is from the rest of what you talk about. This indicates that it will be challenging to go far into upper funnel content successfully without a structured expansion or without authors who have demonstrated expertise in that subject area. Encourage your authors to cultivate expertise in what they publish across the web and treat their bylines like the gold standard that they are.
  17. 33 Build Topic Clusters Well defined topic clusters can position

    your website and brand as an authority in your space and strengthen your entities in the eyes of Google. A site that focuses on a series of topics that are relevant to each other is going to benefit in rankings. Here are a few tools that can help you design and build your topic clusters systematically. Thruuu https://thruuu.com/keyword-clustering-tool Keyword Insights https://www.keywordinsights.ai/features/keyword-clustering/
  18. 34 Let Screaming Frog Do the Heavy Lifting Generate embeddings

    while you crawl using Screaming Frog SEO Spider. Take the file to Colab and do the following things: keyword-to-landing-page relevance scoring, keyword mapping, link building target identification, redirect mapping, internal link mapping. https://ipullrank.com/vector-embeddings-is-all-you-need You can also work with your language model to combine crawl data with SERP data and do things like information gain calculations.
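As one example from that list, redirect mapping can be sketched as a nearest-neighbor lookup over embeddings. The URLs and 2-dimensional vectors below are hypothetical stand-ins for a real crawl export.

```python
import numpy as np

# Hypothetical embeddings: old URLs to retire, candidate new destinations.
# In practice these would be content embeddings exported from your crawl.
old_pages = {
    "/old/blue-widgets": np.array([0.9, 0.1]),
    "/old/red-widgets":  np.array([0.1, 0.9]),
}
new_pages = {
    "/widgets/blue": np.array([0.95, 0.05]),
    "/widgets/red":  np.array([0.05, 0.95]),
}

def cos(a, b):
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Map each old URL to its most semantically similar new URL
redirect_map = {
    old: max(new_pages, key=lambda new: cos(vec, new_pages[new]))
    for old, vec in old_pages.items()
}
print(redirect_map)
# {'/old/blue-widgets': '/widgets/blue', '/old/red-widgets': '/widgets/red'}
```

The same nearest-neighbor pattern covers keyword mapping and internal link target identification: swap the two dictionaries for keywords and landing pages.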
  19. 35 Leveraging generative AI is a combination of content

    strategy, your unique creative angles, and deep understanding of the technical nuances of a channel.
  20. 36 This is our opportunity to get up from

    the kids’ table.
  21. 41 There’s A Lot of Discussion of ChatGPT Replacing Google

    The same is true for TikTok, Perplexity, Bing’s Copilot, and [insert new genAI search tool here].
  22. 42 40% of People Leaving ChatGPT Go to Google

    My assumption is that many of these people are fact-checking. That is a bad behavior to establish for nearly half of your users. This is also an indication that people are deeply aware of the issues related to hallucinations. In other words, people don’t trust the product.
  23. 43 Yes, TikTok is a (Nascent) Search Engine 41%

    of TikTok users perform searches, but the search volume around a series of broad and meaningful queries is not there to make it more than a small supplement to Google Search.
  24. 44 21% of People Going to TikTok Come from

    Google 24.5% of People Leaving TikTok Go to Google
  25. 48 I wish OpenAI the best with this. It

    will be very difficult to supplant Google as the search engine of record.
  26. 51 Google is still the main event, but we

    are going back into a world where we need to optimize for multiple search engines across a series of channels.
  27. 53 Google’s Algorithms’ Inner Workings Have Been Put on

    Full Display Lately Through a combination of what’s come out of Google’s DOJ antitrust trial and the Google API documentation leak, we have a much clearer picture of how Google actually functions.
  28. 54 I was the First to Publish on the

    Google Leak, but it was a Team Effort
  29. 55 We Now Have a Much Stronger Understanding of

    the Architecture https://searchengineland.com/how-google-search-ranking-works-445141
  30. 57 The Primary Takeaway is the Value of User Behavior

    in Organic Search Google’s Navboost system keeps track of user click behavior and uses that to inform what should rank in various contexts.
  31. 63 I remain adamant that both Google and the

    SEO community owe @randfish an apology.
  32. 66 User Click Data is What Makes Google More

    Powerful Than Any Other Search Engine The court opinion in the DOJ antitrust trial, Google’s leaked documents, and Google’s own internal documentation all support the fact that click behavior is what makes Google perform the way that it does.
  33. 69 Modern SEO Needs UX Baked-in Google has expectations of

    performance for every position in the SERP. The user behavior signals collected reinforce what should rank and demote what doesn’t perform just like a social media channel. The best way to scale this is by generating highly-relevant content with a strong user experience.
  34. 71 Navboost Makes it Clear that Paid and Organic

    both operate from the concept of expected CTR
  35. 72 Larry Kim Has Talked About Expected CTR and

    Quality Score Being a Normalized CTR Effectively, what Larry described many years ago is very similar to how NavBoost works in Organic Search. https://www.marketingprofs.com/articles/2014/25432/four-adwords-mistakes-that-drag-your-ctr-down Google has taken matters into its own hands and is using expected CTR to drive everything via Performance Max.
  36. 73 Google wants us to spend more and the

    way it gets us to do that is by delivering marketers higher ROI, so Google is using generative AI to help you improve.
  37. 74 Marketers are Expecting Disruption of Search Marketing from

    GenAI https://www.emarketer.com/content/google-streamlines-ad-creation-with-new-ai-features-performance-max
  38. 75 Google is pushing everyone back to being strategic

    marketers rather than channel managers.
  39. 76 Google is pumping Generative AI all over PPC

    https://support.google.com/google-ads/answer/14150986?hl=en
  40. 77 Google is pumping Generative AI all over PPC

    - Creative Generation https://blog.google/products/ads-commerce/ai-powered-ads-google-marketing-live/
  41. 78 Performance Max - Search Themes Search themes in

    Performance Max campaigns will have the same prioritization as your phrase match and broad match keywords in Search campaigns. Exact match keywords that are identical to the search queries will continue to be prioritized over search themes and other keywords. Keep in mind that search themes are optional. You’ll also have access to tools like brand exclusions to help control the types of search traffic that Performance Max serves on. https://support.google.com/google-ads/answer/14179631?hl=en
  42. 81 Google Has Ad Units for AI Overviews, But

    No One Is Buying Them! Google has been offering these for a few months, but brands are not interested, likely due to the brand safety issue that they represent.
  43. 83 Baidu’s AiAds Study Suggests This Concept is Here

    to Stay https://arxiv.org/pdf/1907.12118
  44. 84 Researchers Warn How it Can Go Off the Rails

    The study highlights how a lack of data transparency and algorithmic missteps, such as flawed conversion tracking, can lead to long-term negative impacts on ad performance. It shows how AI algorithms can misinterpret data errors, resulting in decreased impressions, clicks, and conversions, ultimately harming advertisers' ROI. https://asistdl.onlinelibrary.wiley.com/doi/full/10.1002/asi.24798
  45. 85 Google’s own tools will be the most powerful

    because only they have the data on what users do within their ecosystem.
  46. 87 NavBoost Performance Starts at the SERP Itself Continually testing

    your metadata is a must. Check out the SearchPilot team’s case studies for ideas: https://www.searchpilot.com/resources/case-studies/tag/meta-data
  47. 88 Design Content for the Human Condition Design your

    content so it is easier to consume and it will yield better performance metrics. https://moz.com/blog/10-super-easy-seo-copywriting-tips-for-link-building
  48. 89 Google is Using Passage Indexing to Try to

    Drop the User Into the Right Spot
  49. 90 Use Logical Chunking To Get Users to the

    Information Faster https://www.nngroup.com/articles/in-page-links-content-navigation/
  50. 94 Build Pages that Are Easy to Parse Create

    semantically relevant content. Build a table of contents. Drop anchor links throughout the page to help Google understand where the user is meant to go.
  51. 95 Stop Leading With This I came for a recipe.

    Not your Grandma’s life story!
  52. 96 Less is More, More or Less It’s time to

    cut out the content madness
  53. 97 You Don’t Need Link Volume, You Need Link

    Quality Indexing Tier Impacts Link Value There is a metric called sourceType that shows a loose relationship between where a page is indexed and how valuable it is. For quick background, Google’s index is stratified into tiers where the most important, regularly updated, and accessed content is stored in flash memory. Less important content is stored on solid state drives, and irregularly updated content is stored on standard hard drives. The higher the tier, the more valuable the link. Pages that are considered “fresh” are also considered high quality. Suffice it to say, you want your links to come from pages that are either fresh or otherwise featured in the top tier. Get links from pages that live in the higher tiers by modeling a composite score based on data that is available.
  54. 99 99 Google Stores Your Content Like the Wayback Machine

    and Uses the Change History Google’s file system is capable of storing versions of pages over time, similar to the Wayback Machine. My understanding is that Google keeps what it has indexed forever. This is one of the reasons you can’t simply redirect a page to an irrelevant target and expect the link equity to flow. The docs reinforce this idea, implying that they keep all the changes they’ve ever seen for the page. You’re not going to get away with things by simply changing your pages once.
  55. 100 Indexing is Also Harder It’s not being talked

    about as much, but indexing has gotten a lot harder since the Helpful Content Update. You’ll see a lot more pages in “Discovered - currently not indexed” and “Crawled - currently not indexed” than you did previously because the bar is higher for what Google deems worth capturing from the web.
  56. 101 Google Wants to Crawl Even Less

    Gary Illyes has indicated that he wants to have Google crawl less. Search quality certainly cannot suffer, so crawling has to get increasingly intelligent.
  57. 102 I Believe This is a Function

    of Information Gain Conceptually, as it relates to search engines, Information Gain is the measure of how much unique information a given document adds to the ranking set of documents. In other words, what are you talking about that your competitors are not?
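A crude way to sketch the concept: which terms does your page cover that none of the ranking competitors mention? Real information gain calculations work on entities and embeddings rather than raw token sets, so treat this only as an illustration.

```python
def unique_terms(your_text, competitor_texts):
    """Terms in your page that appear in none of the competitor pages."""
    yours = set(your_text.lower().split())
    competitors = set()
    for text in competitor_texts:
        competitors |= set(text.lower().split())
    return yours - competitors

# Invented example texts
your_page = "pricing comparison with independent lab test results"
competitors = [
    "pricing comparison and buyer guide",
    "buyer guide with pricing tips",
]

print(unique_terms(your_page, competitors))
# {'independent', 'lab', 'test', 'results'} (set order may vary)
```

The larger (and more meaningful) the leftover set, the more your document adds to the ranking set of documents.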
  58. 104 There are Gold Standard Documents There is

    no indication of what this means, but the description mentions “human-labeled documents” versus “automatically labeled annotations.” I wonder if this is a function of quality ratings, but Google says quality ratings don’t impact rankings. So, we may never know.
  59. 105 Measure Your Content Against the Quality Rater Guidelines Elias

    Dabbas created a Python script and tool that uses the Helpful Content recommendations to show a proof-of-concept way to analyze your articles. We’d use the Search Quality Rater Guidelines, which serve as the gold standard document. Code: https://blog.adver.tools/posts/llm-content-evaluation/ Tools: https://adver.tools/llm-content-evaluation/
  60. 106 In conclusion: “More content” is no longer

    inherently the most effective approach because there’s no guarantee of traffic from Google. Google’s sophistication won’t allow it.
  61. 108 I’m Leaving Y’all with Three Actions

    Today 1. How to Prune Your Content 2. How to Use LLMs to Generate Valuable Content 3. AI Tools to Use for SEO
  62. 111 Aleyda Has a Process Aleyda’s workflow is a

    great place to work through whether your content should be pruned or not. https://www.aleydasolis.com/en/crawling-mondays/how-to-prune-your-website-content-in-an-seo-process-crawlingmondays-16th-episode/
  63. 112 We like to automate to get to

    a Keep / Revise / Kill / (Review) decision.
  64. 113 Content Decay The web is a

    rapidly changing organism. Google always wants the most relevant content, with the best user experience, and the most authority. Unless you stay on top of these measures, you will see traffic fall off over time. Measuring this content decay is as simple as comparing page performance period over period in analytics or GSC. Just knowing content has decayed is not enough to be strategic.
  65. 114 It’s not enough to know that

    the page has lost traffic.
  66. 118 Interpreting the Content Potential Rating 80

    - 100: High Priority for Optimization 60 - 79: Moderate Priority for Optimization 40 - 59: Selective Optimization 20 - 39: Low Priority for Optimization 0 - 19: Minimal Benefit from Optimization If you want quick and dirty, you can prune everything below a 40 that is not driving significant traffic.
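Those bands, plus the quick-and-dirty pruning rule, can be expressed directly in code. The traffic threshold below is an assumption, since "significant traffic" isn't defined on the slide.

```python
def cpr_priority(score):
    """Map a 0-100 Content Potential Rating to the slide's priority bands."""
    if score >= 80:
        return "High Priority for Optimization"
    if score >= 60:
        return "Moderate Priority for Optimization"
    if score >= 40:
        return "Selective Optimization"
    if score >= 20:
        return "Low Priority for Optimization"
    return "Minimal Benefit from Optimization"

def should_prune(score, monthly_clicks, traffic_threshold=100):
    """Quick-and-dirty rule: CPR below 40 and not driving significant traffic.

    traffic_threshold is a hypothetical cutoff; set it to whatever
    "significant traffic" means for your site.
    """
    return score < 40 and monthly_clicks < traffic_threshold

print(cpr_priority(72))      # Moderate Priority for Optimization
print(should_prune(35, 12))  # True
```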
  67. 119 Combining CPR with pages that lost

    traffic helps you understand if it’s worth it to optimize.
  68. 120 Step 1: Pull the Rankings Data

    from Semrush Organic Research > Positions > Export
  69. 121 Step 2: Pull the Decaying Content

    from GSC Google Search Console is a great source to spot Content Decay by comparing the last three months year over year. Filter for those pages where the Click Difference is negative (smaller than 0) then export.
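A minimal pandas sketch of that filter, run on a mocked-up GSC comparison export. The column names here are assumptions; match them to the headers in your actual export.

```python
import pandas as pd

# Mocked-up GSC comparison export (last three months vs. prior year)
df = pd.DataFrame({
    "page": ["/a", "/b", "/c"],
    "clicks_last_3mo": [120, 40, 300],
    "clicks_prior_year": [200, 35, 290],
})

# Click Difference: current period minus comparison period
df["click_difference"] = df["clicks_last_3mo"] - df["clicks_prior_year"]

# Keep only decaying pages (negative difference), worst losses first
decaying = df[df["click_difference"] < 0].sort_values("click_difference")
print(decaying[["page", "click_difference"]])
# /a lost 80 clicks year over year; /b and /c are filtered out
```

Export the filtered frame to CSV and it drops straight into the workbook described in the next step.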
  70. 122 Step 3: Drop them in the

    Spreadsheet and Press the Magic Button
  71. 123 The Output is a List of URLs Prioritized by

    Action Each URL is marked as Keep, Revise, Kill, or Review based on the keyword opportunities available and the effort required to capitalize on them. Sorting the URLs marked “Revise” by aggregated SV and CPR will give you the best opportunities first.
  72. 124 Get your copy of the Content

    Pruning Workbook: https://ipullrank.com/cpr-sheet
  73. 125 How to Kill Content Content may be valuable

    for channels outside of Organic Search. So, killing it is about changing Google’s experience of your website to improve its relevance and reinforce its topical clusters. The best approach is to noindex the pages themselves, nofollow the links pointing to them, and submit an XML sitemap of all the pages that have changed. This will yield the quickest recrawling and reconsideration of the content.
  74. 126 How to Revise Content Review content

    across the topic cluster. Use co-occurring keywords and entities in your content. Add unique perspectives that can’t be found on other ranking pages. Answer common questions. Answer the People Also Ask questions. Restructure your content using headings relevant to the above. Add relevant structured markup. Expand on previous explanations. Add authorship. Update the dates. Make sure the needs of your audiences are accounted for. Add to an XML sitemap of only updated pages.
  75. 127 How to Review Content The sheet marks content

    that has a low content potential rating and a minimum of 500 in monthly search volume as “Review” because they may be long tail opportunities that are valuable to the business. You should take a look at the content you have for that landing page and determine if you think the effort is worthwhile.
  76. 129 Combining a Search Engine with a

    Language Model is called “Retrieval Augmented Generation” Neeva (RIP), Bing, and now Google’s Search Generative Experience all pull documents based on search queries and feed them to a language model to generate a response. This concept was developed by the Facebook AI Research (FAIR) team.
  77. 130 Google’s Initial Version of this is

    called Retrieval-Augmented Language Model Pre-Training (REALM), from 2021. REALM identifies full documents, finds the most relevant passages in each, and returns the single most relevant one for information extraction.
  78. 131 DeepMind followed up with Retrieval-Enhanced Transformer

    (RETRO) DeepMind's RETRO (Retrieval-Enhanced Transformer) is a language model that combines a large text database with a transformer architecture to improve performance and reduce the number of parameters required. RETRO is able to achieve comparable performance to state-of-the-art language models such as GPT-3 and Jurassic-1, while using 25x fewer parameters.
  79. 132 Google’s Later Innovation Retrofit Attribution using Research and

    Revision (RARR) RARR does not generate text from scratch. Instead, it retrieves a set of candidate passages from a corpus and then reranks them to select the best passage for the given task.
  80. 133 AIO is built from REALM/RETRO/RARR +

    PaLM 2 and MUM MUM is the Multitask Unified Model that Google announced in 2021 as a way to do retrieval augmented generation. PaLM 2 is their latest (released) state-of-the-art large language model. The functionality from REALM, RETRO, and RARR is also rolled into this.
  81. 135 Documents are Broken into Chunks and

    the Most Relevant Chunks are Fed to the Language Model to Generate a Response
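A minimal sketch of that chunking step, using an overlapping word-window split of the kind retrieval pipelines commonly use. The chunk size and overlap values are illustrative, not Google's.

```python
def chunk_text(text, chunk_size=50, overlap=10):
    """Split text into overlapping word-window chunks."""
    words = text.split()
    chunks = []
    step = chunk_size - overlap  # slide forward, keeping some overlap
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break
    return chunks

# A synthetic 120-word "document" for demonstration
doc = " ".join(f"word{i}" for i in range(120))
chunks = chunk_text(doc)

print(len(chunks))           # 3 chunks of up to 50 words each
print(chunks[1].split()[0])  # word40 — the overlap preserves context
```

Each chunk would then be embedded and ranked against the query, and only the most relevant chunks are handed to the language model.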
  82. 137 Blocking LLMs is a Mistake. Appearing

    in these places will be recognized as brand awareness opportunities very soon.
  83. Embrace Structured Data There are three models gaining popularity: 1.

    KG-enhanced LLMs - Language Model uses KG during pre-training and inference 2. LLM-augmented KGs - LLMs do reasoning and completion on KG data 3. Synergized LLMs + KGs - Multilayer system using both at the same time https://arxiv.org/pdf/2306.08302.pdf Source: Unifying Large Language Models and Knowledge Graphs: A Roadmap
  84. 139 What is Mitigation for AIO? 1.

    Manage expectations on the impact 2. Understand the keywords under threat 3. Re-prioritize your focus to keywords that are not under threat 4. Optimize the passages for the keywords you want to save
  85. 141 We Can Also Show You Per

    Keyword How You Show Up
  86. 146 Fraggles Relevance Relevance against the chunks

    to keyword: Relevance against AI Snapshot:
  87. 151 The GEO team shared their ChatGPT prompts The

    GEO team also shared the ChatGPT prompts that help them improve their visibility. You can augment them and put them to work right away. https://github.com/GEO-optim/GEO/blob/main/src/geo_functions.py
  88. Check out @GarrettSussman’s post on how to optimize for AI

    Overviews: https://ipullrank.com/optimize-content-for-sge
  89. 154 With AI, I’m giving y’all Legos. What you

    build is up to you, but I’m going to show things to consider.
  90. 157 Llama 3.1 was SOTA like 3 weeks ago

    Facebook’s open source model is outperforming the best closed-source models on a variety of different evaluation metrics. New open source models pop up weekly that continue to shift the state of the art.
  91. 163 You can now unlock state of the

    art generative AI use cases from your laptop for free.
  92. 164 Make Sure You Hook It Up To Your

    GPU On a Windows machine you’ll need to go to the NVIDIA Control Panel and add the Ollama server application under Manage 3D Settings.
  93. 166 The Three Laws of Generative AI

    content 1. Generative AI is not the end-all-be-all solution. It is not the replacement for a content strategy or your content team. 2. Generative AI for content creation should be a force multiplier to be utilized to improve workflow and augment strategy. 3. You should consider generative AI content for awareness efforts, but continue to leverage subject matter experts for lower funnel content.
  94. 167 Think back to 7 Minutes Ago

    - Retrieval Augmented Generation
  95. 168 It’s Not Difficult to Build with

    Llama Index:

    import advertools as adv
    from llama_index.core import VectorStoreIndex
    from llama_index.core.query_engine import CitationQueryEngine

    sitemap_url = "[SITEMAP URL]"
    sitemap = adv.sitemap_to_df(sitemap_url)
    urls_to_crawl = sitemap['loc'].tolist()
    ...
    # Make an index from your documents
    index = VectorStoreIndex.from_documents(documents)
    # Set up your index for citations
    query_engine = CitationQueryEngine.from_args(
        index,
        # indicate how many document chunks it should return
        similarity_top_k=5,
        # here we can control how granular citation sources are; the default is 512
        citation_chunk_size=155,
    )
    response = query_engine.query("YOUR PROMPT HERE")
  96. 172 Generative AI Productivity Use Cases RAG opens up

    a series of generative AI use cases that work well for your situation: briefing & business cases, content analysis, first-pass brand review, first-pass legal review, content first drafts, keyword insertion, structured data generation, link identification & insertion, generating voiceovers, generating images, generating videos, writing code.
  97. 173 @BritneyMuller’s Guide to Using Colab Britney talked about

    how easy it is to use Colab with Python. Now it’s even easier using LLMs. https://github.com/BritneyMuller/colab-notebooks?tab=readme-ov-file
  98. 174 Just describe what you want You can tell

    your language model what you want the code to do and it will handle the rest. If it doesn’t work, just describe what went wrong or paste the error and it will fix it for you. In this example my prompt is: {write python code for colab that takes a csv file of keywords and using bertopic with the chatgpt to compute the natural language topics for each row.}
  99. 189 Prompts You Need To Write ChatGPT is very

    effective at doing the following SEO-related tasks: page title writing, meta description writing, keyword insertion, link insertion. You should use your own prompts for these, though, so you don’t copy other people’s patterns.
  100. 190 Page Titles Feature: Token Count. Hypothesis: There’s

    no hard max page title length indicated in the attributes, so we can test lengths longer than the standard 60-70 characters to determine impact.
  101. 191 Page Title Test Hypothesis: A page title that’s

    longer than the standard best practice will negatively impact rankings for the primary keyword target. Variables: control, short page title, long page title. Metrics: ranking increase.
  102. 192 lastSignificantUpdate - The date of the last

    time Google encountered the page as materially updated. Hypothesis: Making substantial updates to pages regularly yields improved crawl activity and more opportunities to rank better.
  103. 194 Test Structure The goal of this test is

    to determine how much content is considered a “significant update” that yields crawl activity. Create control and variant pages testing the length of added content. We measure the impact on organic traffic in order to capture changes to rankings and/or changes to clickthrough rate.
  104. 202 Thunderbit - Build No-Code AI

    Automation Tools - https://thunderbit.com/
  105. 204 Octoparse - Combine a scraper with

    Generative AI - https://www.octoparse.ai/
  106. 210 What you should know and do

    to win: Google is still the primary show in town. Relevance is a quantitative measure. GenAI works on the same math as search engines. Focus on making your chunks more relevant to rank in GenAI search. Improve UX to drive more long clicks. PPC and SEO are both operating on a threshold of expected CTR. Focus on content your audience wants; prune what they don’t. Use RAG to generate content with AI. Embrace AI tools to improve your workflows and your ability to test.
  107. Contact me if you want to get better results from

    your SEO: [email protected] | Thank You | Q&A Content Pruning Workbook: https://ipullrank.com/cpr-sheet/ Download the slides: https://speakerdeck.com/ipullrank Mike King, Chief Executive Officer, @iPullRank