The AI Tech SEO Compendium: Augmenting technical SEO tasks using ML & AI - SMX Munich 2024

Bastian Grimm, Peak Ace AG | @basgr

What's the most tedious task you’d like to automate?

Start small, validate initial ideas & concepts, then strategically scale
up where feasible.

4 peakace.agency Winning with proven AI use cases (in marketing)
▪ Retrieval-Augmented Generation: up to 95% accuracy automating answers
▪ Summarisation: up to 40% productivity gains in front and back office functions
▪ Content Generation: up to 40% cost savings in content creation
▪ Named Entity Recognition: up to 90% reduction of text reading and analysis work
▪ Insight Extraction: up to 80% faster in processing data
▪ Classification: up to 30% cycle time reduction in customer service support
Source: https://pa.ag/4bLPplv

This deck is fairly technical… (so don’t complain afterwards!)
https://pa.ag/smx24

Before we dive straight in: a common level of knowledge on some AI topics

What are Large Language Models (LLMs)?

Simply put: Large Language Models (LLMs) are AI systems trained on vast
datasets (thus “large”) to understand, predict and generate data using transformer-based neural networks.

A comprehensive overview of Large Language Models
And these are just some of the “bigger” noteworthy LLMs released up to the end of 2023:
Source: https://pa.ag/4cdB55B

What are LLMs good at?

▪ Information Retrieval and Analysis: LLMs can sift through large volumes of text data to extract relevant information, summarise key points, and answer questions, making them valuable for research, data analysis, and decision-making support.
▪ Personalised Recommendations: LLMs can analyse user preferences and behaviour to provide personalised recommendations, such as articles or products, thus enhancing UX and engagement.
▪ Natural Language Processing: LLMs excel in understanding language, making them ideal for applications such as chat bots, language translation, sentiment analysis, and text summarisation.

What are LLMs NOT good at?

▪ Understanding Context Beyond Training Data: LLMs may not perform well in situations requiring an understanding of context or knowledge beyond their original training data set.
▪ Making Ethical or Moral Judgments: LLMs lack the ability to make ethical or moral judgments and should not be used in situations where such considerations are crucial. Most LLMs’ decisions are also biased.
▪ Limited Understanding and Reasoning: LLMs can't form a chain of logical conclusions; instead, they follow probability rules. Even if the most common answer to a question is irrational or outright wrong, an LLM will still provide that answer.

LLMs are also not good at creating original content
LLMs don’t “write” anything. They generate text based on probabilities learned during training, drawing on content they've encountered before.

LLM Deep Dive

The “most popular” available LLM (interface) right now:
Source: https://pa.ag/3AsVkun

Who is NOT using ChatGPT Plus?

Don’t you worry… I'm not going to bother you with more prompts
for ChatGPT and how to speed up your SEO. I mean, everyone's doing that by now anyway, right?

There are tons of commercial AI solutions available
From ChatGPT and Azure AI to NVIDIA’s AI Platform and IBM watsonx – everyone in big tech is offering “something”:

What about our friends over at Google?

How Google’s February ’24 went – in a nutshell
Source: https://pa.ag/3uQ9dU1

ICYMI: OpenAI announced Sora
Sora is currently only accessible to red team members (experts in areas such as misinformation, hateful content, and bias) who examine critical areas for potential problems or risks. However, the preview is quite impressive.
The excitement from the press has been reminiscent of the buzz surrounding the image creator DALL-E or ChatGPT: Sora is described as “eye-popping,” “world-changing,” and “breath-taking, yet terrifying.”
Sources: https://pa.ag/3IcBJm3 & https://pa.ag/4a7V2cb & https://pa.ag/3V1V2pw

Back to Google: say goodbye to Bard and hello to Gemini – Google’s AI chat bot gets a new name

There's more! Gemini is a family of multimodal LLMs developed by Google DeepMind
▪ Multimodality: input/output using multiple formats (e.g., text, audio, video, gestures, etc.)
▪ Reinforcement learning: drastically reduce hallucinations
▪ 3rd party integrations: high efficiency when using external tools and API integrations
▪ Memory capabilities: build and expand the knowledge bank while the model learns

Unsurprising to see this after the Hugging Face deal?
Gemma is a family of lightweight, state-of-the-art open models built from the same research and technology Google used to create the Gemini models:
Sources: https://pa.ag/3T9q0cK & https://pa.ag/4akEihZ

Also, a variety of (free) open-source models are available
Hugging Face’s Open LLM Leaderboard aims to track, rank and evaluate open-source LLMs on different benchmarks:
Source: https://pa.ag/3L2qUEV

The beauty of these? Despite not being quite as powerful (yet), they are available
to download, customise and self-host.

But where to start, and which LLM to use?
From LLaMa 2, Falcon to Dolly 2.0 and MPT or Bloom – the choice is yours (yep, I know… overwhelming much?)
▪ LLaMa 2: A well-performing open source LLM (with license for commercial use) that encompasses pre-trained and fine-tuned generative text models with 7 to 70 billion parameters.
▪ Vicuna & Alpaca: Use the LLaMa model as a basis and (like Google’s Bard and OpenAI’s ChatGPT) are fine-tuned to follow instructions. Vicuna's authors report roughly 90% of ChatGPT's quality in an evaluation judged by GPT-4.
▪ Falcon LLM: Can be used with chat bots to generate text, solve complex problems and reduce and automate repetitive tasks. Falcon 7B & 40B are available as raw models for fine-tuning.
Source: https://pa.ag/3Td5ucz

How to host and run your private LLM?
Easy… let’s just ask ChatGPT how to do it, shall we?

LM Studio: Discover, download, and run local LLMs
Run an AI on your desktop using locally installed open-source Large Language Models (LLMs) for free!
With LM Studio, you can...
▪ Run LLMs on your laptop, even while offline (Win, Mac & Linux)
▪ Use models through the in-app UI or an OpenAI-compatible local server
▪ Download any compatible model files from Hugging Face repositories
▪ Discover new LLMs in the app
Source: https://pa.ag/3UW0Dh7

My favourite: Ollama – get LLMs up and running, locally
Command line only. Use the PageAssist Chrome plug-in (a web UI for local LLMs) to control Ollama, including model pulls, configuration, and running LLM dialogues/chats.
Pro tip: Ollama runs at 127.0.0.1:11434 by default (and offers APIs as well)
Sources: https://pa.ag/48A07se & https://pa.ag/48xAnNn
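A minimal sketch of talking to that local API from Python, assuming Ollama is running on the default address and a model has already been pulled. The model name `llama2` and the helper names are placeholders, not something the slides prescribe.

```python
import json
import urllib.request

OLLAMA_URL = "http://127.0.0.1:11434/api/generate"  # default local address


def build_request(prompt: str, model: str = "llama2") -> urllib.request.Request:
    """Build a non-streaming generate request (model name is a placeholder)."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(prompt: str, model: str = "llama2") -> str:
    """Send the prompt to the local Ollama server and return the answer text."""
    with urllib.request.urlopen(build_request(prompt, model)) as resp:
        return json.loads(resp.read())["response"]
```

Calling `ask("Suggest a title tag for a page about running shoes")` only works while the local server is up; `build_request` can be inspected without it.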

Want to try for yourself, but you’re not a developer?
Solutions such as stack or LLMStack offer no-code DIY approaches by connecting and combining a variety of data sources through APIs and other endpoints including LLMs.
Sources: https://pa.ag/3wu1UlA & https://pa.ag/3UXzY3K

Peak Ace’s current favourites: balancing speed & scalability
A small selection of platforms that we feel are most convenient to start with. If you’d like to chat about it – come meet the Peak Ace team outside at our SMX booth!
More complex, but worth checking out if you’re into this stuff: Hugging Face LLM Inference Container for Amazon SageMaker

Keep in mind: There are risks that need to be managed
(Obviously, this is true for both commercial and open-source models)
▪ Consent: Ensuring training data was gathered with accountability, meaning it follows AI governance processes (compliant w/ laws & regulations)
▪ Security: Security problems can include data leaks or cyber criminals using the LLM for a variety of malicious tasks
▪ Bias: Happens when the data source is not diverse or representative enough
▪ Hallucinations: Can result from the LLM being trained on incomplete, contradictory, or inaccurate data, or from predictions in general
Source: https://pa.ag/3Td5ucz

Will hallucinations ever disappear?
"It’s inherent in the mismatch between the technology and the proposed use cases," says Emily Bender, professor in the Department of Linguistics and director of the Computational Linguistics Laboratory at the University of Washington.
LLMs are designed to predict the next word – of course there will be cases where the model is wrong.
Source: https://pa.ag/3PqP0Mh

This can REALLY go wrong…
“Life or Death: AI-generated mushroom foraging books are all over Amazon”. Experts are worried that books produced by ChatGPT […], which target beginner foragers, could end up killing someone.
(Image: Terry Pratchett)
Source: https://pa.ag/3MMah0X

We also need to talk RAG: Retrieval-Augmented Generation

So, what’s the deal? Simply put, RAG integrates LLMs with external databases or APIs,
thus enabling real-time information retrieval for up-to-date and more accurate responses.

RAG “fixes” the issue with outdated information due to training
data cut-off.

The conceptual flow of using RAG with an LLM
RAG can be used to enhance the accuracy and reliability of gen AI models with facts fetched from external sources.
1. Prompt + Query
2. Query: search relevant information in the knowledge sources
3. Relevant information for enhanced context
4. Prompt + Query + Enhanced context, sent to the Large Language Model endpoint
5. Generated text response
Source: https://pa.ag/3STryHY
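The flow above can be sketched in a few lines of Python. This is a toy illustration under stated assumptions: the knowledge source is an in-memory list, retrieval is plain word overlap rather than embeddings, and the final call to an LLM endpoint is left out. All names are hypothetical.

```python
import re
from collections import Counter

KNOWLEDGE = [  # stand-in for an external knowledge source
    "Redirects from old URLs should point to the closest matching new page.",
    "Canonical tags tell search engines which version is preferred.",
]


def tokens(text: str) -> Counter:
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(query: str, docs: list, k: int = 1) -> list:
    """Steps 2-3: rank documents by word overlap with the query."""
    q = tokens(query)
    ranked = sorted(docs, key=lambda d: sum((tokens(d) & q).values()), reverse=True)
    return ranked[:k]


def build_prompt(prompt: str, query: str, docs: list) -> str:
    """Step 4: prompt + query + enhanced context, ready for an LLM endpoint."""
    context = "\n".join(f"- {d}" for d in retrieve(query, docs))
    return f"{prompt}\n\nContext:\n{context}\n\nQuestion: {query}"
```

In a real setup, `retrieve` would typically query a vector database, and the string returned by `build_prompt` would be sent to the model of choice (step 5).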

Why is RAG better/more efficient than other approaches?
Because it can handle noisy or irrelevant information, refrain from answering when there is insufficient knowledge and integrate with a variety of different sources simultaneously.
Source: https://pa.ag/3TdUgoc

The Self-RAG framework enhances LLM quality & factuality
Self-RAG improves the output quality of LLMs by integrating retrieval, generation, and self-critique mechanisms. Its approach is to selectively retrieve relevant information and critique both the retrieved content and its own outputs, offering a more refined performance across various tasks compared to existing models.
Source: https://pa.ag/3QP6MZ4

Some real-world RAG use cases we’ve built in recent months
Some of the most common cases we’ve seen and worked on for our clients over the last months:
▪ Chatbot: Use RAG to incorporate LLMs into Q&A chat bots, allowing for more accurate answers based on data from company documents.
▪ Knowledge engine: Ask questions based on your data to provide context for LLMs and greatly increase the quality and accuracy of answers.
▪ Search augmentation: Incorporate LLMs into onsite search (engines) and augment the results with LLM-generated answers/content, leading to higher quality results.

If you tie it all together (tools, databases, APIs, models,
etc.) you can build some REALLY cool stuff!

Let’s talk about some typical tech SEO challenges: redirects

Redirect mapping at scale is a tedious, error-prone task
But mainly because you’re doing it wrong…

Collect URL inventory with a crawling tool (…or any other crawler of choice)…
... and then somehow (usually manually) align it with the new target structure (depending on the respective content) and generate the redirect mapping file from this.
Creating 1-to-1 redirect mappings for old content is often done in Excel (…or Google Sheets). Then attempts are made to manually assign titles, headings or URLs.

For example, outdated categories are often "blindly" redirected to parent
categories or even the homepage.

Yeah, horrifying indeed…

AI makes this process significantly faster and more efficient

Embeddings and vector database = redirect win
Necessary steps for better automated redirects (and an improved customer journey):
▪ Extract main content of every (old) site/URL
▪ Generate embeddings
▪ Save together with metadata in vector database
▪ Semantic search in vector DB based on embeddings of old URLs

And our SEOs be like...

I got 99 problems but AI ain’t one…! (at least for now)
Grab one outside, expo hall, booth #1 (ground floor) – see you there!

What are (word) embeddings?
Word embeddings are numerical vectors representing words, capturing their meanings
and relationships in a multidimensional space.

You can convert any word into a vector and start
calculating with them: "king" minus "man" plus "woman" equals "queen". Synonyms and more can also be found this way.
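To make the vector arithmetic concrete, here is a toy sketch. The three-dimensional vectors are hand-made for illustration (roughly: royalty, male, female); real embeddings are learned from data and have hundreds of dimensions.

```python
import math

# Hand-made toy vectors, NOT real embeddings (dimensions: royalty, male, female).
VECTORS = {
    "king": [0.9, 0.8, 0.1],
    "queen": [0.9, 0.1, 0.8],
    "man": [0.1, 0.9, 0.1],
    "woman": [0.1, 0.1, 0.9],
}


def cosine(a, b):
    """Cosine similarity: 1.0 means both vectors point in the same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def analogy(a, b, c):
    """Return the word whose vector is closest to vec(a) - vec(b) + vec(c)."""
    target = [x - y + z for x, y, z in zip(VECTORS[a], VECTORS[b], VECTORS[c])]
    return max(VECTORS, key=lambda w: cosine(target, VECTORS[w]))


print(analogy("king", "man", "woman"))  # prints "queen" for these toy vectors
```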

What about vector databases?
A vector DB utilises data embeddings as an index, facilitating fast
and scalable searches among unstructured data points, enhancing efficiency in retrieving similar items or information.

Simply put: a vector DB allows you to find matches between anything
and anything (e.g., use an image as a query to find similar pieces of text, video, other images, etc.).

Putting it all together: a quick, step-by-step overview

Extracting the main content of every old URL
▪ Extract: <title> + main content
▪ Combine: <title>, <h1>, <h2>s and first & last sentence of each paragraph (<p>)
Content = Title + h1 + h2s + …
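A rough sketch of this extraction step using only Python's standard library. It assumes reasonably clean HTML and does not try to strip navigation or footer boilerplate; the class and function names are made up for this example.

```python
import re
from html.parser import HTMLParser


class MainContentParser(HTMLParser):
    """Collect text from <title>, <h1>, <h2> and <p> elements."""

    WANTED = {"title", "h1", "h2", "p"}

    def __init__(self):
        super().__init__()
        self.current = None
        self.parts = []  # (tag, text) tuples in document order

    def handle_starttag(self, tag, attrs):
        if tag in self.WANTED:
            self.current = tag
            self.parts.append((tag, ""))

    def handle_endtag(self, tag):
        if tag == self.current:
            self.current = None

    def handle_data(self, data):
        if self.current:
            tag, text = self.parts[-1]
            self.parts[-1] = (tag, text + data)


def first_and_last_sentence(paragraph: str) -> str:
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", paragraph.strip()) if s]
    if len(sentences) <= 2:
        return " ".join(sentences)
    return f"{sentences[0]} {sentences[-1]}"


def extract_content(html: str) -> str:
    """Combine title, headings and the first & last sentence of each paragraph."""
    parser = MainContentParser()
    parser.feed(html)
    pieces = []
    for tag, text in parser.parts:
        text = " ".join(text.split())  # normalise whitespace
        pieces.append(first_and_last_sentence(text) if tag == "p" else text)
    return " ".join(p for p in pieces if p)
```

The resulting string per URL is what would be passed on to the embedding step.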

Generate embeddings and store in vector database
For each website URL:
▪ Transfer previously generated content to vector DB
▪ Generate embeddings (BERT, GloVe, FastText)
▪ Save embeddings in a vector DB incl. metadata (URL, title, etc.)
[Diagram: each page's content is mapped to an embedding vector, e.g. 0.03 … 0.19, -0.21 … 0.03, 0.08 … -0.15]

Search the vector database for the best semantic match
For every outdated page (future 404):
▪ Vector-based semantic search for KNN (k-nearest neighbour)
▪ Set 301 to NN URL
▪ No more weak redirects
▪ Play with certainty/temperature settings
Example query, using the outdated page's embedding (0.31 … -0.41) as the search vector:
{ Get { Article(limit: 1, nearVector: { vector: [embedding], certainty: 0.8 }) { url } } }
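The same nearest-neighbour lookup can be sketched without a database. This is a brute-force illustration on toy vectors; `min_similarity` is a stand-in for the certainty threshold, and the exact scale of such a threshold differs between vector databases.

```python
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def best_redirect(old_vec, new_pages, min_similarity=0.8):
    """Return (url, score) of the nearest new page, or None if below threshold.

    new_pages maps each new URL to its embedding vector.
    """
    url, vec = max(new_pages.items(), key=lambda item: cosine(old_vec, item[1]))
    score = cosine(old_vec, vec)
    return (url, score) if score >= min_similarity else None
```

Returning `None` for weak matches leaves those URLs for manual review instead of forcing a poor 301.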

Down the rabbit hole…

There are A LOT of different ways of doing this…

State-of-the-art sentence embeddings are the gold standard
The Levenshtein distance (basic fuzzy matching) provides an alternative, as we’re mainly dealing with small text snippets and minimal deviations between URL versions. The more substantial the changes between two versions, the higher the likelihood that you’ll reap significant benefits from leveraging sentence transformers.
h/t Will Nye for the data set
Source: https://pa.ag/49RHG3y
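For reference, the Levenshtein distance counts the minimum number of single-character insertions, deletions and substitutions needed to turn one string into another. A compact sketch, with `closest_url` as a hypothetical helper for URL matching:

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character edits turning a into b."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # deletion
                current[j - 1] + 1,            # insertion
                previous[j - 1] + (ca != cb),  # substitution (free if equal)
            ))
        previous = current
    return previous[-1]


def closest_url(old_url: str, new_urls: list) -> str:
    """Pick the new URL with the smallest edit distance to the old one."""
    return min(new_urls, key=lambda u: levenshtein(old_url, u))
```

This works well when old and new URLs differ by a trailing slash or a renamed folder, and poorly when the wording has changed entirely.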

Rule of thumb: calculating similarity scores across multiple elements and selecting the best
matches always works best.

Rule of thumb: matching on a single element always performs worse, regardless
of the approach you choose.
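A small sketch of scoring across several elements, here using `difflib` from the standard library as the similarity measure. The field names and the unweighted average are assumptions for illustration; any similarity function (including embeddings) could be dropped in.

```python
from difflib import SequenceMatcher

FIELDS = ("url", "title", "h1")  # hypothetical field names


def similarity(a: str, b: str) -> float:
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def combined_score(old: dict, new: dict) -> float:
    """Average the per-element similarity instead of trusting one element."""
    return sum(similarity(old[f], new[f]) for f in FIELDS) / len(FIELDS)


def best_match(old: dict, candidates: list) -> dict:
    return max(candidates, key=lambda new: combined_score(old, new))
```

When the URL structure has changed, the URL alone can point to the wrong page, while title and heading pull the combined score towards the right one.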

BTW: before you try any of this… pattern match first!
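A sketch of what pattern matching first could look like: a handful of regex rules handle predictable URL changes, and only the leftovers go on to semantic matching. The rules and paths below are invented examples.

```python
import re

# Hypothetical rules: (pattern on the old path, template for the new path).
RULES = [
    (re.compile(r"^/blog/\d{4}/\d{2}/(?P<slug>[^/]+)/?$"), r"/magazine/\g<slug>/"),
    (re.compile(r"^/products/(?P<id>\d+)\.html$"), r"/p/\g<id>/"),
]


def pattern_redirect(old_path: str):
    """Return the new path if a rule matches, else None (fall back to AI matching)."""
    for pattern, template in RULES:
        match = pattern.match(old_path)
        if match:
            return match.expand(template)
    return None
```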

Whatever you choose… you need solid QA afterwards!

Don’t forget about input quality: garbage in = garbage out!

As with most things, AI can boost efficiency, but it
isn't a complete replacement for a human.

Who loves Screaming Frog?

Facebook AI Similarity Search (FAISS): analyse page contents and automatically create redirect maps based on
two (old vs new) SF crawls.

Automated redirect matchmaker for site migrations
Fantastic script by Daniel Emery utilising two SF crawls (origin + destination.csv with titles, metas, URLs and headings) to perform a fast semantic search (using sentence transformers) and create a redirect map.
FAISS is an outstanding library designed for the fast retrieval of nearest neighbours in high-dimensional spaces. It enables quick semantic nearest neighbour searches even on a large scale.
Sources: https://pa.ag/4bWAgxy & https://pa.ag/3USteUJ

Significant time savings: not 100% perfect, but ~90% accurate/sensible matches are perfectly realistic.

This doesn’t only work for redirects… You can use the same approach, e.g., for much better
internal linking as well as reverse content gap analysis.

Be smart with your redirects: put them on the edge

Wait, what?

Cloudflare Workers to execute redirects on CDN/edge level
I already spoke about using CF Workers for a variety of technical SEO tasks including redirects at SMX Advanced in Berlin back in 2021. Looking to dive deeper? Make sure to grab a copy of the deck.
Pro tip: this rarely requires dev resources; either you can do it yourself, or sys ops (less busy) can.
Source: https://pa.ag/4bSxauE


Naturally, Cloudflare is all in on AI as well…
Build and deploy AI applications to CF's global network: all it takes is a few lines of code with Workers AI to run an AI task using the Workers framework (or any other stack via API):
Source: https://pa.ag/3IgVBV6

Workers AI – an AI inference as a service platform
Empowering developers to run well-known AI models with just a few lines of code on serverless GPUs, all on CF's trusted global network.
TL;DR: using the LLM of choice without having to worry about hosting, deployment, scale, …
Source: https://pa.ag/3Tgqlfh

But it doesn’t stop there: meet Vectorize
Use Vectorize to power e.g., semantic search, etc. directly with Workers, improve accuracy and context of answers from LLMs, and/or bring-your-own embeddings from other platforms, including OpenAI and Cohere:
Sources: https://pa.ag/49Rys7u & https://pa.ag/3wq2AIr

You do realise I just solved all your implementation problems!?


Automating SEO tasks & workflows with Custom GPTs

What are Custom GPTs (for ChatGPT)?
Custom GPTs are a way to create tailored, custom versions
of ChatGPT that combine instructions, extra knowledge, and any combination of skills.

A Custom GPT in its simplest form:
Using Peak Ace’s Structured Data GPT to debug and fix errors in JSON-LD mark-up

Who here has already created their own Custom GPT?

Unveiling Peak Ace’s GPT Suite
▪ SEO Writing Assistant: For keyword analysis, SEO content checks, readability assessments, competitor analyses, multilingual support, mobile optimisation, and more: https://pa.ag/seo-writing-assistant
▪ Outreach Hero: For crafting unique email templates, engaging subject lines, clear messages and more: https://pa.ag/outreach-hero
▪ PPC Performance Analyzer: For data analysis and adaptability, optimisation suggestions and more, all with perfect confidentiality: https://pa.ag/ppc-performance-analyzer
▪ Structured Data GPT: For analysing and troubleshooting structured data for SEO, optimisation suggestions, technical implementation support, and more: https://pa.ag/structured-data
Source: https://pa.ag/peakace-gptsuite

OK, I get it… boooooooring!?

But what about 3rd party data integrations (e.g., via API)?

Making GPTs smarter with external data
A Custom GPT to connect with the DataForSEO API to allow for real-time access to actual search data:

Why use external data?

ChatGPT can do this out of the box, can’t it? Well, no… the (training) data is insufficient and/or outdated, numbers
are either non-existent or completely made up.

So, how can you build this yourself? Here’s a quick three-step guide on how to DIY it.

#1 Provide basic info to get started (name, description, …)
Log in to ChatGPT > choose Explore GPTs > Create (you need ChatGPT Plus)
Well-defined instructions are key, think prompting.

#2 Create an ‘Action’ to call a 3rd party API
▪ Head to your API provider and grab your credentials. In our case this was the API Dashboard at DataForSEO.com.
▪ Get the OpenAPI Schema for DataForSEO: https://pa.ag/3Pa7oZ3
▪ To use with an action, you need to generate a base64-encoded version of your login credentials: btoa('APIemail:APIpass')
▪ The annoying part: you need a Schema according to the OpenAPI spec. But no one reads docs anymore – we just leverage ChatGPT to do this.
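If you prefer not to open a browser console for `btoa(...)`, the same base64 string can be produced in Python. A minimal sketch, assuming the action uses HTTP Basic authentication and the credentials are plain ASCII; `APIemail` and `APIpass` are the placeholders from the slide.

```python
import base64


def basic_auth_token(login: str, password: str) -> str:
    """Base64-encode 'login:password', like btoa() does in the browser."""
    raw = f"{login}:{password}".encode("utf-8")
    return base64.b64encode(raw).decode("ascii")


token = basic_auth_token("APIemail", "APIpass")  # placeholder credentials
print(f"Authorization: Basic {token}")
```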

#3 Test and publish your GPT
Remember: APIs usually aren’t free, so make sure you publish your new Custom GPT for yourself only!

Customisation for other APIs is easy (e.g., Sistrix, etc.): just reauthenticate (base64-encoded version of your login). You also need
a new schema (again based on the OAS spec).

Did you know? You can link using pre-filled prompts!
You can also link directly to pre-filled prompts and execute them – which works for both Custom GPTs and GPT-4 models. Simply add the query string (using “q=xxx”) to the end of your ChatGPT URL. Use directly in your Chrome browser.
▪ For any Custom GPT add: ?q=your+prompt+goes+here
▪ For the GPT-4 base model: ?model=gpt-4&q=your+prompt
Source: https://pa.ag/crsum
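Such links can also be generated programmatically, e.g. for a list of prompts. A small sketch; the base URL is an assumption and should be swapped for your own ChatGPT or Custom GPT URL.

```python
from typing import Optional
from urllib.parse import urlencode

BASE_URL = "https://chat.openai.com/"  # assumed; use your own (Custom GPT) URL


def prefilled_link(prompt: str, model: Optional[str] = None,
                   base_url: str = BASE_URL) -> str:
    """Append the pre-filled prompt (and optionally a model) as a query string."""
    params = {}
    if model:
        params["model"] = model
    params["q"] = prompt  # urlencode turns spaces into '+'
    return f"{base_url}?{urlencode(params)}"


print(prefilled_link("your prompt goes here"))
print(prefilled_link("your prompt", model="gpt-4"))
```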

When to use a Custom GPT?
Besides seamless 3rd party data integration, my top-3 reasons why building and using Custom GPTs can help a lot:
▪ Long-term context: Custom GPTs are a really powerful tool to ensure instructions remain contextualised over long periods of time.
▪ Building workflows: Custom GPTs are best suited for composing workflows aimed at people who don’t know how to properly design context with prompt sequences.
▪ Sharing instructions: For sharing the exact same instructions e.g., cross-team, without having to worry about specifying them (and how) at prompt level.

BTW: not compatible & very different… GPTs for MS Copilot
ChatGPT has almost completely replaced plugins with GPTs. On Copilot, plugins call on external services. However, Copilot GPTs are a conversation with a specific goal:
Source: https://pa.ag/3wyCrr0

Copilot + Excel: taking data analysis to a whole new level!
Super pumped for this, as it’ll enable just about anyone to question, analyse, visualise and refine data effortlessly:
Source: https://pa.ag/49YhhBe

I know, I know… it’s a lot.

Looking to learn more about AI this year?
Some new (and free) AI courses and resources to help you boost your AI knowledge:
Sources: https://pa.ag/48NKLkk / https://pa.ag/4c2sa6U / https://pa.ag/48FldFJ / https://pa.ag/4c2smTG / http://pa.ag/ai

Want to chat about AI & grab a t-shirt? Meet
Peak Ace in the expo hall, booth #1 (ground floor) = https://pa.ag/smx24
