brightonSEO Sep 23 - Exploiting XPath in ScreamingFrog and Google Sheets

Thiago
September 17, 2023

Transcript

  1. Exploiting XPath in
    ScreamingFrog and
    Google Sheets
    Thiago Pojda
    SIXT
    Speakerdeck.com/pojda
    @tedois

  2. The Nerd
    ● Husband, dad
    ● Software Engineer
    ● SEO since 2008
    ● Since 2015, wesearch.media
    ● Since 2019, living in DE
    ● Since 2022, Director SEO @ SIXT

  4. 2,000+ locations
    100+ countries
    270,000+ cars
    7,500+ employees
    20 SEOs

  5. SEO @ SIXT

    SEO Specialists

    Local, Content, Tech, Authority, Data

    Product specialists

  6. Understanding how sites are built
    and how to extract information
    from pages has made my
    analysis much more relevant to
    both me and my clients

  7. Source: https://en.wikipedia.org/wiki/Document_Object_Model

  8. XPath
    //div[@class="content"]/text()

  9. Expanded syntax is handy but looks
    terrible
    /descendant-or-self::div[@class="content"]/child::text()

  10. Abbreviated syntax is enough for
    99% of the cases
    /descendant-or-self::div[@class="content"]/child::text()
    //div[@class="content"]/text()
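
As a rough illustration of the abbreviated form, the sketch below runs the slide's query against a made-up fragment using Python's standard-library ElementTree. ElementTree implements only a subset of abbreviated XPath (no text() step, no expanded axes), so the text is read from each matched element instead. The sample markup is an assumption, not taken from the talk.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample page (ElementTree needs valid XML).
SAMPLE = """
<html><body>
  <div class="nav">Menu</div>
  <div class="content">Hello XPath</div>
</body></html>
""".strip()

root = ET.fromstring(SAMPLE)

# ".//div[@class='content']" is the ElementTree spelling of //div[@class="content"].
matches = root.findall(".//div[@class='content']")

# ElementTree has no text() step, so read .text from each matched element.
content_texts = [div.text for div in matches]
print(content_texts)
```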

  12. Specifier   Meaning
    /             Selects a direct child
    //            Selects any descendant or self
    @             Attribute
    ..            Parent
    .             Self

  13. Specifier   Meaning
    [ and ]       Predicates (the official name); think of them as filters
    *             Any element
    function()    A function call
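
The specifiers in the two tables can be tried out with a small sketch like the one below. It uses Python's standard-library ElementTree, which covers only part of XPath, and a hypothetical menu fragment invented for this example.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample markup, kept well-formed so ElementTree can parse it.
SAMPLE = """
<body>
  <ul id="menu">
    <li><a href="/cars">Cars</a></li>
    <li><a href="/vans">Vans</a></li>
  </ul>
  <p>Footer</p>
</body>
""".strip()

root = ET.fromstring(SAMPLE)

# "/" with "*": any element that is a direct child of <body>.
direct_children = [el.tag for el in root.findall("./*")]

# "//": every <a>, however deeply nested.
links = root.findall(".//a")

# "@" inside a predicate: filter on an attribute value.
vans_link = root.findall(".//a[@href='/vans']")

# "..": step up to the parent of each matched <a>.
parents_of_links = [el.tag for el in root.findall(".//a/..")]

print(direct_children, len(links), vans_link[0].text, parents_of_links)
```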

  14. Useful functions

    text()

    contains(where, what)

    normalize-space(text)

    starts-with(where, what) & ends-with(where, what)

    sum()

    and, or

  15. //author[contains(.,"Matt")]
    Matches all author nodes whose
    content contains "Matt"
    (case-sensitive)

  16. //author[starts-with(.,"G")]
    Matches all author nodes whose
    content starts with "G"
    (case-sensitive)

  17. //author[matches(.,"Matt.*")]
    Regular expression match (matches() needs XPath 2.0 or later)
    Source: https://librarycarpentry.org/lc-webscraping/02-xpath/index.html
    License: CC BY 4.0
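
To see what the three predicates from the last slides select, the sketch below applies the equivalent tests in plain Python. This is an emulation, not XPath itself: the standard-library ElementTree has no contains(), starts-with() or matches(), so string methods and the re module stand in for them. The author names are invented.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical author list used only for this illustration.
SAMPLE = """
<authors>
  <author>Matt Smith</author>
  <author>Gina Matthews</author>
  <author>matt jones</author>
</authors>
""".strip()

authors = [a.text for a in ET.fromstring(SAMPLE).findall(".//author")]

# contains(., "Matt"): substring test, case-sensitive.
contains_matt = [name for name in authors if "Matt" in name]

# starts-with(., "G"): prefix test, case-sensitive.
starts_with_g = [name for name in authors if name.startswith("G")]

# matches(., "Matt.*"): regular expression, matched anywhere in the string.
matches_matt = [name for name in authors if re.search(r"Matt.*", name)]

print(contains_matt, starts_with_g, matches_matt)
```

Note that "Gina Matthews" passes the contains test because the substring can appear anywhere, while the lowercase "matt jones" is skipped.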

  18. //h3[1]
    The first H3 element within each parent
    (use (//h3)[1] for the first in the whole document)

  19. //h3[last()]
    The last H3 element within each parent

  20. //h3[last()-1]
    The second-to-last H3 element within each parent
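
Positional predicates count among siblings under the same parent, so on a page with several sections //h3[1] can return more than one node. A minimal sketch with Python's standard-library ElementTree, using a made-up two-section page:

```python
import xml.etree.ElementTree as ET

# Hypothetical page with two sections, each holding its own run of <h3> elements.
SAMPLE = """
<body>
  <section><h3>A1</h3><h3>A2</h3><h3>A3</h3></section>
  <section><h3>B1</h3><h3>B2</h3></section>
</body>
""".strip()

root = ET.fromstring(SAMPLE)

def texts(path):
    """Return the text of every element matched by an ElementTree path."""
    return [el.text for el in root.findall(path)]

first_per_parent = texts(".//h3[1]")
last_per_parent = texts(".//h3[last()]")
before_last_per_parent = texts(".//h3[last()-1]")

# Equivalent of (//h3)[1]: collect all matches first, then take the first one.
first_in_document = texts(".//h3")[0]

print(first_per_parent, last_per_parent, before_last_per_parent, first_in_document)
```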

  21. //img[not(@alt)]
    Only images without alt attribute

  22. //img[@alt]
    Only images with alt attribute
    (will match empty alts)

  23. //img[string-length(@alt) >= 1]
    Only images whose alt attribute is
    at least 1 character long (non-empty)
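
The difference between a missing, an empty and a filled alt attribute can be checked with a sketch like this one. Only //img[@alt] runs as a real path here; the standard-library ElementTree lacks not() and string-length(), so those two checks are emulated in Python. The three images are invented sample data.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment: one image per case.
SAMPLE = """
<body>
  <img src="a.png"/>
  <img src="b.png" alt=""/>
  <img src="c.png" alt="Red car"/>
</body>
""".strip()

root = ET.fromstring(SAMPLE)
images = root.findall(".//img")

# //img[not(@alt)]: the attribute is absent altogether.
missing_alt = [img.get("src") for img in images if img.get("alt") is None]

# //img[@alt]: the attribute exists, even when it is empty.
has_alt = [img.get("src") for img in root.findall(".//img[@alt]")]

# //img[string-length(@alt) >= 1]: the attribute holds at least one character.
non_empty_alt = [img.get("src") for img in images if len(img.get("alt") or "") >= 1]

print(missing_alt, has_alt, non_empty_alt)
```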

  24. Source: https://librarycarpentry.org/lc-webscraping/02-xpath/index.html
    License: CC BY 4.0

  25. https://brightonseo.com/people/thiago-pojda

  26. Formula: =importxml(D1, "//title")
    Result:  Thiago Pojda

    Formula: =importxml(D1, "//h1")
    Result:  Thiago Pojda

    Formula: =join(,importxml(D1, "//h1/following-sibling::*[1]"))
    Result:  SIXT | Director of SEO & In-house Dad Joke Specialist

    Formula: =importxml(D1, "//h1/following-sibling::*[2]")
    Result:  Thiago is a Brazilian SEO nerd who loves learning about (and
             nudging) consumer behaviours. Worked several years with SEO for
             big and small brands both at agencies and as in-house, he now
             leads the SEO Team at SIXT in Germany.

    Formula: =importxml(D1, "//h1/..//a/@href")
    Result:  https://www.sixt.com/
             https://twitter.com/tedois
             https://linkedin.com/in/pojda
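
For readers who want to test such queries outside Sheets, here is a minimal offline sketch of the same extractions in Python. It does not fetch the real page: the markup is a simplified, hypothetical stand-in, and because the standard-library ElementTree has no following-sibling axis, the helper `following_sibling` (a name made up for this example) emulates it.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified stand-in for a speaker profile page.
SAMPLE = """
<html>
  <head><title>Thiago Pojda</title></head>
  <body>
    <div>
      <h1>Thiago Pojda</h1>
      <p>SIXT | Director of SEO</p>
      <p>Short bio goes here.</p>
      <a href="https://www.sixt.com/">Site</a>
      <a href="https://twitter.com/tedois">Twitter</a>
    </div>
  </body>
</html>
""".strip()

root = ET.fromstring(SAMPLE)

def following_sibling(root, path, n):
    """Emulate path/following-sibling::*[n] for the first match of path."""
    target = root.find(path)
    parent = root.find(path + "/..")
    siblings = list(parent)
    return siblings[siblings.index(target) + n]

title = root.find(".//title").text                 # //title
h1 = root.find(".//h1").text                       # //h1
role = following_sibling(root, ".//h1", 1).text    # //h1/following-sibling::*[1]
bio = following_sibling(root, ".//h1", 2).text     # //h1/following-sibling::*[2]

# //h1/..//a/@href: go up to the parent of <h1>, then collect every link below it.
hrefs = [a.get("href") for a in root.findall(".//h1/..//a")]

print(title, h1, role, bio, hrefs)
```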

  27. Find XPath via SF

  28. #1
    Pages with SEO Text

  29. //article[@data-testid='streamSeoSection']
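
A quick way to sanity-check a selector like this before crawling is to run it against a saved page. The sketch below assumes the page is well-formed enough for Python's standard-library ElementTree (real HTML usually needs a proper HTML parser) and uses two invented page snippets.

```python
import xml.etree.ElementTree as ET

# ElementTree spelling of //article[@data-testid='streamSeoSection'].
SEO_TEXT_PATH = ".//article[@data-testid='streamSeoSection']"

def has_seo_text(page_markup):
    """Return True when the page contains the SEO text container."""
    root = ET.fromstring(page_markup)
    return root.find(SEO_TEXT_PATH) is not None

# Hypothetical pages, reduced to the parts that matter for the check.
page_with = "<body><article data-testid='streamSeoSection'>Rent a car...</article></body>"
page_without = "<body><article data-testid='offerList'>Offers</article></body>"

results = {"with": has_seo_text(page_with), "without": has_seo_text(page_without)}
print(results)
```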

  30. Is my competitor using “SEO texts”?
    On which page types?
    Is there any category where they do it more?
    Why?
    WHY?
    WHY?

  33. Export your crawl (internal all)

  34. Categorise what you see
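
One possible way to do the categorising step, sketched in Python: group the exported URLs by page type and count how many pages of each type returned a value for the custom extraction. Everything specific here is an assumption for illustration: the column names "Address" and "SEO Text 1", the URL patterns and the sample rows.

```python
import csv
import io
import re
from collections import Counter

# Hypothetical export rows; "Address" and "SEO Text 1" are assumed column names.
EXPORT = """Address,SEO Text 1
https://example.com/c/shoes,Buy shoes online
https://example.com/c/bags,
https://example.com/p/red-shoe-123,
https://example.com/blog/how-to-lace,Long article text
"""

# Hypothetical URL patterns per page type; adapt them to the site being analysed.
PAGE_TYPES = [
    ("category", re.compile(r"/c/")),
    ("product", re.compile(r"/p/")),
    ("blog", re.compile(r"/blog/")),
]

def page_type(url):
    """Return the first page type whose pattern matches the URL."""
    for name, pattern in PAGE_TYPES:
        if pattern.search(url):
            return name
    return "other"

totals, with_text = Counter(), Counter()
for row in csv.DictReader(io.StringIO(EXPORT)):
    kind = page_type(row["Address"])
    totals[kind] += 1
    if row["SEO Text 1"].strip():
        with_text[kind] += 1

# Share of pages per type where the extraction found an SEO text.
share = {kind: with_text[kind] / totals[kind] for kind in totals}
print(share)
```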

  35. #2
    Products per category

  36. //div[@id="listing"]//div[contains(text(),"Ergebnisse")]/text()
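
The extraction returns a text such as "1.234 Ergebnisse" ("Ergebnisse" is German for "results"), which still has to be turned into a number before comparing categories. A minimal sketch, assuming the count is the first number in the text and that dots or spaces are thousands separators; `parse_result_count` is a helper name invented here.

```python
import re

def parse_result_count(text):
    """Return the first number in text as an int, or None if there is none."""
    match = re.search(r"\d[\d.\s]*", text)
    if match is None:
        return None
    # Drop thousands separators and any trailing whitespace.
    digits = re.sub(r"\D", "", match.group())
    return int(digits)

# Hypothetical extracted values.
counts = [parse_result_count(t) for t in ["87 Ergebnisse", "1.234 Ergebnisse", "Keine Ergebnisse"]]
print(counts)
```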

  37. #3
    Mapping indexable filters

  38. //a[contains(@class,"pill")]/text()
    //a[contains(@class,"pill")]/@href
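
The two queries are most useful when their results are paired up, label next to target URL. The sketch below does that on an invented navigation fragment. The standard-library ElementTree has no contains(), so the class check is done in Python after selecting every link that has a class attribute.

```python
import xml.etree.ElementTree as ET

# Hypothetical filter navigation; class names and URLs are made up.
SAMPLE = """
<nav>
  <a class="pill pill--active" href="/suv">SUV</a>
  <a class="pill" href="/suv?gear=automatic">Automatic</a>
  <a class="link" href="/help">Help</a>
</nav>
""".strip()

root = ET.fromstring(SAMPLE)

# contains(@class, "pill") is a substring test, so "pill pill--active" also matches.
pills = [a for a in root.findall(".//a[@class]") if "pill" in a.get("class")]

# Pair the two extractions: text() gives the label, @href the target URL.
filters = [(a.text, a.get("href")) for a in pills]
print(filters)
```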

  39. Think about elements you can use to
    break down your competitor’s strategy

    Find a great XPath for it, crawl,
    analyze

    Hate XPath until you love it

  40. Read more

    https://librarycarpentry.org/lc-webscraping/02-xpath/index.html

    https://www.searchenginejournal.com/xpaths-large-site-audits/329851/

    https://twitter.com/tedois
