Beat Signer Department of Computer Science Vrije Universiteit Brussel beatsigner.com
November 26, 2024

Search Engine Result Page
▪ There is a variety of information shown on a search engine result page (SERP)
  ▪ organic search results
  ▪ non-organic search results
  ▪ meta-information about the result (e.g. number of result pages)
  ▪ vertical navigation
  ▪ advanced search options
  ▪ query refinement suggestions
  ▪ ...

Search Engine History
▪ Early "search engines" include various systems starting with Bush's Memex
▪ Archie (1990)
  ▪ first Internet search engine
  ▪ indexing of files on FTP servers
▪ W3Catalog (September 1993)
  ▪ first "web search engine"
  ▪ mirroring and integration of manually maintained catalogues
▪ JumpStation (December 1993)
  ▪ first web search engine combining crawling, indexing and searching

Search Engine History ...
▪ In the following two years (1994/1995) many new search engines appeared
  ▪ AltaVista, Infoseek, Excite, Inktomi, Yahoo!, ...
▪ Two categories of early web search solutions
  ▪ full-text search
    - based on an index that is automatically created by a web crawler in combination with an indexer
    - e.g. AltaVista or Infoseek
  ▪ manually maintained classification (hierarchy) of webpages
    - significant human editing effort
    - e.g. Yahoo! (until 2014)

Information Retrieval
▪ Precision and recall can be used to measure the performance of different information retrieval algorithms

$$\text{precision} = \frac{|\{\text{relevant documents}\} \cap \{\text{retrieved documents}\}|}{|\{\text{retrieved documents}\}|}$$

$$\text{recall} = \frac{|\{\text{relevant documents}\} \cap \{\text{retrieved documents}\}|}{|\{\text{relevant documents}\}|}$$

[Figure: a query over the documents D1 to D10 with the sets of relevant and retrieved documents]
▪ Example: precision = 3/5 = 0.6 and recall = 3/4 = 0.75

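The two measures can be sketched in a few lines of Python. The document IDs in `retrieved` and `relevant` are hypothetical; they are only chosen so that the result matches the 3/5 and 3/4 of the example, since the exact sets of the figure are not recoverable here.

```python
def precision_recall(retrieved: set, relevant: set) -> tuple:
    """Return (precision, recall) for a retrieved and a relevant document set."""
    hits = len(retrieved & relevant)  # relevant documents that were retrieved
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical sets: 5 retrieved documents, 4 relevant ones, 3 in common.
retrieved = {"D1", "D3", "D8", "D9", "D10"}
relevant = {"D3", "D5", "D8", "D9"}

precision, recall = precision_recall(retrieved, relevant)
print(precision, recall)  # 0.6 0.75
```
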
Information Retrieval ...
▪ Often a combination of precision and recall, the so-called F-score (harmonic mean), is used as a single measure

$$F\text{-score} = \frac{2 \cdot \text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}$$

[Figure: two result sets for a query over the documents D1 to D10]
▪ Example 1: precision = 0.6, recall = 0.75, F-score = 0.67
▪ Example 2: precision = 0.57, recall = 1, F-score = 0.73

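As a quick check of the two examples, the harmonic mean can be computed directly from the precision and recall values given on the slide. The function name `f_score` is just an illustrative choice.

```python
def f_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


print(round(f_score(0.6, 0.75), 2))  # 0.67
print(round(f_score(0.57, 1.0), 2))  # 0.73
```

The second result set trades precision for recall and still ends up with the higher F-score.
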
Boolean Model
▪ Based on set theory and boolean logic
▪ Exact matching of documents to a user query
▪ Uses the boolean AND, OR and NOT operators
  ▪ query: Shopping AND Ghent AND NOT Delhaize
  ▪ computation: 101110 AND 100111 AND 000111 = 000110
  ▪ result: document set {D4, D5}
[Figure: inverted index shown as a term-document bit matrix over the terms Bank, Delhaize, Ghent, Metro, Shopping and Train and the documents D1 to D6; the rows used in the computation are Shopping = 101110, Ghent = 100111 and NOT Delhaize = 000111]

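The query evaluation above can be sketched with integers used as bit vectors, where the leftmost bit stands for D1. Only the three rows needed for the query are included; the Delhaize row 111000 is derived from the slide's NOT Delhaize = 000111, and the helper name `docs_for` is an assumption made for this example.

```python
DOCS = ["D1", "D2", "D3", "D4", "D5", "D6"]

# One bit per document, leftmost bit = D1 (rows taken from the computation above).
INDEX = {
    "Shopping": 0b101110,
    "Ghent": 0b100111,
    "Delhaize": 0b111000,
}

ALL_DOCS = (1 << len(DOCS)) - 1  # mask 111111, needed to keep NOT within 6 bits


def docs_for(bits: int) -> list:
    """Translate a bit vector back into the list of matching document names."""
    n = len(DOCS)
    return [DOCS[i] for i in range(n) if bits & (1 << (n - 1 - i))]


# Shopping AND Ghent AND NOT Delhaize
result = INDEX["Shopping"] & INDEX["Ghent"] & (~INDEX["Delhaize"] & ALL_DOCS)
print(format(result, "06b"), docs_for(result))  # 000110 ['D4', 'D5']
```
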
Boolean Model ...
▪ Advantages
  ▪ relatively easy to implement and scalable
  ▪ fast query processing based on parallel scanning of indexes
▪ Disadvantages
  ▪ no ranking of output
  ▪ often the user has to learn a special syntax such as the use of double quotes to search for phrases
▪ Variants of the boolean model form the basis of many search engines
  ▪ inverted index

Web Search Engines
▪ Most web search engines are based on traditional information retrieval techniques, but they must be adapted to deal with the characteristics of the Web
  ▪ immense amount of web resources (>150 billion web pages)
  ▪ hyperlinked resources
  ▪ dynamic content with frequent updates
  ▪ self-organised web resources
▪ Evaluation of performance
  ▪ no standard collections
  ▪ often based on user studies (satisfaction)
▪ Of course, not only the precision and recall but also the query answer time is an important issue

Web Crawler
▪ A web crawler or spider is used to create an index of webpages to be used by a web search engine
  ▪ any web search is then based on this index
▪ A web crawler has to deal with the following issues
  ▪ freshness
    - the index should be updated regularly (based on web page update frequency)
  ▪ quality
    - since not all web pages can be indexed, the crawler should give priority to "high quality" pages
  ▪ scalability
    - it should be possible to increase the crawl rate by just adding additional servers (modular architecture)
    - e.g. the estimated number of Google servers in 2016 was 2.5 million (including not only the crawler but the entire Google platform)

Web Crawler ...
  ▪ distribution
    - the crawler should be able to run in a distributed manner (computer centres all over the world)
  ▪ robustness
    - the Web contains a lot of pages with errors and a crawler must deal with these problems
    - e.g. deal with a web server that creates an unlimited number of "virtual web pages" (crawler trap)
  ▪ efficiency
    - resources (e.g. network bandwidth) should be used in the most efficient way
  ▪ crawl rates
    - the crawler should pay attention to existing web server policies (e.g. revisit-after HTML meta tag or robots.txt file)

    robots.txt:
        User-agent: *
        Disallow: /cgi-bin/
        Disallow: /tmp/

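A polite crawler would check such a policy before fetching a URL. The following sketch uses `urllib.robotparser` from the Python standard library on the robots.txt rules shown above; the user agent name `ExampleCrawler`, the host `example.org` and the helper `allowed` are hypothetical.

```python
from urllib.robotparser import RobotFileParser

# The robots.txt rules from the slide, parsed from a string instead of a URL.
ROBOTS_TXT = """\
User-agent: *
Disallow: /cgi-bin/
Disallow: /tmp/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def allowed(url: str, agent: str = "ExampleCrawler") -> bool:
    """Return True if the given (hypothetical) crawler may fetch the URL."""
    return parser.can_fetch(agent, url)


print(allowed("https://example.org/index.html"))      # True
print(allowed("https://example.org/tmp/cache.html"))  # False
```

In a real crawler the rules would be downloaded per host via `set_url()` and `read()` and cached, rather than embedded as a string.
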
Pre-1998 Web Search
▪ Find all documents for a given query term
  ▪ use information retrieval (IR) solutions
    - boolean model
    - vector space model
    - ...
  ▪ ranking based on "on-page factors"
→ problem: poor quality of search results (order)
▪ Larry Page and Sergey Brin proposed computing an absolute quality measure for a page, called PageRank
  ▪ based on the number and quality of pages linking to a page (votes)
  ▪ query-independent

Origins of PageRank
▪ Developed as part of an academic project at Stanford University
  ▪ research platform to aid understanding of large-scale web data and enable researchers to easily experiment with new search technologies
▪ Larry Page and Sergey Brin worked on the project about a new kind of search engine (1995-1998) which finally led to a functional prototype called Google
[Figure: photos of Larry Page and Sergey Brin]

PageRank
▪ A page $P_i$ has a high PageRank $R_i$ if
  ▪ there are many pages linking to it
  ▪ or, if there are some pages with a high PageRank linking to it
▪ Total score = IR score × PageRank
[Figure: graph of linked pages P1 to P8 with their PageRanks R1 to R8]

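Basic PageRank Algorithm

$$R(P_i) = \sum_{P_j \in B_i} \frac{R(P_j)}{L_j}$$

▪ where
  ▪ $B_i$ is the set of pages that link to page $P_i$
  ▪ $L_j$ is the number of outgoing links for page $P_j$
[Figure: example graph with the three pages P1, P2 and P3; starting from R(P1) = R(P2) = R(P3) = 1, the values become R(P1) = 1.5, R(P2) = 1.5 and R(P3) = 0.75 and no longer change in the following iteration]
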
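The iteration of the example can be sketched as follows. The link structure (P1 → P2, P2 → P1 and P3, P3 → P1) is an assumption: it is not stated in the text but reproduces the values 1.5, 1.5 and 0.75 when the pages are updated one after the other, each update already using the new values of the previous pages.

```python
# Assumed link structure: page -> pages it links to.
links = {"P1": ["P2"], "P2": ["P1", "P3"], "P3": ["P1"]}


def backlinks(links: dict) -> dict:
    """Compute B_i, the set of pages linking to each page."""
    b = {page: [] for page in links}
    for source, targets in links.items():
        for target in targets:
            b[target].append(source)
    return b


def sweep(rank: dict, links: dict) -> dict:
    """One in-place pass of R(P_i) = sum of R(P_j) / L_j over all P_j in B_i."""
    b = backlinks(links)
    for page in rank:  # pages are visited in insertion order P1, P2, P3
        rank[page] = sum(rank[q] / len(links[q]) for q in b[page])
    return rank


rank = {"P1": 1.0, "P2": 1.0, "P3": 1.0}
sweep(rank, links)
print(rank)  # {'P1': 1.5, 'P2': 1.5, 'P3': 0.75}
```

A second call to `sweep` leaves the values unchanged, which matches the repeated values in the example.
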
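Matrix Representation
▪ Let us define a hyperlink matrix H

$$H_{ij} = \begin{cases} 1/L_j & \text{if } P_j \in B_i \\ 0 & \text{otherwise} \end{cases}$$

▪ For the example with the pages P1, P2 and P3

$$H = \begin{pmatrix} 0 & \frac{1}{2} & 1 \\ 1 & 0 & 0 \\ 0 & \frac{1}{2} & 0 \end{pmatrix} \quad \text{and} \quad R = [R(P_i)]$$

$$R = HR \;\rightarrow\; R \text{ is an eigenvector of } H \text{ with eigenvalue } 1$$
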
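Matrix Representation ...
▪ We can use the power method to find R
  ▪ sparse matrix H with 150 billion columns and rows but only an average of 10 non-zero entries in each column

$$R_{t+1} = H R_t$$

▪ For our example with

$$H = \begin{pmatrix} 0 & \frac{1}{2} & 1 \\ 1 & 0 & 0 \\ 0 & \frac{1}{2} & 0 \end{pmatrix}$$

this results in

$$R = \begin{pmatrix} 2 \\ 2 \\ 1 \end{pmatrix} \quad \text{or} \quad R = \begin{pmatrix} 0.4 \\ 0.4 \\ 0.2 \end{pmatrix}$$
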
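A minimal pure-Python sketch of the power method for the three-page example. The matrix `H` is an assumption reconstructed for the links P1 → P2, P2 → P1 and P3, P3 → P1. A production system would use a sparse matrix representation instead of nested lists, and the tolerance and iteration limit are arbitrary choices.

```python
def power_method(H: list, tol: float = 1e-10, max_iter: int = 1000) -> list:
    """Iterate R_{t+1} = H R_t from a uniform start vector until it stabilises."""
    n = len(H)
    r = [1.0 / n] * n
    for _ in range(max_iter):
        nxt = [sum(H[i][j] * r[j] for j in range(n)) for i in range(n)]
        if max(abs(a - b) for a, b in zip(nxt, r)) < tol:
            return nxt
        r = nxt
    return r


# Column j holds 1/L_j for every page that P_j links to.
H = [
    [0.0, 0.5, 1.0],
    [1.0, 0.0, 0.0],
    [0.0, 0.5, 0.0],
]

R = power_method(H)
print([round(x, 3) for x in R])  # [0.4, 0.4, 0.2]
```

Because every column of this `H` sums to 1, the entries of R keep summing to 1 during the iteration, which gives the normalised form (0.4, 0.4, 0.2) of the eigenvector (2, 2, 1).
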
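Dangling Pages (Rank Sink)
▪ Problem with pages that have no outgoing links (e.g. $P_2$)

$$H = \begin{pmatrix} 0 & 0 \\ 1 & 0 \end{pmatrix} \quad \text{and} \quad R = \begin{pmatrix} 0 \\ 0 \end{pmatrix}$$

▪ Stochastic adjustment
  ▪ if page $P_j$ has no outgoing links then replace all entries of column j with 1/n, where n is the number of pages

$$C = \begin{pmatrix} 0 & \frac{1}{2} \\ 0 & \frac{1}{2} \end{pmatrix} \quad \text{and} \quad S = H + C = \begin{pmatrix} 0 & \frac{1}{2} \\ 1 & \frac{1}{2} \end{pmatrix}$$

▪ New stochastic matrix S always has a stationary vector R
  ▪ can also be interpreted as a Markov chain
[Figure: graph with the two pages P1 and P2]
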
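The stochastic adjustment for the two-page example can be sketched as follows: every all-zero column of H (a page without outgoing links) is replaced by a column with entries 1/n. The function names and the fixed number of iterations are illustrative assumptions.

```python
def stochastic_adjustment(H: list) -> list:
    """Return S = H + C, where dangling (all-zero) columns get entries 1/n."""
    n = len(H)
    S = [row[:] for row in H]
    for j in range(n):
        if all(H[i][j] == 0 for i in range(n)):  # P_j has no outgoing links
            for i in range(n):
                S[i][j] = 1.0 / n
    return S


def stationary(M: list, iterations: int = 100) -> list:
    """Approximate the stationary vector of M by repeated multiplication."""
    n = len(M)
    r = [1.0 / n] * n
    for _ in range(iterations):
        r = [sum(M[i][j] * r[j] for j in range(n)) for i in range(n)]
    return r


H = [[0.0, 0.0],
     [1.0, 0.0]]  # P1 links to P2, P2 is a dangling page

S = stochastic_adjustment(H)
print(S)  # [[0.0, 0.5], [1.0, 0.5]]

R = stationary(S)
print([round(x, 3) for x in R])  # [0.333, 0.667]
```

With the unadjusted H the same iteration drains all rank out of the system and ends in the zero vector, which is the rank sink problem described above.
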
Strongly Connected Pages (Graph)
▪ Add new transition probabilities between all pages
  ▪ with probability d we follow the hyperlink structure S
  ▪ with probability 1-d we choose a random page
  ▪ matrix G becomes irreducible
▪ Google matrix G reflects a random surfer
  ▪ no modelling of back button

$$G = dS + (1-d)\,\frac{1}{n}\,\mathbf{1} \quad \text{and} \quad R = GR$$

where n is the number of pages and $\mathbf{1}$ is the n × n matrix of ones
[Figure: graph with the pages P1 to P5 and additional transitions with probability 1-d]

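The construction of G can be sketched for a small stochastic matrix S. The value d = 0.85 is an assumption (a commonly cited choice, not given on the slide), and the two-page matrix `S` as well as the function names are only used for illustration.

```python
def google_matrix(S: list, d: float = 0.85) -> list:
    """Build G = d*S + (1-d)*(1/n)*1 for a column-stochastic matrix S."""
    n = len(S)
    return [[d * S[i][j] + (1.0 - d) / n for j in range(n)] for i in range(n)]


def stationary(M: list, iterations: int = 200) -> list:
    """Approximate R with R = M R by repeated multiplication (power method)."""
    n = len(M)
    r = [1.0 / n] * n
    for _ in range(iterations):
        r = [sum(M[i][j] * r[j] for j in range(n)) for i in range(n)]
    return r


# Stochastic matrix of a two-page example: P1 links to P2, P2 was dangling.
S = [[0.0, 0.5],
     [1.0, 0.5]]

G = google_matrix(S)
R = stationary(G)
print([[round(x, 3) for x in row] for row in G])  # [[0.075, 0.5], [0.925, 0.5]]
print([round(x, 3) for x in R])
```

Every entry of G is strictly positive, so each page can be reached from every other page in a single step, which is what makes the matrix irreducible.
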
Examples
▪ PageRank leakage
[Figure: linked page groups A and B, computed with $G = dS + (1-d)\frac{1}{n}\mathbf{1}$; A1 = 0.10, A2 = 0.14, A3 = 0.14, B1 = 0.22, B2 = 0.20, B3 = 0.20; P(A) = 0.38 and P(B) = 0.62]

Examples
▪ PageRank feedback
[Figure: linked page groups A and B, computed with $G = dS + (1-d)\frac{1}{n}\mathbf{1}$; A1 = 0.35, A2 = 0.24, A3 = 0.18, B1 = 0.09, B2 = 0.07, B3 = 0.07; P(A) = 0.77 and P(B) = 0.23]

Google Search Central
▪ Various services and information about a website
▪ Site configuration
  ▪ submission of sitemap
  ▪ crawler access
  ▪ URLs of indexed pages
▪ Performance
  ▪ search queries
  ▪ countries
  ▪ devices
  ▪ ...

Google Search Central ...
▪ Enhancements
  ▪ core web vitals (speed)
    - mobile as well as desktop
  ▪ mobile usability
▪ Security issues
▪ Similar tools offered by other search engines
  ▪ e.g. Bing Webmaster Tools

XML Sitemaps
▪ List of URLs that should be crawled and indexed

    <?xml version="1.0" encoding="UTF-8"?>
    <urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
      <url>
        <loc>https://beatsigner.com/</loc>
        <lastmod>2024-11-24</lastmod>
        <changefreq>weekly</changefreq>
        <priority>1.0</priority>
      </url>
      <url>
        <loc>https://beatsigner.com/publications.html</loc>
        <lastmod>2024-11-24</lastmod>
        <changefreq>weekly</changefreq>
        <priority>0.9</priority>
      </url>
      ...
    </urlset>

XML Sitemaps ...
▪ All major search engines support the sitemap format
▪ The URLs of a sitemap are not guaranteed to be added to a search engine's index
  ▪ helps search engines to find pages that are not yet indexed
▪ Additional metadata might be provided to search engines
  ▪ relative page relevance (priority)
  ▪ date of last modification (lastmod)
  ▪ update frequency (changefreq)

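A sitemap with these metadata fields could, for instance, be generated with `xml.etree.ElementTree` from the Python standard library. This is only a sketch: the function `build_sitemap` and the dictionary-based entry format are assumptions, and the namespace is the one defined by the sitemaps.org 0.9 protocol.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(entries: list) -> str:
    """Serialise a list of URL entries (dicts) into a sitemap XML string."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for entry in entries:
        url = ET.SubElement(urlset, "url")
        # loc is required by the protocol; the other fields are optional.
        for tag in ("loc", "lastmod", "changefreq", "priority"):
            if tag in entry:
                ET.SubElement(url, tag).text = str(entry[tag])
    body = ET.tostring(urlset, encoding="unicode")
    return '<?xml version="1.0" encoding="UTF-8"?>\n' + body


xml_text = build_sitemap([
    {"loc": "https://beatsigner.com/", "lastmod": "2024-11-24",
     "changefreq": "weekly", "priority": "1.0"},
    {"loc": "https://beatsigner.com/publications.html", "lastmod": "2024-11-24",
     "changefreq": "weekly", "priority": "0.9"},
])
print(xml_text)
```

The resulting file is typically placed in the root directory of the website and submitted to the search engines or referenced from robots.txt.
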
Questions
▪ Is PageRank fair?
▪ What about Google's power and influence?
▪ What about Web 2.0 or Web 3.0 and web search?
  ▪ "non-existent" webpages such as offered by Rich Internet Applications (e.g. using AJAX) may cause problems for traditional search engines (hidden web)
  ▪ new forms of social search
    - social bookmarking
    - ...
  ▪ social marketing

The Google Effect
▪ A study by Sparrow et al. (2011) shows that people are less likely to remember things that they believe to be accessible online
  ▪ Internet as a transactive memory
▪ Does our memory work differently in the age of Google?
▪ What implications will the future of the Internet and new forms of search have?

Search Engine Marketing (SEM)
▪ For many companies Internet marketing has become a big business
▪ Search engine marketing (SEM) aims to increase the visibility of a website
  ▪ search engine optimisation (SEO)
  ▪ paid search advertising (non-organic search)
  ▪ social media marketing
▪ SEO should not be decoupled from a website's content, structure, design and the technologies used
▪ SEO has to be seen as a continuous process in a rapidly changing environment
  ▪ different search engines with regular changes in ranking

Structural Choices
▪ Keep the website structure as flat as possible
  ▪ minimise link depth
  ▪ avoid pages with much more than 100 links
▪ Think about your website's internal link structure
  ▪ which pages are directly linked from the homepage?
  ▪ create many internal links for important pages
  ▪ be "careful" about where to put outgoing links
    - PageRank leakage
  ▪ use keyword-rich anchor texts
  ▪ dynamically create links between related content
    - e.g. "customers who bought this also bought ..." or "visitors who viewed this also viewed ..."
▪ Increase the number of pages

Technological Choices
▪ Use an SEO-friendly content management system (CMS)
▪ Dynamic URLs vs. static URLs
  ▪ avoid session IDs and parameters in URLs
  ▪ use URL rewriting to get descriptive URLs containing keywords
▪ Think carefully about the use of dynamic content
  ▪ Rich Internet Applications (RIAs) based on AJAX etc.
  ▪ content hidden behind pull-down menus etc.
▪ Address webpages consistently
  ▪ https://www.vub.ac.be vs. https://www.vub.ac.be/index.php

Search Engine Optimisations
▪ Different things can be optimised
  ▪ on-page factors
  ▪ off-page factors
▪ It is assumed that some search engines use more than 200 on-page and off-page factors for their ranking
▪ Difference between optimisation and breaking the "search engine rules"
  ▪ white hat and black hat optimisations
▪ A bad ranking or removal from the index can cost a company a lot of money or even mark the end of the company
  ▪ e.g. supplemental index ("Google hell")

Positive On-Page Factors
▪ Use of keywords at relevant places
  ▪ in title tag (preferably one of the first words)
  ▪ in URL and domain name
  ▪ in header tags (e.g. <h1>) and multiple times in body text
▪ Mobile usability
  ▪ mobile-first indexing by Google since 2016
▪ Fast page load times
  ▪ mobile as well as desktop
▪ Provide metadata
  ▪ e.g. <meta name="description"> also used by search engines to create the text snippets on the SERPs

Negative On-Page Factors
▪ Links to "bad neighbourhood"
▪ Link selling
  ▪ in 2007 Google announced a campaign against paid links that transfer PageRank
▪ Over-optimisation penalty (keyword stuffing)
▪ Text with same colour as background (hidden content)
▪ Automatic redirect via the refresh meta tag
▪ Cloaking
  ▪ different pages for spider and user
▪ Malware being hosted on the page

Negative On-Page Factors ...
▪ Duplicate or similar content
▪ Duplicate page titles or meta tags
▪ Slow page load time
▪ Any copyright violations
▪ ...

Positive Off-Page Factors
▪ Links from pages with a high PageRank
▪ Keywords in anchor text of inbound links
▪ Links from topically relevant sites
▪ High clickthrough rate (CTR) from search engine for a given keyword
▪ High number of shares on social media (social signals)
  ▪ e.g. Facebook or Twitter
▪ Site age (stability)
▪ Domain expiration date
▪ ...

Negative Off-Page Factors
▪ Site often not accessible to crawlers
  ▪ e.g. server problem
▪ High bounce rate
  ▪ users immediately press the back button
▪ Link buying
  ▪ rapidly increasing number of inbound links
▪ Use of link farms
▪ Participation in link sharing programmes
▪ Links from bad neighbourhood?
▪ Competitor attack (e.g. via duplicate content)?

Black Hat Optimisations (Don'ts)
▪ Link farms
▪ Spamdexing in guestbooks, Wikipedia etc.
  ▪ "solution": <a rel="nofollow" href="...">...</a>
▪ Keyword stuffing
  ▪ overuse of keywords
    - content keyword stuffing
    - image keyword stuffing
    - keywords in meta tags
    - invisible text with keywords
▪ Selling/buying links
  ▪ "big" business until 2007
  ▪ costs based on the PageRank of the linking site

Black Hat Optimisations (Don'ts) ...
▪ Doorway pages (cloaking)
  ▪ doorway pages are normally just designed for search engines
    - user is automatically redirected to the target page
  ▪ e.g. BMW Germany and Ricoh Germany banned in February 2006

Nofollow Link Example
▪ nofollow value for hyperlinks introduced by Google in 2005 to avoid spamdexing
  ▪ <a rel="nofollow" href="...">...</a>
▪ Links with a nofollow value were not counted in the PageRank computation
  ▪ division by number of outgoing links
  ▪ e.g. a page with 9 outgoing links, 3 of which are nofollow links
    - PageRank divided by 6 and distributed across the 6 "really linked pages"
▪ SEO experts started to use (misuse) the nofollow links for PageRank sculpting
  ▪ control flow of PageRank within a website

Nofollow Link Example ...
▪ In June 2009 Google decided to treat nofollow links differently to avoid PageRank sculpting
  ▪ division by total number of outgoing links
  ▪ e.g. a page with 9 outgoing links, 3 of which are nofollow links
    - PageRank divided by 9 and distributed across the 6 "really linked pages"
  ▪ no longer a good solution to prevent spamdexing since we lose (diffuse) some PageRank
▪ SEO experts started to use alternative techniques to replace nofollow links
  ▪ e.g. obfuscated JavaScript links

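The difference between the two treatments of nofollow links can be made explicit with a small calculation for the page with 9 outgoing links. The function name, the flag `since_2009` and the page's PageRank of 1.0 are hypothetical and only serve as an illustration.

```python
def rank_per_followed_link(page_rank: float, total_links: int,
                           nofollow_links: int, since_2009: bool) -> float:
    """PageRank passed along each followed link under the two treatments."""
    followed = total_links - nofollow_links
    if followed == 0:
        return 0.0
    # Before June 2009 only followed links were counted in the division,
    # afterwards all outgoing links are counted.
    divisor = total_links if since_2009 else followed
    return page_rank / divisor


page_rank = 1.0  # hypothetical PageRank of the linking page
before = rank_per_followed_link(page_rank, 9, 3, since_2009=False)
after = rank_per_followed_link(page_rank, 9, 3, since_2009=True)
lost = page_rank - 6 * after  # share that is no longer passed on

print(round(before, 3), round(after, 3), round(lost, 3))  # 0.167 0.111 0.333
```

Under the newer treatment a third of the distributable PageRank simply disappears, which is why marking links as nofollow no longer helps to concentrate PageRank on selected pages.
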
Non-Organic Search
▪ In addition to the so-called organic search, websites can also participate in non-organic web search
  ▪ cost per impression (CPI)
  ▪ cost-per-click (CPC)
▪ The non-organic web search should not be treated independently from the organic web search
▪ Quality of the landing page can have an impact on the non-organic web search performance!
▪ The Google Ads programme is an example of a commercial non-organic web search service
  ▪ other services include Yahoo! Advertising Solutions, Facebook Ads, ...

Google Ads and Google AdSense
▪ pay-per-click (PPC) or cost-per-thousand (CPM)
▪ Campaigns and ad groups
▪ Two types of advertising
  ▪ search
  ▪ content network
    - Google AdSense
▪ Highly customisable ads
  ▪ region
  ▪ language
  ▪ daytime
  ▪ ...

Google Ads ...
▪ Excellent control and monitoring for Ads users
  ▪ cost per conversion
▪ Google advertising revenues
  ▪ 2023: USD 237.86 billion (total revenues USD 305.6 billion)

Conclusions
▪ Web information retrieval techniques have to deal with the specific characteristics of the Web
▪ PageRank algorithm
  ▪ absolute quality of a page based on incoming links
  ▪ based on random surfer model
  ▪ computed as eigenvector of Google matrix G
▪ PageRank is just one factor
▪ Various implications for website development and SEO

References
▪ L. Page, S. Brin, R. Motwani and T. Winograd, The PageRank Citation Ranking: Bringing Order to the Web, January 1998
▪ S. Brin and L. Page, The Anatomy of a Large-Scale Hypertextual Web Search Engine, Computer Networks and ISDN Systems, 30(1-7), April 1998
  ▪ http://ilpubs.stanford.edu:8090/361/1/1998-8.pdf
▪ A.N. Langville and C.D. Meyer, Google's PageRank and Beyond: The Science of Search Engine Rankings, Princeton University Press, July 2006

References ...
▪ B. Sparrow, J. Liu and D.M. Wegner, Google Effects on Memory: Cognitive Consequences of Having Information at Our Fingertips, Science, July 2011
  ▪ https://doi.org/10.1126/science.1207745
▪ Google Search Central
  ▪ https://developers.google.com/search
▪ The W3C Markup Validation Service
  ▪ https://validator.w3.org
▪ SEO Book
  ▪ https://www.seobook.com