Web Scraping | George Maharjan | Gurzu

Gurzu
June 09, 2025

Learn how to turn websites into data with precision and purpose. Discover the tools and techniques for effective web mining.


Transcript

  1. Web Scraping: Behind the Bots and Tools of Mining the Web

    Learn how to turn websites into data with precision and purpose. Discover the tools and techniques for effective web mining. by George Maharjan
  2. What Is Web Scraping?

    Web scraping is the process of using automated tools to extract data from websites. It’s like sending a robot to a website to read the content (as a human would) and copy the information into a format you can use, such as a spreadsheet or a database.
  3. What Is Web Scraping?

    Decisions: data-driven insights for business strategy.
    Visibility: competitor pricing and market trends.
    Access: public data beyond API limitations.

    Web scraping unlocks valuable information that drives competitive advantage. It provides access to data that is otherwise difficult to obtain systematically.
  4. Common Use Cases

    | Use Case         | Industry      | Example                         |
    |------------------|---------------|---------------------------------|
    | Price Monitoring | E-commerce    | Track Amazon product prices     |
    | Job Aggregation  | Recruitment   | Scrape listings from Indeed     |
    | Real Estate      | Property Tech | Monitor housing prices          |
    | News Aggregation | Media         | Collect headlines and summaries |
  5. How Web Scraping Works

    Send Request: access the target URL.
    Receive HTML: raw webpage source code.
    Parse HTML: use DOM parsers or selectors.
    Store Results: save as CSV, DB, or JSON.
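
The four steps on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the speaker's code: it uses only the standard library (`urllib.request`, `html.parser`, `csv`) instead of BeautifulSoup so that it runs without extra installs. The `h2.headline` markup, the `example-scraper/0.1` User-Agent string, and all function names are assumptions made for the example.

```python
import csv
import io
from html.parser import HTMLParser
from urllib.request import Request, urlopen


def fetch_html(url: str, timeout: float = 10.0) -> str:
    """Steps 1-2: send the request and return the raw HTML as text."""
    # The User-Agent value is a placeholder chosen for this sketch.
    request = Request(url, headers={"User-Agent": "example-scraper/0.1"})
    with urlopen(request, timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


class HeadlineParser(HTMLParser):
    """Step 3: collect the text of every <h2 class="headline"> element."""

    def __init__(self) -> None:
        super().__init__()
        self.headlines: list[str] = []
        self._in_headline = False
        self._buffer: list[str] = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs.
        if tag == "h2" and ("class", "headline") in attrs:
            self._in_headline = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_headline:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "h2" and self._in_headline:
            self.headlines.append("".join(self._buffer).strip())
            self._in_headline = False


def parse_headlines(html: str) -> list[str]:
    parser = HeadlineParser()
    parser.feed(html)
    parser.close()
    return parser.headlines


def to_csv(headlines: list[str]) -> str:
    """Step 4: serialize the results as CSV text."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["headline"])
    for headline in headlines:
        writer.writerow([headline])
    return out.getvalue()


# Inline sample page so the sketch runs without network access.
SAMPLE_HTML = """
<html><body>
  <h2 class="headline">First story</h2>
  <p>Summary text</p>
  <h2 class="headline">Second <em>story</em></h2>
  <h2>Not a headline</h2>
</body></html>
"""

if __name__ == "__main__":
    print(to_csv(parse_headlines(SAMPLE_HTML)))
```

Against a live site, `parse_headlines(fetch_html(url))` would replace the inline sample; the selector logic would need to match that site's actual markup.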
  6. Tools of the Trade

    Libraries: Nokogiri, BeautifulSoup, or Selenium, chosen according to the content type: static HTML, dynamic pages, or user interaction.
    No-Code & Cloud: Octoparse, ParseHub, or Web Scraper to quickly extract data without coding, ideal for simple or one-off projects.
    Headless Browsers: Selenium, Playwright, and Puppeteer for scraping JavaScript-heavy or interactive web pages without displaying a UI.
  7. Best Practices for Responsible Scraping

    Research First: check for APIs before scraping.
    Throttle Requests: use delays between requests.
    Rotate IPs and User Agents: avoid triggering security measures.
    Respect Privacy: never collect sensitive personal data.

    Scrape smart, scrape right. The web contains valuable data, but access it responsibly and ethically.
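
The "use delays between requests" practice could look roughly like the following. It is a sketch under assumptions: the `Throttle` class and its parameters are hypothetical names for this example, and the clock and sleep functions are injectable only so the behaviour can be checked without real waiting.

```python
import time


class Throttle:
    """Enforce a minimum delay between consecutive requests."""

    def __init__(self, min_interval: float, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous request, if any

    def wait(self) -> float:
        """Block until min_interval has passed since the previous call.

        Returns the number of seconds slept (0.0 if no wait was needed).
        """
        now = self._clock()
        delay = 0.0
        if self._last is not None:
            delay = max(0.0, self.min_interval - (now - self._last))
            if delay > 0:
                self._sleep(delay)
        self._last = self._clock()
        return delay


if __name__ == "__main__":
    throttle = Throttle(min_interval=1.0)
    for page in ("page-1", "page-2", "page-3"):  # placeholder work items
        slept = throttle.wait()
        print(f"{page}: waited {slept:.2f}s before requesting")
```

Calling `throttle.wait()` immediately before each request keeps the request rate at or below one per `min_interval` seconds, regardless of how long each response takes to process.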
  8. Ethics and Legal Considerations

    Terms of Service: many websites explicitly prohibit scraping in their terms.
    Robots.txt: this file indicates which parts of a site can be crawled.
    Server Load: excessive requests can overload websites and disrupt service.
    Privacy: collecting personal data raises serious ethical and legal issues.

    The hiQ Labs v. LinkedIn case highlighted the legal gray areas. The fact that data is publicly accessible doesn't automatically mean scraping it is permitted.
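
Checking robots.txt before crawling can be done with Python's standard `urllib.robotparser` module. The sketch below parses an inline sample file so it runs offline; the rules, the `example-scraper` user agent, and the `example.com` URLs are all made up for illustration. Passing this check covers only the robots.txt item above, not terms of service or privacy obligations.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content used only for this example.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 5
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def is_allowed(parser: RobotFileParser, user_agent: str, url: str) -> bool:
    """Return True if the rules permit this user agent to fetch the URL."""
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    rules = build_parser(SAMPLE_ROBOTS_TXT)
    for url in (
        "https://example.com/products/1",
        "https://example.com/private/report",
    ):
        print(url, "->", is_allowed(rules, "example-scraper", url))
    print("crawl delay:", rules.crawl_delay("example-scraper"))
```

For a live site, `RobotFileParser.set_url()` followed by `read()` fetches and parses the file directly instead of using inline text.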