the web.
• Spider: A piece of software designed to extract links and items from webpages.
• Crawl: Visit all the pages of interest on a site using your spider.
• Scrapy Cloud: Hosted crawling at scrapinghub.com
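To make the "extract links" half of the Spider definition concrete, here is a minimal sketch using only the Python standard library. It is not how Scrapy implements spiders; `LinkCollector`, `extract_links` and the example URLs are hypothetical names chosen for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collects absolute URLs from <a href="..."> tags (hypothetical helper)."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links against the page URL.
                self.links.append(urljoin(self.base_url, value))


def extract_links(html, base_url):
    """Return every link found in an HTML string, in document order."""
    collector = LinkCollector(base_url)
    collector.feed(html)
    return collector.links
```

A crawl, in this simplified picture, would repeatedly call `extract_links` on each downloaded page and queue the URLs that are of interest.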
spiders.
• SitemapSpider: Crawls a site by discovering its URLs through Sitemaps.
• XMLFeedSpider: Parses XML feeds by iterating over nodes with a given name.
• CrawlSpider: Provides a convenient mechanism for following links by defining a set of rules.
development.
• project-prod
• project-dev
• In development, the data is verified and all possible transformations are applied to it.
• shub lets you deploy using the latest commit of the branch.
tags from HTML snippets
• extract base URL from HTML snippets
• translate entities in HTML strings
• convert raw HTTP headers to dicts and vice versa
• construct HTTP auth header
• convert HTML pages to Unicode
• sanitize URLs (like browsers do)
• extract arguments from URLs
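To give a feel for what three of these operations involve, here is a small sketch written with the Python standard library only. It does not reproduce any particular library's API; the function names `make_basic_auth_header`, `query_arguments` and `parse_raw_headers` are hypothetical.

```python
import base64
from urllib.parse import parse_qs, urlsplit


def make_basic_auth_header(username, password):
    """Build the value of an HTTP Basic Authorization header."""
    credentials = f"{username}:{password}".encode("utf-8")
    return "Basic " + base64.b64encode(credentials).decode("ascii")


def query_arguments(url):
    """Return the query-string arguments of a URL as a dict of lists."""
    return parse_qs(urlsplit(url).query)


def parse_raw_headers(raw):
    """Convert raw 'Name: value' header lines into a dict of lists."""
    headers = {}
    for line in raw.splitlines():
        if ":" not in line:
            continue  # skip blank or malformed lines
        name, value = line.split(":", 1)
        headers.setdefault(name.strip(), []).append(value.strip())
    return headers
```

Values are kept as lists because both query strings and HTTP headers may repeat the same name.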