Upgrade to Pro
— share decks privately, control downloads, hide ads and more …
Speaker Deck
Features
Speaker Deck
PRO
Sign in
Sign up for free
Search
Search
Scrapy Overview
Search
JusBrasil
April 12, 2013
Programming
2
190
Scrapy Overview
An overview of the Scrapy framework by @cacovsky
JusBrasil
April 12, 2013
Tweet
Share
Other Decks in Programming
See All in Programming
コントリビューターによるDenoのすゝめ / Deno Recommendations by a Contributor
petamoriken
0
200
Honoを使ったリモートMCPサーバでAIツールとの連携を加速させる!
tosuri13
1
170
Basic Architectures
denyspoltorak
0
650
Fluid Templating in TYPO3 14
s2b
0
120
humanlayerのブログから学ぶ、良いCLAUDE.mdの書き方
tsukamoto1783
0
170
Apache Iceberg V3 and migration to V3
tomtanaka
0
130
副作用をどこに置くか問題:オブジェクト指向で整理する設計判断ツリー
koxya
1
570
Data-Centric Kaggle
isax1015
2
750
AI 駆動開発ライフサイクル(AI-DLC):ソフトウェアエンジニアリングの再構築 / AI-DLC Introduction
kanamasa
11
6.2k
KIKI_MBSD Cybersecurity Challenges 2025
ikema
0
1.3k
AI時代の認知負荷との向き合い方
optfit
0
130
2026年 エンジニアリング自己学習法
yumechi
0
120
Featured
See All Featured
Ethics towards AI in product and experience design
skipperchong
2
190
Mobile First: as difficult as doing things right
swwweet
225
10k
[Rails World 2023 - Day 1 Closing Keynote] - The Magic of Rails
eileencodes
38
2.7k
Technical Leadership for Architectural Decision Making
baasie
1
230
Agile Actions for Facilitating Distributed Teams - ADO2019
mkilby
0
110
Reflections from 52 weeks, 52 projects
jeffersonlam
356
21k
The AI Search Optimization Roadmap by Aleyda Solis
aleyda
1
5.2k
Joys of Absence: A Defence of Solitary Play
codingconduct
1
280
Exploring the Power of Turbo Streams & Action Cable | RailsConf2023
kevinliebholz
37
6.3k
Let's Do A Bunch of Simple Stuff to Make Websites Faster
chriscoyier
508
140k
Crafting Experiences
bethany
1
44
Test your architecture with Archunit
thirion
1
2.1k
Transcript
Scrapy an overview
/skræpi/
Web Crawler vs. Web Scraper
None
None
Scrapy Framework Scraping / Crawling / Monitoring / Testing
Stable Active Large community
~200 pages of docs
Commercial support
Framework?
None
None
None
Twisted event loop (reactor)
None
Your code goes here
The scraping logic
None
HttpErrorMiddleware UrlLengthMiddleware DepthMiddleware
HttpProxyMiddleware HttpCacheMiddleware RedirectMiddleware
Media download Persistence Post-processing
Data flow control
Queuing
Talk is cheap, show me the code.
$ pip install Scrapy $ scrapy startproject home_news
home_news/ scrapy.cfg home_news/ __init__.py items.py pipelines.py settings.py spiders/ __init__.py ...
home_news/ scrapy.cfg home_news/ __init__.py items.py pipelines.py settings.py spiders/ __init__.py ...
Project root
home_news/ scrapy.cfg home_news/ __init__.py items.py pipelines.py settings.py spiders/ __init__.py ...
Project config
home_news/ scrapy.cfg home_news/ __init__.py items.py pipelines.py settings.py spiders/ __init__.py ...
Project module
home_news/ scrapy.cfg home_news/ __init__.py items.py pipelines.py settings.py spiders/ __init__.py ...
Your items
home_news/ scrapy.cfg home_news/ __init__.py items.py pipelines.py settings.py spiders/ __init__.py ...
Your pipelines
home_news/ scrapy.cfg home_news/ __init__.py items.py pipelines.py settings.py spiders/ __init__.py ...
Your settings
home_news/ scrapy.cfg home_news/ __init__.py items.py pipelines.py settings.py spiders/ __init__.py ...
Your spiders...
None
//*[@id="glbcorpo"]/div/div[1]/div[1]/div[2]/div[1]/div[1]/div/div/a/@href
//*[@id="glbmateria"]/div[2]/h1/text()
//*[@id="materialetra"]/div/div/p[1]/text()
None
$ pwd /home/caco/studies/scrapy_news/home_news
$ pwd /home/caco/studies/scrapy_news/home_news (project root)
$ pwd /home/caco/studies/scrapy_news/home_news $ scrapy crawl g1 -o scraped_data.json -t
json
$ pwd /home/caco/studies/scrapy_news/home_news $ scrapy crawl g1 -o scraped_data.json -t
json
$ pwd /home/caco/studies/scrapy_news/home_news $ scrapy crawl g1 -o scraped_data.json -t
json
$ pwd /home/caco/studies/scrapy_news/home_news $ scrapy crawl g1 -o scraped_data.json -t
json (feed exporters: json,csv,xml)
None
None
None
Other nice features • scrapyd: run as a service •
Webservice (issue commands via http requests) • Signals • Stats module • Contribs (CrawlSpider etc)
Obrigado! @cacovsky Thanks! @cacovsky
Images Spatula http://www.duebuoi.it/x/uk_usd/catalog/p/spatulas~805-16x10.html Spiderman http://tincan21.deviantart.com/art/muro-spidey-307810412