Upgrade to Pro
— share decks privately, control downloads, hide ads and more …
Speaker Deck
Features
Speaker Deck
PRO
Sign in
Sign up for free
Search
Search
Scrapy Overview
Search
JusBrasil
April 12, 2013
Programming
2
200
Scrapy Overview
An overview of the Scrapy framework by @cacovsky
JusBrasil
April 12, 2013
Tweet
Share
Other Decks in Programming
See All in Programming
Go Conference mini in Sendai 2026 : Goに新機能を提案し実装されるまでのフロー徹底解説
yamatoya
0
570
AI時代のシステム設計:ドメインモデルで変更しやすさを守る設計戦略
masuda220
PRO
5
920
How to stabilize UI tests using XCTest
akkeylab
0
120
Agent Skills Workshop - AIへの頼み方を仕組み化する
gotalab555
15
8.6k
コードレビューをしない選択 #でぃーぷらすトウキョウ
kajitack
3
920
Windows on Ryzen and I
seosoft
0
260
野球解説AI Agentを開発してみた - 2026/02/27 LayerX社内LT会資料
shinyorke
PRO
0
260
encoding/json/v2のUnmarshalはこう変わった:内部実装で見る設計改善
kurakura0916
0
400
Docコメントで始める簡単ガードレール
keisukeikeda
1
110
ポーリング処理廃止によるイベント駆動アーキテクチャへの移行
seitarof
3
1k
Ruby x Terminal
a_matsuda
7
590
メタプログラミングで実現する「コードを仕様にする」仕組み/nikkei-tech-talk43
nikkei_engineer_recruiting
0
180
Featured
See All Featured
The Illustrated Guide to Node.js - THAT Conference 2024
reverentgeek
1
300
Avoiding the “Bad Training, Faster” Trap in the Age of AI
tmiket
0
100
How Fast Is Fast Enough? [PerfNow 2025]
tammyeverts
3
480
The Cult of Friendly URLs
andyhume
79
6.8k
What the history of the web can teach us about the future of AI
inesmontani
PRO
1
470
Making Projects Easy
brettharned
120
6.6k
Intergalactic Javascript Robots from Outer Space
tanoku
273
27k
Balancing Empowerment & Direction
lara
5
940
Thoughts on Productivity
jonyablonski
75
5.1k
職位にかかわらず全員がリーダーシップを発揮するチーム作り / Building a team where everyone can demonstrate leadership regardless of position
madoxten
61
52k
YesSQL, Process and Tooling at Scale
rocio
174
15k
ラッコキーワード サービス紹介資料
rakko
1
2.6M
Transcript
Scrapy an overview
/skræpi/
Web Crawler vs. Web Scraper
None
None
Scrapy Framework Scraping / Crawling / Monitoring / Testing
Stable Active Large community
~200 pages of docs
Commercial support
Framework?
None
None
None
Twisted event loop (reactor)
None
Your code goes here
The scraping logic
None
HttpErrorMiddleware UrlLengthMiddleware DepthMiddleware
HttpProxyMiddleware HttpCacheMiddleware RedirectMiddleware
Media download Persistence Post-processing
Data flow control
Queuing
Talk is cheap, show me the code.
$ pip install Scrapy $ scrapy startproject home_news
home_news/ scrapy.cfg home_news/ __init__.py items.py pipelines.py settings.py spiders/ __init__.py ...
home_news/ scrapy.cfg home_news/ __init__.py items.py pipelines.py settings.py spiders/ __init__.py ...
Project root
home_news/ scrapy.cfg home_news/ __init__.py items.py pipelines.py settings.py spiders/ __init__.py ...
Project config
home_news/ scrapy.cfg home_news/ __init__.py items.py pipelines.py settings.py spiders/ __init__.py ...
Project module
home_news/ scrapy.cfg home_news/ __init__.py items.py pipelines.py settings.py spiders/ __init__.py ...
Your items
home_news/ scrapy.cfg home_news/ __init__.py items.py pipelines.py settings.py spiders/ __init__.py ...
Your pipelines
home_news/ scrapy.cfg home_news/ __init__.py items.py pipelines.py settings.py spiders/ __init__.py ...
Your settings
home_news/ scrapy.cfg home_news/ __init__.py items.py pipelines.py settings.py spiders/ __init__.py ...
Your spiders...
None
//*[@id="glbcorpo"]/div/div[1]/div[1]/div[2]/div[1]/div[1]/div/div/a/@href
//*[@id="glbmateria"]/div[2]/h1/text()
//*[@id="materialetra"]/div/div/p[1]/text()
None
$ pwd /home/caco/studies/scrapy_news/home_news
$ pwd /home/caco/studies/scrapy_news/home_news (project root)
$ pwd /home/caco/studies/scrapy_news/home_news $ scrapy crawl g1 -o scraped_data.json -t
json
$ pwd /home/caco/studies/scrapy_news/home_news $ scrapy crawl g1 -o scraped_data.json -t
json
$ pwd /home/caco/studies/scrapy_news/home_news $ scrapy crawl g1 -o scraped_data.json -t
json
$ pwd /home/caco/studies/scrapy_news/home_news $ scrapy crawl g1 -o scraped_data.json -t
json (feed exporters: json,csv,xml)
None
None
None
Other nice features • scrapyd: run as a service •
Webservice (issue commands via http requests) • Signals • Stats module • Contribs (CrawlSpider etc)
Obrigado! @cacovsky Thanks! @cacovsky
Images Spatula http://www.duebuoi.it/x/uk_usd/catalog/p/spatulas~805-16x10.html Spiderman http://tincan21.deviantart.com/art/muro-spidey-307810412