Scrapy Overview
JusBrasil
April 12, 2013
An overview of the Scrapy framework by @cacovsky
Transcript
Scrapy an overview
/skræpi/
Web Crawler vs. Web Scraper
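The slide only names the two terms. On the usual reading, a scraper extracts data from a page it is given, while a crawler discovers pages by following links. The sketch below illustrates that distinction with the standard library only; the in-memory `PAGES` dictionary and all function names are hypothetical and stand in for real HTTP fetches.

```python
from collections import deque
from html.parser import HTMLParser

# Hypothetical in-memory "site" so the sketch runs without network access.
PAGES = {
    "/": '<a href="/a">A</a> <a href="/b">B</a>',
    "/a": '<h1>Story A</h1> <a href="/b">B</a>',
    "/b": '<h1>Story B</h1> <a href="/">home</a>',
}


class PageParser(HTMLParser):
    """Collects link targets and <h1> text from a single page."""

    def __init__(self):
        super().__init__()
        self.links = []
        self.headlines = []
        self._in_h1 = False

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)
        elif tag == "h1":
            self._in_h1 = True

    def handle_endtag(self, tag):
        if tag == "h1":
            self._in_h1 = False

    def handle_data(self, data):
        if self._in_h1:
            self.headlines.append(data.strip())


def parse(url):
    parser = PageParser()
    parser.feed(PAGES[url])
    return parser


def scrape(url):
    """Scraping: extract data from one known page."""
    return parse(url).headlines


def crawl(start):
    """Crawling: discover pages by following links, breadth-first."""
    seen, queue, order = {start}, deque([start]), []
    while queue:
        url = queue.popleft()
        order.append(url)
        for link in parse(url).links:
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return order
```

Scrapy combines both roles: a spider follows links (crawling) and yields extracted data (scraping) from the pages it visits.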
Scrapy Framework Scraping / Crawling / Monitoring / Testing
Stable Active Large community
~200 pages of docs
Commercial support
Framework?
Twisted event loop (reactor)
Your code goes here
The scraping logic
HttpErrorMiddleware UrlLengthMiddleware DepthMiddleware
HttpProxyMiddleware HttpCacheMiddleware RedirectMiddleware
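Most of the built-in middlewares named above are tuned through project settings rather than code. The fragment below is a sketch of what that might look like in `settings.py`; the setting names are Scrapy's, but every value is an illustrative assumption, not something taken from the talk.

```python
# settings.py fragment (all values are illustrative assumptions)

# DepthMiddleware: stop following links more than 3 hops from a start URL.
DEPTH_LIMIT = 3

# UrlLengthMiddleware: ignore requests whose URL is longer than this.
URLLENGTH_LIMIT = 2083

# HttpErrorMiddleware: let 404 responses reach the spider instead of
# being filtered out as errors.
HTTPERROR_ALLOWED_CODES = [404]

# HttpCacheMiddleware: cache responses on disk, handy while developing.
HTTPCACHE_ENABLED = True

# RedirectMiddleware: follow redirects, but give up after 5 in a row.
REDIRECT_ENABLED = True
REDIRECT_MAX_TIMES = 5
```

HttpProxyMiddleware is not configured this way; it picks up a proxy from the standard `http_proxy` / `https_proxy` environment variables or from `request.meta["proxy"]`.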
Media download Persistence Post-processing
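Post-processing of this kind is typically done in an item pipeline: a plain class with a `process_item(self, item, spider)` method that receives every scraped item. Here is a minimal sketch; the class name and the cleaning rule are hypothetical, chosen only to show the shape of a pipeline.

```python
class CleanTextPipeline:
    """Hypothetical pipeline: normalizes whitespace in every string field."""

    def process_item(self, item, spider):
        for key, value in list(item.items()):
            if isinstance(value, str):
                # Collapse runs of whitespace and trim both ends.
                item[key] = " ".join(value.split())
        # Returning the item hands it to the next pipeline in the chain.
        return item
```

A pipeline is switched on by listing its import path in the `ITEM_PIPELINES` setting of the project's `settings.py`.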
Data flow control
Queuing
Talk is cheap, show me the code.
$ pip install Scrapy
$ scrapy startproject home_news
home_news/                 (project root)
    scrapy.cfg             (project config)
    home_news/             (project module)
        __init__.py
        items.py           (your items)
        pipelines.py       (your pipelines)
        settings.py        (your settings)
        spiders/           (your spiders...)
            __init__.py
            ...
//*[@id="glbcorpo"]/div/div[1]/div[1]/div[2]/div[1]/div[1]/div/div/a/@href
//*[@id="glbmateria"]/div[2]/h1/text()
//*[@id="materialetra"]/div/div/p[1]/text()
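To see how expressions like these pick out nodes, the sketch below runs the last two against a tiny made-up page using only the standard library. The HTML is a hypothetical stand-in, since the real page markup is not in the slides. `xml.etree.ElementTree` supports only a subset of XPath, so the trailing `text()` step is replaced by reading the element's `.text` attribute.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified stand-in for an article page.
PAGE = """
<html><body>
  <div id="glbmateria">
    <div>header</div>
    <div><h1>Example headline</h1></div>
  </div>
  <div id="materialetra">
    <div><div><p>First paragraph.</p><p>Second paragraph.</p></div></div>
  </div>
</body></html>
"""

root = ET.fromstring(PAGE)

# //*[@id="glbmateria"]/div[2]/h1/text()
# Any element with that id, then its second <div> child, then the <h1>.
title = root.find(".//*[@id='glbmateria']/div[2]/h1").text

# //*[@id="materialetra"]/div/div/p[1]/text()
# p[1] is the first <p>; XPath positions start at 1, not 0.
first_paragraph = root.find(".//*[@id='materialetra']/div/div/p[1]").text
```

Long positional paths such as the first expression are brittle: one extra wrapper `<div>` on the site is enough to break them, which is why anchoring on an `id` as early as possible helps.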
$ pwd
/home/caco/studies/scrapy_news/home_news (project root)
$ scrapy crawl g1 -o scraped_data.json -t json
(feed exporters: json, csv, xml)
Other nice features
• scrapyd: run as a service
• Webservice (issue commands via HTTP requests)
• Signals
• Stats module
• Contribs (CrawlSpider etc.)
Thanks! @cacovsky
Images
Spatula: http://www.duebuoi.it/x/uk_usd/catalog/p/spatulas~805-16x10.html
Spiderman: http://tincan21.deviantart.com/art/muro-spidey-307810412