JusBrasil
April 12, 2013
Scrapy Overview
An overview of the Scrapy framework by @cacovsky
Transcript
Scrapy an overview
/skræpi/
Web Crawler vs. Web Scraper
Scrapy Framework: Scraping / Crawling / Monitoring / Testing
Stable, active, large community
~200 pages of docs
Commercial support
Framework?
Twisted event loop (reactor)
Your code goes here
The scraping logic
HttpErrorMiddleware, UrlLengthMiddleware, DepthMiddleware
HttpProxyMiddleware, HttpCacheMiddleware, RedirectMiddleware
Media download, persistence, post-processing
Data flow control
Queuing
Talk is cheap, show me the code.
$ pip install Scrapy
$ scrapy startproject home_news
home_news/                  Project root
    scrapy.cfg              Project config
    home_news/              Project module
        __init__.py
        items.py            Your items
        pipelines.py        Your pipelines
        settings.py         Your settings
        spiders/            Your spiders...
            __init__.py
            ...
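As a rough illustration of what could live in items.py and pipelines.py, here is a minimal sketch. The names NewsItem and CleanWhitespacePipeline and the fields url, title and body are hypothetical and do not come from the slides. It assumes dataclass items, which recent Scrapy releases accept but which postdate this 2013 talk; the process_item(item, spider) method is the hook Scrapy calls on each pipeline.

```python
from dataclasses import dataclass


@dataclass
class NewsItem:
    """Hypothetical item describing one scraped news article."""
    url: str
    title: str
    body: str


class CleanWhitespacePipeline:
    """Hypothetical pipeline that collapses runs of whitespace in text fields."""

    def process_item(self, item, spider):
        # str.split() with no arguments splits on any whitespace run,
        # so joining with a single space normalizes the text.
        item.title = " ".join(item.title.split())
        item.body = " ".join(item.body.split())
        return item
```

A pipeline like this only runs once it is listed in the ITEM_PIPELINES setting in settings.py.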
//*[@id="glbcorpo"]/div/div[1]/div[1]/div[2]/div[1]/div[1]/div/div/a/@href
//*[@id="glbmateria"]/div[2]/h1/text()
//*[@id="materialetra"]/div/div/p[1]/text()
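To show how expressions like these pick out data, here is a small sketch that runs the headline and first-paragraph paths against a made-up page. The markup in PAGE is hypothetical and heavily simplified, since the real page is not in the slides. It uses the standard library's ElementTree, which supports only a subset of XPath, instead of Scrapy's own selectors, so the text() step is replaced by reading the element's .text attribute.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified stand-in for the news page.
PAGE = """
<html><body>
  <div id="glbmateria">
    <div>section header</div>
    <div><h1>Example headline</h1></div>
  </div>
  <div id="materialetra">
    <div><div>
      <p>First paragraph.</p>
      <p>Second paragraph.</p>
    </div></div>
  </div>
</body></html>
"""

root = ET.fromstring(PAGE.strip())

# ".//*[@id='...']" finds the element with that id anywhere in the tree;
# "div[2]" and "p[1]" are 1-based positions among same-named siblings.
headline = root.find(".//*[@id='glbmateria']/div[2]/h1").text
first_paragraph = root.find(".//*[@id='materialetra']/div/div/p[1]").text

print(headline)
print(first_paragraph)
```

In current Scrapy releases the full expressions, including text() and @href, can be passed to response.xpath(...) inside a spider callback.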
$ pwd
/home/caco/studies/scrapy_news/home_news
(project root)
$ scrapy crawl g1 -o scraped_data.json -t json
(feed exporters: json, csv, xml)
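Conceptually, a feed exporter takes the scraped items and serializes them in the chosen format. The sketch below illustrates that idea with the standard library only. It is not Scrapy's exporter code; the function name export_feed and the sample fields are hypothetical.

```python
import csv
import io
import json


def export_feed(items, fmt):
    """Serialize a list of flat dicts as 'json' or 'csv' text."""
    if fmt == "json":
        return json.dumps(items, ensure_ascii=False)
    if fmt == "csv":
        buffer = io.StringIO()
        # Column order follows the keys of the first item.
        fieldnames = list(items[0]) if items else []
        writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator="\n")
        writer.writeheader()
        writer.writerows(items)
        return buffer.getvalue()
    raise ValueError(f"unsupported format: {fmt}")


items = [{"title": "Example headline", "url": "http://example.com/a"}]
print(export_feed(items, "json"))
print(export_feed(items, "csv"))
```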
Other nice features
• scrapyd: run as a service
• Webservice (issue commands via HTTP requests)
• Signals
• Stats module
• Contribs (CrawlSpider etc.)
Thanks! @cacovsky
Images
Spatula: http://www.duebuoi.it/x/uk_usd/catalog/p/spatulas~805-16x10.html
Spiderman: http://tincan21.deviantart.com/art/muro-spidey-307810412