Upgrade to Pro
— share decks privately, control downloads, hide ads and more …
Speaker Deck
Features
Speaker Deck
PRO
Sign in
Sign up for free
Search
Search
Scrapy Overview
Search
JusBrasil
April 12, 2013
Programming
2
190
Scrapy Overview
An overview of the Scrapy framework by @cacovsky
JusBrasil
April 12, 2013
Tweet
Share
Other Decks in Programming
See All in Programming
AIレビュアーをスケールさせるには / Scaling AI Reviewers
technuma
2
210
WebAssemblyインタプリタを書く ~Component Modelを添えて~
ruccho
1
870
Scale out your Claude Code ~自社専用Agentで10xする開発プロセス~
yukukotani
9
2.4k
DockerからECSへ 〜 AWSの海に出る前に知っておきたいこと 〜
ota1022
5
1.7k
オープンセミナー2025@広島「君はどこで動かすか?」アンケート結果
satoshi256kbyte
0
150
Portapad紹介プレゼンテーション
gotoumakakeru
1
130
Flutterと Vibe Coding で個人開発!
hyshu
1
260
学習を成果に繋げるための個人開発の考え方 〜 「学習のための個人開発」のすすめ / personal project for leaning
panda_program
1
100
Vibe coding コードレビュー
kinopeee
0
460
兎に角、コードレビュー
mitohato14
0
140
オホーツクでコミュニティを立ち上げた理由―地方出身プログラマの挑戦 / TechRAMEN 2025 Conference
lemonade_37
2
480
Webinar: AI-Powered Development: Transformiere deinen Workflow mit Coding Tools und MCP Servern
danielsogl
0
140
Featured
See All Featured
Sharpening the Axe: The Primacy of Toolmaking
bcantrill
44
2.4k
RailsConf 2023
tenderlove
30
1.2k
Easily Structure & Communicate Ideas using Wireframe
afnizarnur
194
16k
Why Our Code Smells
bkeepers
PRO
338
57k
Java REST API Framework Comparison - PWX 2021
mraible
33
8.8k
Scaling GitHub
holman
462
140k
Performance Is Good for Brains [We Love Speed 2024]
tammyeverts
10
1k
Cheating the UX When There Is Nothing More to Optimize - PixelPioneers
stephaniewalter
283
13k
ReactJS: Keep Simple. Everything can be a component!
pedronauck
667
120k
The Success of Rails: Ensuring Growth for the Next 100 Years
eileencodes
46
7.6k
The Invisible Side of Design
smashingmag
301
51k
GraphQLの誤解/rethinking-graphql
sonatard
71
11k
Transcript
Scrapy an overview
/skræpi/
Web Crawler vs. Web Scraper
None
None
Scrapy Framework Scraping / Crawling / Monitoring / Testing
Stable Active Large community
~200 pages of docs
Commercial support
Framework?
None
None
None
Twisted event loop (reactor)
None
Your code goes here
The scraping logic
None
HttpErrorMiddleware UrlLengthMiddleware DepthMiddleware
HttpProxyMiddleware HttpCacheMiddleware RedirectMiddleware
Media download Persistence Post-processing
Data flow control
Queuing
Talk is cheap, show me the code.
$ pip install Scrapy $ scrapy startproject home_news
home_news/ scrapy.cfg home_news/ __init__.py items.py pipelines.py settings.py spiders/ __init__.py ...
home_news/ scrapy.cfg home_news/ __init__.py items.py pipelines.py settings.py spiders/ __init__.py ...
Project root
home_news/ scrapy.cfg home_news/ __init__.py items.py pipelines.py settings.py spiders/ __init__.py ...
Project config
home_news/ scrapy.cfg home_news/ __init__.py items.py pipelines.py settings.py spiders/ __init__.py ...
Project module
home_news/ scrapy.cfg home_news/ __init__.py items.py pipelines.py settings.py spiders/ __init__.py ...
Your items
home_news/ scrapy.cfg home_news/ __init__.py items.py pipelines.py settings.py spiders/ __init__.py ...
Your pipelines
home_news/ scrapy.cfg home_news/ __init__.py items.py pipelines.py settings.py spiders/ __init__.py ...
Your settings
home_news/ scrapy.cfg home_news/ __init__.py items.py pipelines.py settings.py spiders/ __init__.py ...
Your spiders...
None
//*[@id="glbcorpo"]/div/div[1]/div[1]/div[2]/div[1]/div[1]/div/div/a/@href
//*[@id="glbmateria"]/div[2]/h1/text()
//*[@id="materialetra"]/div/div/p[1]/text()
None
$ pwd /home/caco/studies/scrapy_news/home_news
$ pwd /home/caco/studies/scrapy_news/home_news (project root)
$ pwd /home/caco/studies/scrapy_news/home_news $ scrapy crawl g1 -o scraped_data.json -t
json
$ pwd /home/caco/studies/scrapy_news/home_news $ scrapy crawl g1 -o scraped_data.json -t
json
$ pwd /home/caco/studies/scrapy_news/home_news $ scrapy crawl g1 -o scraped_data.json -t
json
$ pwd /home/caco/studies/scrapy_news/home_news $ scrapy crawl g1 -o scraped_data.json -t
json (feed exporters: json,csv,xml)
None
None
None
Other nice features • scrapyd: run as a service •
Webservice (issue commands via http requests) • Signals • Stats module • Contribs (CrawlSpider etc)
Obrigado! @cacovsky Thanks! @cacovsky
Images Spatula http://www.duebuoi.it/x/uk_usd/catalog/p/spatulas~805-16x10.html Spiderman http://tincan21.deviantart.com/art/muro-spidey-307810412