Upgrade to Pro
— share decks privately, control downloads, hide ads and more …
Speaker Deck
Features
Speaker Deck
PRO
Sign in
Sign up for free
Search
Search
Scrapy Overview
Search
JusBrasil
April 12, 2013
Programming
2
190
Scrapy Overview
An overview of the Scrapy framework by @cacovsky
JusBrasil
April 12, 2013
Tweet
Share
Other Decks in Programming
See All in Programming
DMMを支える決済基盤の技術的負債にどう立ち向かうか / Addressing Technical Debt in Payment Infrastructure
yoshiyoshifujii
3
220
Composerが「依存解決」のためにどんな工夫をしているか #phpcon
o0h
PRO
1
340
なぜ適用するか、移行して理解するClean Architecture 〜構造を超えて設計を継承する〜 / Why Apply, Migrate and Understand Clean Architecture - Inherit Design Beyond Structure
seike460
PRO
3
790
LT 2025-06-30: プロダクトエンジニアの役割
yamamotok
0
850
NPOでのDevinの活用
codeforeveryone
0
890
AI時代の『改訂新版 良いコード/悪いコードで学ぶ設計入門』 / ai-good-code-bad-code
minodriven
23
9.4k
ソフトウェア品質を数字で捉える技術。事業成長を支えるシステム品質の マネジメント
takuya542
2
15k
チームのテスト力を総合的に鍛えて品質、スピード、レジリエンスを共立させる/Testing approach that improves quality, speed, and resilience
goyoki
5
1.1k
ISUCON研修おかわり会 講義スライド
arfes0e2b3c
1
460
ふつうの技術スタックでアート作品を作ってみる
akira888
1
1.3k
明示と暗黙 ー PHPとGoの インターフェイスの違いを知る
shimabox
2
610
20250708_JAWS_opscdk
takuyay0ne
2
130
Featured
See All Featured
Producing Creativity
orderedlist
PRO
346
40k
GraphQLの誤解/rethinking-graphql
sonatard
71
11k
We Have a Design System, Now What?
morganepeng
53
7.7k
The Power of CSS Pseudo Elements
geoffreycrofte
77
5.9k
Being A Developer After 40
akosma
90
590k
Thoughts on Productivity
jonyablonski
69
4.7k
Easily Structure & Communicate Ideas using Wireframe
afnizarnur
194
16k
Performance Is Good for Brains [We Love Speed 2024]
tammyeverts
10
970
The MySQL Ecosystem @ GitHub 2015
samlambert
251
13k
Product Roadmaps are Hard
iamctodd
PRO
54
11k
Helping Users Find Their Own Way: Creating Modern Search Experiences
danielanewman
29
2.7k
Save Time (by Creating Custom Rails Generators)
garrettdimon
PRO
31
1.3k
Transcript
Scrapy an overview
/skræpi/
Web Crawler vs. Web Scraper
None
None
Scrapy Framework Scraping / Crawling / Monitoring / Testing
Stable Active Large community
~200 pages of docs
Commercial support
Framework?
None
None
None
Twisted event loop (reactor)
None
Your code goes here
The scraping logic
None
HttpErrorMiddleware UrlLengthMiddleware DepthMiddleware
HttpProxyMiddleware HttpCacheMiddleware RedirectMiddleware
Media download Persistence Post-processing
Data flow control
Queuing
Talk is cheap, show me the code.
$ pip install Scrapy $ scrapy startproject home_news
home_news/ scrapy.cfg home_news/ __init__.py items.py pipelines.py settings.py spiders/ __init__.py ...
home_news/ scrapy.cfg home_news/ __init__.py items.py pipelines.py settings.py spiders/ __init__.py ...
Project root
home_news/ scrapy.cfg home_news/ __init__.py items.py pipelines.py settings.py spiders/ __init__.py ...
Project config
home_news/ scrapy.cfg home_news/ __init__.py items.py pipelines.py settings.py spiders/ __init__.py ...
Project module
home_news/ scrapy.cfg home_news/ __init__.py items.py pipelines.py settings.py spiders/ __init__.py ...
Your items
home_news/ scrapy.cfg home_news/ __init__.py items.py pipelines.py settings.py spiders/ __init__.py ...
Your pipelines
home_news/ scrapy.cfg home_news/ __init__.py items.py pipelines.py settings.py spiders/ __init__.py ...
Your settings
home_news/ scrapy.cfg home_news/ __init__.py items.py pipelines.py settings.py spiders/ __init__.py ...
Your spiders...
None
//*[@id="glbcorpo"]/div/div[1]/div[1]/div[2]/div[1]/div[1]/div/div/a/@href
//*[@id="glbmateria"]/div[2]/h1/text()
//*[@id="materialetra"]/div/div/p[1]/text()
None
$ pwd /home/caco/studies/scrapy_news/home_news
$ pwd /home/caco/studies/scrapy_news/home_news (project root)
$ pwd /home/caco/studies/scrapy_news/home_news $ scrapy crawl g1 -o scraped_data.json -t
json
$ pwd /home/caco/studies/scrapy_news/home_news $ scrapy crawl g1 -o scraped_data.json -t
json
$ pwd /home/caco/studies/scrapy_news/home_news $ scrapy crawl g1 -o scraped_data.json -t
json
$ pwd /home/caco/studies/scrapy_news/home_news $ scrapy crawl g1 -o scraped_data.json -t
json (feed exporters: json,csv,xml)
None
None
None
Other nice features • scrapyd: run as a service •
Webservice (issue commands via http requests) • Signals • Stats module • Contribs (CrawlSpider etc)
Obrigado! @cacovsky Thanks! @cacovsky
Images Spatula http://www.duebuoi.it/x/uk_usd/catalog/p/spatulas~805-16x10.html Spiderman http://tincan21.deviantart.com/art/muro-spidey-307810412