Upgrade to Pro
— share decks privately, control downloads, hide ads and more …
Speaker Deck
Features
Speaker Deck
PRO
Sign in
Sign up for free
Search
Search
Scrapy Overview
Search
JusBrasil
April 12, 2013
Programming
2
190
Scrapy Overview
An overview of the Scrapy framework by @cacovsky
JusBrasil
April 12, 2013
Tweet
Share
Other Decks in Programming
See All in Programming
iOSアプリで測る!名古屋駅までの 方向と距離
ryunakayama
0
160
KawaiiLT 登壇資料 キャリアとモチベーション
hiiragi
0
160
In geheimer Mission: AI Agents entwickeln
joergneumann
0
110
Cursorを活用したAIプログラミングについて 入門
rect
0
190
generative-ai-use-cases(GenU)の推しポイント ~2025年4月版~
hideg
1
390
eBPF超入門「o11yに使える」とは (20250424_eBPF_o11y)
thousanda
1
120
「理解」を重視したAI活用開発
fast_doctor
0
300
REALITY コマンド作成チュートリアル
nishiuriraku
0
120
2025年のz-index設計を考える
tak_dcxi
11
4k
KANNA Android の技術的課題と取り組み
watabee
1
460
はじめてのPDFKit.pdf
shomakato
0
100
The New Developer Workflow: How AI Transforms Ideas into Code
danielsogl
0
120
Featured
See All Featured
Building Adaptive Systems
keathley
41
2.5k
Measuring & Analyzing Core Web Vitals
bluesmoon
7
420
Let's Do A Bunch of Simple Stuff to Make Websites Faster
chriscoyier
507
140k
Typedesign – Prime Four
hannesfritz
41
2.6k
Building an army of robots
kneath
305
45k
Site-Speed That Sticks
csswizardry
6
540
Imperfection Machines: The Place of Print at Facebook
scottboms
267
13k
Save Time (by Creating Custom Rails Generators)
garrettdimon
PRO
31
1.2k
RailsConf 2023
tenderlove
30
1.1k
A designer walks into a library…
pauljervisheath
205
24k
Easily Structure & Communicate Ideas using Wireframe
afnizarnur
194
16k
CoffeeScript is Beautiful & I Never Want to Write Plain JavaScript Again
sstephenson
160
15k
Transcript
Scrapy an overview
/skræpi/
Web Crawler vs. Web Scraper
None
None
Scrapy Framework Scraping / Crawling / Monitoring / Testing
Stable Active Large community
~200 pages of docs
Commercial support
Framework?
None
None
None
Twisted event loop (reactor)
None
Your code goes here
The scraping logic
None
HttpErrorMiddleware UrlLengthMiddleware DepthMiddleware
HttpProxyMiddleware HttpCacheMiddleware RedirectMiddleware
Media download Persistence Post-processing
Data flow control
Queuing
Talk is cheap, show me the code.
$ pip install Scrapy $ scrapy startproject home_news
home_news/ scrapy.cfg home_news/ __init__.py items.py pipelines.py settings.py spiders/ __init__.py ...
home_news/ scrapy.cfg home_news/ __init__.py items.py pipelines.py settings.py spiders/ __init__.py ...
Project root
home_news/ scrapy.cfg home_news/ __init__.py items.py pipelines.py settings.py spiders/ __init__.py ...
Project config
home_news/ scrapy.cfg home_news/ __init__.py items.py pipelines.py settings.py spiders/ __init__.py ...
Project module
home_news/ scrapy.cfg home_news/ __init__.py items.py pipelines.py settings.py spiders/ __init__.py ...
Your items
home_news/ scrapy.cfg home_news/ __init__.py items.py pipelines.py settings.py spiders/ __init__.py ...
Your pipelines
home_news/ scrapy.cfg home_news/ __init__.py items.py pipelines.py settings.py spiders/ __init__.py ...
Your settings
home_news/ scrapy.cfg home_news/ __init__.py items.py pipelines.py settings.py spiders/ __init__.py ...
Your spiders...
None
//*[@id="glbcorpo"]/div/div[1]/div[1]/div[2]/div[1]/div[1]/div/div/a/@href
//*[@id="glbmateria"]/div[2]/h1/text()
//*[@id="materialetra"]/div/div/p[1]/text()
None
$ pwd /home/caco/studies/scrapy_news/home_news
$ pwd /home/caco/studies/scrapy_news/home_news (project root)
$ pwd /home/caco/studies/scrapy_news/home_news $ scrapy crawl g1 -o scraped_data.json -t
json
$ pwd /home/caco/studies/scrapy_news/home_news $ scrapy crawl g1 -o scraped_data.json -t
json
$ pwd /home/caco/studies/scrapy_news/home_news $ scrapy crawl g1 -o scraped_data.json -t
json
$ pwd /home/caco/studies/scrapy_news/home_news $ scrapy crawl g1 -o scraped_data.json -t
json (feed exporters: json,csv,xml)
None
None
None
Other nice features • scrapyd: run as a service •
Webservice (issue commands via http requests) • Signals • Stats module • Contribs (CrawlSpider etc)
Obrigado! @cacovsky Thanks! @cacovsky
Images Spatula http://www.duebuoi.it/x/uk_usd/catalog/p/spatulas~805-16x10.html Spiderman http://tincan21.deviantart.com/art/muro-spidey-307810412