Scrapy Overview
JusBrasil
April 12, 2013
Programming
An overview of the Scrapy framework by @cacovsky
Transcript
Scrapy an overview
/skræpi/
Web Crawler vs. Web Scraper
Scrapy Framework: Scraping / Crawling / Monitoring / Testing
Stable, active, large community
~200 pages of docs
Commercial support
Framework?
Twisted event loop (reactor)
Your code goes here
The scraping logic
HttpErrorMiddleware UrlLengthMiddleware DepthMiddleware
HttpProxyMiddleware HttpCacheMiddleware RedirectMiddleware
Media download Persistence Post-processing
Data flow control
Queuing
Talk is cheap, show me the code.
$ pip install Scrapy
$ scrapy startproject home_news
home_news/                 (project root)
    scrapy.cfg             (project config)
    home_news/             (project module)
        __init__.py
        items.py           (your items)
        pipelines.py       (your pipelines)
        settings.py        (your settings)
        spiders/           (your spiders...)
            __init__.py
            ...
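As a rough illustration of what could live in pipelines.py, here is a minimal sketch of a post-processing pipeline. Scrapy calls a pipeline's process_item(item, spider) method for each scraped item; the class name CleanWhitespacePipeline and the "title" field are hypothetical, not taken from the deck.

```python
class CleanWhitespacePipeline:
    """Hypothetical post-processing step: normalize whitespace in string fields."""

    def process_item(self, item, spider):
        # Scrapy passes each scraped item through process_item and
        # expects the (possibly modified) item to be returned.
        for key, value in list(item.items()):
            if isinstance(value, str):
                # Collapse runs of whitespace and trim both ends.
                item[key] = " ".join(value.split())
        return item


# Standalone check with a plain dict and no running crawler (spider=None).
cleaned = CleanWhitespacePipeline().process_item(
    {"title": "  Example \n headline "}, spider=None
)
```

To take effect in a real project, a pipeline like this would also have to be listed in the ITEM_PIPELINES setting in settings.py.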
//*[@id="glbcorpo"]/div/div[1]/div[1]/div[2]/div[1]/div[1]/div/div/a/@href
//*[@id="glbmateria"]/div[2]/h1/text()
//*[@id="materialetra"]/div/div/p[1]/text()
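To see how expressions like these pick out a headline and a first paragraph, here is a small sketch using Python's standard-library ElementTree rather than Scrapy's own selectors. The HTML fragment is made up for illustration; only the ids glbmateria and materialetra come from the slides. ElementTree supports just a subset of XPath, so the text() step is replaced by reading .text on the matched element.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed fragment standing in for an article page.
PAGE = """
<html>
  <body>
    <div id="glbmateria">
      <div>header</div>
      <div><h1>Example headline</h1></div>
    </div>
    <div id="materialetra">
      <div><div><p>First paragraph.</p><p>Second paragraph.</p></div></div>
    </div>
  </body>
</html>
"""

root = ET.fromstring(PAGE)

# Equivalent of //*[@id="glbmateria"]/div[2]/h1/text()
title = root.find(".//*[@id='glbmateria']/div[2]/h1").text

# Equivalent of //*[@id="materialetra"]/div/div/p[1]/text()
lead = root.find(".//*[@id='materialetra']/div/div/p[1]").text
```

Real pages are rarely well-formed XML, which is one reason to rely on Scrapy's selectors in an actual spider. An attribute step such as @href would be read here with .get("href") on the matched element.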
$ pwd
/home/caco/studies/scrapy_news/home_news
(project root)
$ scrapy crawl g1 -o scraped_data.json -t json
(feed exporters: json, csv, xml)
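The choice of feed exporter only changes how the same scraped items are serialized. The sketch below illustrates that idea with the standard-library json and csv modules; it does not use Scrapy's exporters, and the items and their field names ("title", "url") are assumptions made up for the example.

```python
import csv
import io
import json

# Hypothetical scraped items; field names are assumptions, not from the deck.
items = [
    {"title": "First headline", "url": "http://example.com/1"},
    {"title": "Second headline", "url": "http://example.com/2"},
]


def to_json(rows):
    """Serialize the items as a single JSON array."""
    return json.dumps(rows, ensure_ascii=False)


def to_csv(rows):
    """Serialize the items as CSV, with a header row taken from the first item."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=list(rows[0].keys()), lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


json_feed = to_json(items)
csv_feed = to_csv(items)
```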
Other nice features
• scrapyd: run as a service
• Webservice (issue commands via HTTP requests)
• Signals
• Stats module
• Contribs (CrawlSpider etc.)
Thanks! @cacovsky
Images
Spatula: http://www.duebuoi.it/x/uk_usd/catalog/p/spatulas~805-16x10.html
Spiderman: http://tincan21.deviantart.com/art/muro-spidey-307810412