Upgrade to Pro
— share decks privately, control downloads, hide ads and more …
Speaker Deck
Features
Speaker Deck
PRO
Sign in
Sign up for free
Search
Search
Scrapy Overview
Search
Sponsored
·
SiteGround - Reliable hosting with speed, security, and support you can count on.
→
JusBrasil
April 12, 2013
Programming
200
2
Share
Scrapy Overview
An overview of the Scrapy framework by @cacovsky
JusBrasil
April 12, 2013
Other Decks in Programming
See All in Programming
Stage 3 Decorators でできること / できないこと / TSKaigi 2026
susisu
1
1.1k
プロパティの順序で型推論が壊れる!? TypeScript6.0の修正からContext-Sensitivityの仕組みを追う
bicstone
2
1k
The Past, Present, and Future of Enterprise Java
ivargrimstad
0
310
Agent Skills を社内で育てる仕組み作り
jackchuka
1
2.4k
開発とはなにか、Essenceカーネルで見えるもの
ukin0k0
0
210
oxlintはeslint/typescript-eslintを置き換えられるのか
shomafujita
2
210
サーバーレスで作る、動画データ管理基盤
oyasumipants
0
260
SPMマルチモジュールで テストカバレッジを取得する技法
yosshi4486
0
120
AI Agent と正しく分析するための環境作り
yoshyum
2
610
Inside Stream API
skrb
1
160
ReactとSvelteのその先、Ripple-TS / Beyond React and Svelte: Ripple-TS
ssssota
3
830
LLM Plugin for Node-REDの利用方法と開発について
404background
0
110
Featured
See All Featured
The Invisible Side of Design
smashingmag
302
52k
Efficient Content Optimization with Google Search Console & Apps Script
katarinadahlin
PRO
1
570
Sam Torres - BigQuery for SEOs
techseoconnect
PRO
0
270
We Are The Robots
honzajavorek
0
230
<Decoding/> the Language of Devs - We Love SEO 2024
nikkihalliwell
1
220
brightonSEO & MeasureFest 2025 - Christian Goodrich - Winning strategies for Black Friday CRO & PPC
cargoodrich
3
710
How To Speak Unicorn (iThemes Webinar)
marktimemedia
1
460
Technical Leadership for Architectural Decision Making
baasie
3
380
Distributed Sagas: A Protocol for Coordinating Microservices
caitiem20
333
22k
Crafting Experiences
bethany
1
160
We Have a Design System, Now What?
morganepeng
55
8.1k
Gemini Prompt Engineering: Practical Techniques for Tangible AI Outcomes
mfonobong
2
410
Transcript
Scrapy an overview
/skræpi/
Web Crawler vs. Web Scraper
None
None
Scrapy Framework Scraping / Crawling / Monitoring / Testing
Stable Active Large community
~200 pages of docs
Commercial support
Framework?
None
None
None
Twisted event loop (reactor)
None
Your code goes here
The scraping logic
None
HttpErrorMiddleware UrlLengthMiddleware DepthMiddleware
HttpProxyMiddleware HttpCacheMiddleware RedirectMiddleware
Media download Persistence Post-processing
Data flow control
Queuing
Talk is cheap, show me the code.
$ pip install Scrapy $ scrapy startproject home_news
home_news/ scrapy.cfg home_news/ __init__.py items.py pipelines.py settings.py spiders/ __init__.py ...
home_news/ scrapy.cfg home_news/ __init__.py items.py pipelines.py settings.py spiders/ __init__.py ...
Project root
home_news/ scrapy.cfg home_news/ __init__.py items.py pipelines.py settings.py spiders/ __init__.py ...
Project config
home_news/ scrapy.cfg home_news/ __init__.py items.py pipelines.py settings.py spiders/ __init__.py ...
Project module
home_news/ scrapy.cfg home_news/ __init__.py items.py pipelines.py settings.py spiders/ __init__.py ...
Your items
home_news/ scrapy.cfg home_news/ __init__.py items.py pipelines.py settings.py spiders/ __init__.py ...
Your pipelines
home_news/ scrapy.cfg home_news/ __init__.py items.py pipelines.py settings.py spiders/ __init__.py ...
Your settings
home_news/ scrapy.cfg home_news/ __init__.py items.py pipelines.py settings.py spiders/ __init__.py ...
Your spiders...
None
//*[@id="glbcorpo"]/div/div[1]/div[1]/div[2]/div[1]/div[1]/div/div/a/@href
//*[@id="glbmateria"]/div[2]/h1/text()
//*[@id="materialetra"]/div/div/p[1]/text()
None
$ pwd /home/caco/studies/scrapy_news/home_news
$ pwd /home/caco/studies/scrapy_news/home_news (project root)
$ pwd /home/caco/studies/scrapy_news/home_news $ scrapy crawl g1 -o scraped_data.json -t
json
$ pwd /home/caco/studies/scrapy_news/home_news $ scrapy crawl g1 -o scraped_data.json -t
json
$ pwd /home/caco/studies/scrapy_news/home_news $ scrapy crawl g1 -o scraped_data.json -t
json
$ pwd /home/caco/studies/scrapy_news/home_news $ scrapy crawl g1 -o scraped_data.json -t
json (feed exporters: json,csv,xml)
None
None
None
Other nice features • scrapyd: run as a service •
Webservice (issue commands via http requests) • Signals • Stats module • Contribs (CrawlSpider etc)
Obrigado! @cacovsky Thanks! @cacovsky
Images Spatula http://www.duebuoi.it/x/uk_usd/catalog/p/spatulas~805-16x10.html Spiderman http://tincan21.deviantart.com/art/muro-spidey-307810412