Can't we just say I was the first to build a crawler with headless Chrome?
yujiosaka
February 22, 2018
Transcript
Yuji Isobe
Can't we just say I was the first to build a crawler with headless Chrome?
Node Gakuen, 29th session
Project Manager at emin, @yujiosaka
https://speakerdeck.com/yujiosaka/hitasurale-sitedeipuraningu
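This time I built a crawler
✓ Why a crawler, at this late date?
✓ What I was aiming for when I built it
✓ What I was thinking about as I built it
✓ Crawlers from here on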
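Last year I did all sorts of things…
ECZine serialized column http://eczine.jp/article/detail/4869
Debut as an e-commerce expert http://amzn.asia/aOkwFjH
Then I ran into a problem (´・ω・`)
At work, people stopped seeing me as an engineer orz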
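The line you must not cross
Tuning for each individual client ← "Well, I am an AI engineer"
Going along on sales visits ← "Technical sales, I guess"
Proposing new products ← "That's BizDev, right?"
Starting to write sales materials ← "O-oh, okay…"
Starting to write press releases ← "That guy isn't an engineer anymore"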
If I can, I want to earn my living as an engineer for the rest of my life
Win back my dignity as an engineer at work
And then one day…
I learn about headless Chrome https://developers.google.com/web/updates/2017/04/headless-chrome?hl=ja
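What is headless Chrome?
✓ Chrome can be launched in headless mode
✓ Just add "--headless" to Chrome's launch options
✓ PhantomJS is the headless browser most people think of
✓ Runs fast and stably
✓ Quick to support new standards (ES2017 async/await is available)
✓ Main uses are test automation and dynamic crawlers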
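As a rough illustration of "just add --headless", the sketch below builds the argument list for starting Chrome from Node.js. The function names `headlessArgs` and `runHeadless` and the `chromePath` parameter are made up for this example; the binary path differs per OS and install, and `--disable-gpu` and `--dump-dom` are optional flags described in the Google article linked above.

```javascript
// Sketch only: builds the command-line arguments for headless Chrome.
const { spawn } = require('child_process');

// Returns the argument list; `dumpDom` adds --dump-dom, which prints the
// rendered DOM to stdout instead of opening a window.
function headlessArgs(url, { dumpDom = true } = {}) {
  const args = ['--headless', '--disable-gpu'];
  if (dumpDom) args.push('--dump-dom');
  args.push(url);
  return args;
}

// `chromePath` is an assumption: pass the Chrome binary of your own system.
// This helper is defined for illustration and is not called here.
function runHeadless(chromePath, url) {
  return spawn(chromePath, headlessArgs(url), { stdio: 'inherit' });
}

console.log(headlessArgs('https://example.com').join(' '));
```

Calling `runHeadless` with a real Chrome path would start the browser without a window and print the page's DOM.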
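Static crawlers vs. dynamic crawlers (* these names are my own)
✓ Static crawlers (wget, curl)
✓ Request only the document (the HTML file)
✓ Fast, because they only parse the file
✓ Do not work on SPA sites built with AngularJS, React, or Vue.js
✓ Dynamic crawlers (PhantomJS, headless Chrome)
✓ Load images, JavaScript, and CSS and go as far as rendering
✓ Generally slow, because they also execute JavaScript
✓ Work on SPA sites just as they do on conventional sites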
Chrome DevTools Protocol https://chromedevtools.github.io/devtools-protocol/
✓ The latest specification is a set of JSON files in the Chromium source
✓ It is copied to the GitHub repository once an hour
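To give a feel for the level this protocol works at: each command is a JSON message carrying an `id`, a `method`, and `params`, normally sent over a WebSocket to the browser. The sketch below only builds such messages and does not connect to anything. `Page.navigate` and `Runtime.evaluate` are real protocol methods; the helper `createCommandBuilder`, the URL, and the expression are placeholders for illustration.

```javascript
// Sketch: building Chrome DevTools Protocol command messages by hand.
// Every command needs a unique id so the response can be matched to it.
function createCommandBuilder() {
  let nextId = 1;
  return function command(method, params = {}) {
    return JSON.stringify({ id: nextId++, method, params });
  };
}

const command = createCommandBuilder();

// Placeholder URL and expression; a real client would send these strings
// over the browser's debugging WebSocket.
const navigate = command('Page.navigate', { url: 'https://example.com' });
const evaluate = command('Runtime.evaluate', { expression: 'document.title' });

console.log(navigate);
console.log(evaluate);
```

Tracking ids, matching responses, and waiting for page events all have to be handled by the caller, which is the kind of work a higher-level wrapper takes over.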
Benchmark https://hackernoon.com/benchmark-headless-chrome-vs-phantomjs-e7f44c6956c
RIP PhantomJS https://groups.google.com/forum/#!topic/phantomjs/9aI5d-LDuNE
If you are starting now, go with headless Chrome
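But the problems are piling up
✓ The API is too low-level and hard to work with
✓ The specification is still unstable, and keeping up with it is hard work
✓ You get caught by security blocks
✓ User protections such as Content Security Policy kick in
✓ You run into bugs casually
✓ The implementation of setRequestInterception is still experimental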
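GoogleChrome / puppeteer https://github.com/GoogleChrome/puppeteer
✓ Maintained by the Google Chrome team
✓ A wrapper that lets you drive headless Chrome through a high-level API
✓ v1.0.0 was released in January
✓ A Slack group has been set up, and responses are courteous and fast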
"Crawlers on headless Chrome" are totally all the rage~~~
Can't we just say I built the first one?
Then I noticed
puppeteer / examples https://github.com/GoogleChrome/puppeteer/tree/master/examples
It is all "I tried it" posts and explainers; few are of practical use
Let's build the first practical crawler on headless Chrome
Other reasons
✓ Existing crawlers did not support Promises
✓ There was no Node.js crawler that ran in a distributed environment
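Setting the goals
✓ Has the features a practical crawler needs
✓ Documentation is written in English
✓ Well covered by tests
✓ Runs in a distributed environment
✓ Keeps the API simple
✓ Gets listed in puppeteer / examples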
With this, I win back my dignity as an engineer
…
Done https://github.com/yujiosaka/headless-chrome-crawler
Goal achieved https://github.com/GoogleChrome/puppeteer/tree/master/examples
Reposted on Google Developers https://developers.google.com/web/tools/puppeteer/examples
Traffic went up and I got nervous
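Demo

const HCCrawler = require('headless-chrome-crawler');

HCCrawler.launch({
  maxDepth: …, // maximum depth to explore
  maxConcurrency: …, // maximum concurrency
  allowedDomains: ['www.emin.co.jp'], // domains that are allowed
  evaluatePage: () => $('title').text(), // function evaluated on the page
  onSuccess: result => { // function evaluated on success
    console.log(`${result.options.url}\t${result.result}`);
  },
})
  .then(async crawler => {
    crawler.queue('https://www.emin.co.jp');
    await crawler.onIdle();
    await crawler.close();
  });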
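How the crawler came together
The most minimal crawler
✓ "Crawling" and "scraping" are different things
✓ Crawling: finding links in HTML
✓ Scraping: finding the information you want in HTML
✓ Neither means much on its own
What do the two have in common?
Finding ___ in HTML
Isn't jQuery good enough for that?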
v1.0.0 release
jQuery: true, // automatically inject jQuery into the page
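example

const HCCrawler = require('headless-chrome-crawler');

HCCrawler.launch({
  jQuery: true,
  evaluatePage: () => $('title').text(),
  onSuccess: result => {
    console.log(`${result.options.url}\t${result.result}`);
  },
})
  .then(async crawler => {
    crawler.queue('https://www.emin.co.jp');
    await crawler.onIdle();
    await crawler.close();
  });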
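The idea that crawling and scraping are both "finding ___ in HTML" can be sketched without a browser. In the minimal example below, one helper serves both purposes and only the pattern changes. A regular expression stands in for jQuery purely to keep the example dependency-free; it is not a robust HTML parser, and `findAll`, the patterns, and the sample HTML are all made up for illustration.

```javascript
// Sketch: crawling and scraping as the same "find ___ in HTML" operation.
// Returns the first capture group of every match of `pattern` in `html`.
function findAll(html, pattern) {
  return Array.from(html.matchAll(pattern), (m) => m[1]);
}

const LINKS = /<a\s[^>]*href="([^"]*)"/g;  // crawling: where to go next
const TITLE = /<title>([^<]*)<\/title>/g;  // scraping: what to keep

// Hypothetical page content.
const html = `
<html><head><title>Example shop</title></head>
<body>
  <a href="/items/1">Item 1</a>
  <a class="x" href="/items/2">Item 2</a>
</body></html>`;

const links = findAll(html, LINKS);
const titles = findAll(html, TITLE);

console.log(links);
console.log(titles);
```

With jQuery injected into the page, the same two lookups become selector calls such as `$('a')` and `$('title')`, which is what makes a single `evaluatePage` function enough for scraping.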
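A crawler that doesn't get on your nerves
✓ If you are used to static crawlers, it feels really slow
✓ It is seriously demoralizing to find it quietly stopped on an error
Running in a distributed environment
✓ Use Redis for the task queue and the cache
✓ Share Redis across multiple servers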
v1.3.0 release
cache: new RedisCache(), // use Redis as the cache storage
example

const HCCrawler = require('headless-chrome-crawler');
// Module path assumed from the project README; it is not shown on the slide.
const RedisCache = require('headless-chrome-crawler/cache/redis');

HCCrawler.launch({
  cache: new RedisCache({ host: …, port: … }),
  evaluatePage: () => $('title').text(),
  onSuccess: result => {
    console.log(`${result.options.url}\t${result.result}`);
  },
})
  .then(async crawler => {
    crawler.queue('https://www.amazon.co.jp');
    await crawler.onIdle();
    await crawler.close();
  });
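Why a shared queue and cache let several servers cooperate can be shown with a small simulation. This is not the library's implementation: `SharedStore` is a hypothetical in-memory stand-in for the data Redis would hold, the `links` map is a made-up site, and round-robin turns stand in for real concurrency.

```javascript
// Sketch: several workers sharing one task queue and one "seen" cache.
class SharedStore {
  constructor() {
    this.queue = [];       // pending URLs (the task queue)
    this.seen = new Set(); // URLs already claimed (the cache)
  }

  // Enqueue a URL only if no worker has claimed it before.
  push(url) {
    if (this.seen.has(url)) return false;
    this.seen.add(url);
    this.queue.push(url);
    return true;
  }

  // Take the oldest pending URL, or undefined when the queue is empty.
  pop() {
    return this.queue.shift();
  }
}

// `links` maps each URL to the URLs found on that page (hypothetical data).
// Returns, per worker, the list of URLs that worker handled.
function crawlWithWorkers(links, start, workerCount) {
  const store = new SharedStore();
  const handled = Array.from({ length: workerCount }, () => []);
  store.push(start);
  let turn = 0;
  let url;
  while ((url = store.pop()) !== undefined) {
    const worker = turn++ % workerCount; // round-robin in place of real servers
    handled[worker].push(url);
    for (const next of links[url] || []) store.push(next);
  }
  return handled;
}

const links = { '/': ['/a', '/b'], '/a': ['/b', '/c'], '/b': ['/'], '/c': [] };
console.log(crawlWithWorkers(links, '/', 2));
```

Because every worker checks the same cache before enqueueing, each page is handled once in total, no matter which worker picks it up.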
Other features
✓ Breadth-first search (BFS) and depth-first search (DFS)
✓ Obeys robots.txt
✓ Follows XML sitemaps
✓ Device emulation
✓ Page screenshots
✓ JSON/CSV output
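The difference between the two search orders, and the effect of a depth limit, can be sketched on an in-memory link graph. This is an illustration, not the crawler's own code: `traverse`, the `depthFirst` option, and the `site` map are made up here, and counting the start page as depth 1 is an assumption.

```javascript
// Sketch: BFS vs. DFS over a link graph, with a maximum depth.
// `site` maps each URL to the links found on that page (hypothetical data).
function traverse(site, start, { maxDepth, depthFirst = false }) {
  const order = [];
  const seen = new Set([start]);
  const pending = [{ url: start, depth: 1 }]; // assumption: start is depth 1
  while (pending.length > 0) {
    // BFS takes the oldest pending page (queue); DFS takes the newest (stack).
    const { url, depth } = depthFirst ? pending.pop() : pending.shift();
    order.push(url);
    if (depth >= maxDepth) continue; // do not follow links past the limit
    for (const next of site[url] || []) {
      if (!seen.has(next)) {
        seen.add(next);
        pending.push({ url: next, depth: depth + 1 });
      }
    }
  }
  return order;
}

const site = {
  '/': ['/a', '/b'],
  '/a': ['/a/1'],
  '/b': ['/b/1'],
  '/a/1': [],
  '/b/1': [],
};

console.log(traverse(site, '/', { maxDepth: 3 }));
console.log(traverse(site, '/', { maxDepth: 3, depthFirst: true }));
```

BFS visits every page at one depth before going deeper, which suits wide site surveys; DFS follows one branch to the depth limit first.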
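Crawlers from here on
Current challenges
✓ Nobody is going to line up 100 servers just to crawl with this crawler, and it is a hassle
✓ I want a single command to deploy it to a distributed environment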
What is the Serverless Framework?
✓ Closer to a configuration-management "tool"
✓ Easily deploys to and runs on AWS Lambda, Azure Functions, and Google CloudFunctions
✓ Supports Node.js, Python, Java, Scala, C#, F#, Go, Groovy, Kotlin, PHP & Swift
✓ Plenty of handy plugins
v2.0.0 will be…
yarn (npm run) deploy: deploy to AWS Lambda
yarn (npm run) start: start crawling in parallel
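Can't we just say I was the first to build a crawler with headless Chrome?
Can't we just say I was the first to build a practical crawler with headless Chrome?
But honestly, I want to write more code at work
WE ARE HIRING https://www.emin.co.jp/blog/news/1527/ Sales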