Upgrade to Pro
— share decks privately, control downloads, hide ads and more …
Speaker Deck
Features
Speaker Deck
PRO
Sign in
Sign up for free
Search
Search
人間じゃなくて野球のためのスクレイピングとしてのrequests-html / HTML Pa...
Search
Shinichi Nakagawa
PRO
June 26, 2021
Programming
1
320
人間じゃなくて野球のためのスクレイピングとしてのrequests-html / HTML Parsing for Baseball Player
kawasaki.rb #097 9年目突入LT大会 (オンライン) 記念LT
#Python #requests-html #Web #Baseball
Shinichi Nakagawa
PRO
June 26, 2021
Tweet
Share
More Decks by Shinichi Nakagawa
See All by Shinichi Nakagawa
AI・LLM事業部のSREとタスクの自動運転
shinyorke
PRO
0
300
実践Dash - 手を抜きながら本気で作るデータApplicationの基本と応用 / Dash for Python and Baseball
shinyorke
PRO
2
2.4k
Terraform, GitHub Actions, Cloud Buildでデータ基盤をProvisioningする / Data Platform provisioning for Google Cloud and Terraform
shinyorke
PRO
2
3.1k
Cloud RunとCloud PubSubでサーバレスなデータ基盤2024 with Terraform / Cloud Run and PubSub with Terraform
shinyorke
PRO
9
3.7k
自らを強いエンジニアにするための3つの習慣 / I need to be myself, I can't be no one else
shinyorke
PRO
82
83k
阪神タイガース優勝のひみつ - Pythonでシュッと調べた件 / SABRmetrics for Python
shinyorke
PRO
1
1.4k
Pythonとクラウドと野球の推し活. / Baseball Data Platform for Python and Google Cloud
shinyorke
PRO
2
2.8k
月額コーヒー3.34杯分のコストでオオタニサンの活躍を見守るデータ基盤のはなし / Pyhack Con
shinyorke
PRO
2
500
俺のDXを実現するためのサーバレスなデータ基盤開発と運用 / Serverless Data Platform and Baseball
shinyorke
PRO
5
12k
Other Decks in Programming
See All in Programming
goにおける コネクションプールの仕組み を軽く掘って見た
aronokuyama
0
140
Boost Your Performance and Developer Productivity with Jakarta EE 11
ivargrimstad
0
150
読もう! Android build ドキュメント
andpad
1
240
Go1.24で testing.B.Loopが爆誕
kuro_kurorrr
0
160
複数ドメインに散らばってしまった画像…! 運用中のPHPアプリに後からCDNを導入する…!
suguruooki
0
430
Denoでフロントエンド開発 2025年春版 / Frontend Development with Deno (Spring 2025)
petamoriken
1
1.3k
OpenTelemetryを活用したObservability入門 / Introduction to Observability with OpenTelemetry
seike460
PRO
0
340
小さく段階的リリースすることで深夜メンテを回避する
mkmk884
2
130
ミリしらMCP勉強会
watany
2
420
データベースエンジニアの仕事を楽にする。PgAssistantの紹介
nnaka2992
9
4.3k
Go1.24 go vetとtestsアナライザ
kuro_kurorrr
2
480
RubyKaigiで手に入れた HHKB Studioのための HIDRawドライバ
iberianpig
0
1k
Featured
See All Featured
Performance Is Good for Brains [We Love Speed 2024]
tammyeverts
8
700
I Don’t Have Time: Getting Over the Fear to Launch Your Podcast
jcasabona
32
2.2k
Helping Users Find Their Own Way: Creating Modern Search Experiences
danielanewman
29
2.5k
YesSQL, Process and Tooling at Scale
rocio
172
14k
A Modern Web Designer's Workflow
chriscoyier
693
190k
How to train your dragon (web standard)
notwaldorf
91
5.9k
The MySQL Ecosystem @ GitHub 2015
samlambert
251
12k
The Pragmatic Product Professional
lauravandoore
33
6.5k
Become a Pro
speakerdeck
PRO
27
5.2k
Statistics for Hackers
jakevdp
798
220k
Being A Developer After 40
akosma
90
590k
Large-scale JavaScript Application Architecture
addyosmani
511
110k
Transcript
ਓؒ͡Όͳͯ͘ ٿͷͨΊͷ εΫϨΠϐϯάͱͯ͠ͷ requests-html ͘͠ʮٿͰ͡ΊΔػցֶशୈೋষʯ Shinichi Nakagawa(@shinyorke)
ࠓͷ͓ͳ͠⽁ • ⚾AIͷ༧ଌσʔλΛಘΔͨΊͷΫϩʔϥʔΛ requests-htmlͰ։ൃ&ʢࡶͰ͕͢ʣެ։ͨ͠ • Cloud Functions + Pub/Sub
+ SchedulerͰ ͬ͘͞ΓͰ͖ͪΌ͏ऩूαʔϏε • Scrapyͱ͔৭ʑ͚ͬͨͲࠓͩͱrequests-html͔ͳ͋
Who am I?ʢ͓લ୭Αʣ • Shinichi Nakagawa(@shinyorke) • JX௨৴ࣾγχΞΤϯδχΞ • ٿσʔλαΠΤϯςΟετ
• #kwskrb Λ #kwskpy ͱ͔ݴͬͯ͠·͏ਓ • #kwskrb 9प͓ΊͰͱ͏͍͟͝·͢🎉
͜Εͷٕज़తͳωλ͕ࠓͷ ٿAI͕બͿTOKYO 2020ࣆJAPAN24໊ - ػցֶशͰແ͘બΜͰΈͨ. https://shinyorke.hatenablog.com/entry/tokyo2020-samurai-japan
ٿAIʹΑΔࣆδϟύϯબग़ 1.ϝδϟʔϦʔάͷΦʔϓϯσʔλΛͬͯ ٿબखͷ༧ଌϞσϧΛ։ൃ 2.1.ͷ༧ଌϞσϧʹ2021ϓϩٿʮ΄΅ʯશબखͷΛ ৯Θͤͯ2021ͷΛউखʹ༧ଌ 3.༧ଌͷOPSʢଧऀʣ, FIPʢखʣͰྑ͔ͬͨॱ
&ϙδγϣϯɾଧͷࠨӈΛௐͯ͠24໊Λબग़
None
༧ଌσʔλͷ݅ʢ=ಛྔूΊʣ • खɾଧऀͷجຊతͳʢଧ, ଧ, ޚ, ඃຊྥଧetc…ʣ • ग़ϙδγϣϯ. Ͱ͖Εελϝϯͱͯ͠ͷճ͕·͍͠. •
্هΛσʔλߏɾϥΠηϯεڞʹͳ͘ΕΔσʔλ͕ ΞϝϦΧʹ͋ͬͨ, Baseball Referenceͬͯͭ. • https://www.baseball-reference.com/register/league.cgi?id=16632292 https://www.baseball-reference.com/register/league.cgi?id=0549ac26
requests-htmlͰటष͘ΫϩʔϥʔΛ࡞Δ • ʢٿAIͷ݅ͱผͷͰʣࠓͲ͖ͷΫϩʔϥʔͬͯ🤔 ͱ, ࣗࣾSlackͷtimesνϟϯωϧͰᄁ͍ͨΒrequests-htmlΛ קΊΒΕͨ • ৮ͬͨΒ͔֬ʹ͍͍ײͩͬͨ͡
-> ؾ͕͚ͭΫϩʔϥʔ requests-htmlϝΠϯʹ • ઌड़ͷٿσʔλऩूrequests-htmlͰ࡞ͬͨ https://github.com/Shinichi-Nakagawa/br-scraping-npb
requests-htmlͷྑ͔ͬͨͱ͜Ζ • γϯϓϧʹ͍͍͢ʢࡶʣ • ٿͷϖʔδ͕JSΰϦΰϦͷهड़͕ͩͬͨ render()ҰൃͰHTMLͱͯ͠औΕͨ • ਓؒΒ͍͔͠Ͳ͏͔ո͍͚͠Ͳ
खஈͱͯ͠ྑ͍ͷͰͳ͍Ͱ͠ΐ͏͔
JS->HTML͕͜ΕͰࡁΜͩ # νʔϜ͝ͱ, खͱख, ͚ͯอଘ for team in teams :
response = session.get(team['url'] ) response.html.render(timeout=60) # ίίͰJS͕HTMLʹϨϯμϦϯά͞ΕΔ tbody = response.html.find('#team_batting > tbody', first=True ) batters = players(tbody ) write_csv(f'dataset/player_batter_{team["team"].replace(" ", "")}.csv', batters, fieldnames ) tbody = response.html.find('#team_pitching > tbody', first=True ) pitchers = players(tbody ) write_csv(f'dataset/player_pitcher_{team["team"].replace(" ", "")}.csv', pitchers, fieldnames ) https://github.com/Shinichi-Nakagawa/br-scraping-npb/blob/main/players.py#L28
ఆظతʹಈ͔͢Ϋϩʔϥʔͱͯ͠ӡ༻ • AIࣆJAPANҰճϙοΩϦͷϓϩδΣΫτͳͷͰ͍͍ͱͯ͠ • ݸਓతʹຖूΊͯΔσʔλ͕͋ͬͨΓ͢Δ αΠτऩूͯ͠SlackʹͭͿ͔ͤͨΓBigQueryʹอଘͨ͠Γ • requests-htmlΛͬͨίʔυΛ
GCF + Pub/Sub + SchedulerͰӡ༻
࣮ࡍӡ༻͍ͯ͠·͢ খ͍͞ϓϩμΫτ։ൃʹ͓͚ΔGCPར༻ͷצͲ͜Ζ - ݸਓతͳϓϩμΫτΛࡾͰϩʔϯνͨ͠ https://shinyorke.hatenablog.com/entry/gcp-slack-taida
݁ͼ • ࠓͲ͖ͷPythonͷΫϩʔϥʔ։ൃ, requests-html͕ͤ • ScrapyΈ͍ͨʹԿͰग़དྷΔΘ͚͡Όͳ͍͚Ͳ ॳखͷಋೖίετͱ͔͍͠Φεεϝ. • Google
Cloud Functionsʢͬͯͳ͍͚ͲʣAWS LambdaͰ ࡶʹӡ༻͢Δͷʹ߹ͬͯΔͱࢥΘΕ. ۩ମྫ͍ͣΕϒϩάʹ.
ήʔϜηοτ⽁