Shinichi Nakagawa
June 26, 2021
requests-html as scraping for baseball, not for humans / HTML Parsing for Baseball Player
Lightning talk for kawasaki.rb #097, the online LT meetup marking the start of its 9th year
#Python #requests-html #Web #Baseball
Transcript
requests-html as scraping for baseball, not for humans
Or: "Getting Started with Machine Learning through Baseball, Chapter 2"
Shinichi Nakagawa (@shinyorke)
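Today's talk
• I built a crawler with requests-html to collect prediction data for a baseball AI, and published it (rough as it is)
• A data-collection service that comes together quickly with Cloud Functions + Pub/Sub + Scheduler
• I have used Scrapy and various other tools, but these days I would reach for requests-html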
Who am I?
• Shinichi Nakagawa (@shinyorke)
• Senior Engineer at JX Press Corporation
• Baseball data scientist
• The person who ends up calling #kwskrb "#kwskpy"
• Congratulations on the 9th anniversary of #kwskrb 🎉
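Today's talk is the technical side of this post
"A baseball AI picks the 24 members of Samurai Japan for TOKYO 2020 - selected by machine learning, without bias."
https://shinyorke.hatenablog.com/entry/tokyo2020-samurai-japan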
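Selecting Samurai Japan with a baseball AI
1. Built a performance-prediction model for baseball players using open Major League data
2. Fed the model from step 1 the stats of "almost" every 2021 NPB player and predicted their 2021 performance
3. Ranked players by predicted OPS (batters) and FIP (pitchers), then adjusted for position and for left/right handedness in pitching and batting to select 24 players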
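Step 3 above can be sketched as follows. This is a minimal illustration under stated assumptions, not the author's code: the player names and numbers are made up, the `Player` type and `pick_roster` function are hypothetical, and the position and handedness adjustment is reduced to a simple per-position cap. OPS is on-base percentage plus slugging (higher is better); FIP is a pitching metric where lower is better.

```python
from dataclasses import dataclass


@dataclass
class Player:
    name: str
    position: str      # e.g. "P", "C", "OF"
    predicted: float   # predicted OPS for batters, predicted FIP for pitchers


def pick_roster(players, pitcher_slots, batter_slots, max_per_position):
    """Rank by predicted metric, then fill slots while capping each batting position."""
    # Lower FIP is better for pitchers, so sort ascending.
    pitchers = sorted((p for p in players if p.position == "P"),
                      key=lambda p: p.predicted)[:pitcher_slots]

    # Higher OPS is better for batters, so sort descending.
    batters = sorted((p for p in players if p.position != "P"),
                     key=lambda p: p.predicted, reverse=True)

    chosen, per_position = [], {}
    for b in batters:
        if len(chosen) == batter_slots:
            break
        if per_position.get(b.position, 0) >= max_per_position:
            continue  # position already full; move on to the next-best batter
        per_position[b.position] = per_position.get(b.position, 0) + 1
        chosen.append(b)
    return pitchers + chosen


# Made-up sample data, for illustration only.
sample = [
    Player("Pitcher A", "P", 2.80),
    Player("Pitcher B", "P", 3.40),
    Player("Pitcher C", "P", 2.50),
    Player("Catcher A", "C", 0.780),
    Player("Catcher B", "C", 0.820),
    Player("Outfielder A", "OF", 0.910),
    Player("Outfielder B", "OF", 0.850),
]

roster = pick_roster(sample, pitcher_slots=2, batter_slots=2, max_per_position=1)
print([p.name for p in roster])
```

The per-position cap is what keeps the roster from being filled entirely by the best hitters at one position; a real selection would also need the handedness balance mentioned in step 3.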
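Requirements for the prediction data (= gathering features)
• Basic stats for pitchers and batters (batting figures, ERA, home runs allowed, etc…)
• Position played. Ideally, the number of appearances as a starter.
• Data that provides all of the above, with no problems in either data structure or licensing, turned out to exist in the US: Baseball Reference.
• https://www.baseball-reference.com/register/league.cgi?id=16632292 https://www.baseball-reference.com/register/league.cgi?id=0549ac26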
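Building a crawler the scrappy way with requests-html
• (Separately from the baseball AI work) I wondered aloud in my times channel on the company Slack what people use for crawlers these days 🤔, and was recommended requests-html
• I tried it and it did feel good -> before I knew it, requests-html had become my main crawler tool
• The baseball data collection described earlier was also built with requests-html
https://github.com/Shinichi-Nakagawa/br-scraping-npb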
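What I liked about requests-html
• Simply easy to use (roughly speaking)
• The baseball pages were heavily JavaScript-driven, but a single render() call gave me the result as HTML
• Whether it is really "for humans" is debatable, but as a tool I think it is a good one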
JS -> HTML took just this
# For each team, save position players and pitchers separately
for team in teams:
    response = session.get(team['url'])
    response.html.render(timeout=60)  # the JavaScript is rendered to HTML here
    tbody = response.html.find('#team_batting > tbody', first=True)
    batters = players(tbody)
    write_csv(f'dataset/player_batter_{team["team"].replace(" ", "")}.csv', batters, fieldnames)
    tbody = response.html.find('#team_pitching > tbody', first=True)
    pitchers = players(tbody)
    write_csv(f'dataset/player_pitcher_{team["team"].replace(" ", "")}.csv', pitchers, fieldnames)
https://github.com/Shinichi-Nakagawa/br-scraping-npb/blob/main/players.py#L28
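The `players` and `write_csv` helpers are not shown on the slide. As a rough idea of what such helpers involve, here is a standard-library-only sketch that turns table rows into dicts and writes selected columns as CSV. Everything in it is an assumption for illustration: the sample markup, the use of a `data-stat` attribute as the column key, and the names `RowParser` and `rows_to_csv` are hypothetical and not taken from the repository, which works on requests-html elements instead.

```python
import csv
import io
from html.parser import HTMLParser


class RowParser(HTMLParser):
    """Collect each <tr> as a dict keyed by the cell's data-stat attribute."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # dict for the row being read, or None outside a row
        self._key = None   # data-stat of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = {}
        elif tag in ("td", "th") and self._row is not None:
            self._key = dict(attrs).get("data-stat")
            if self._key is not None:
                self._row[self._key] = ""

    def handle_data(self, data):
        # Text inside nested tags (e.g. a link) still belongs to the open cell.
        if self._row is not None and self._key is not None:
            self._row[self._key] += data.strip()

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._key = None
        elif tag == "tr" and self._row is not None:
            if self._row:
                self.rows.append(self._row)
            self._row = None


def rows_to_csv(rows, fieldnames, out):
    """Write the selected columns of each row as CSV to a file-like object."""
    writer = csv.DictWriter(out, fieldnames=fieldnames, extrasaction="ignore")
    writer.writeheader()
    writer.writerows(rows)


# Made-up markup standing in for a rendered stats table body.
html = """
<tbody>
  <tr><th data-stat="player"><a href="#">Batter A</a></th><td data-stat="HR">12</td><td data-stat="age">27</td></tr>
  <tr><th data-stat="player">Batter B</th><td data-stat="HR">3</td><td data-stat="age">31</td></tr>
</tbody>
"""

parser = RowParser()
parser.feed(html)
buffer = io.StringIO()
rows_to_csv(parser.rows, ["player", "HR"], buffer)
print(buffer.getvalue())
```

Passing `extrasaction="ignore"` lets the same row dicts feed different CSV layouts, which mirrors how the slide's loop reuses one `fieldnames` list for both batters and pitchers.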
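Running it as a periodically scheduled crawler
• The AI Samurai Japan work was a one-off project, so set that aside
• I also have data that I personally collect on a regular basis: scraping sites, then posting to Slack or saving to BigQuery
• I run the requests-html code on GCF + Pub/Sub + Scheduler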
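The GCF + Pub/Sub + Scheduler combination typically works like this: Cloud Scheduler publishes a message to a Pub/Sub topic on a cron schedule, and the topic triggers a Cloud Function. The sketch below shows an entry point in the `(event, context)` background-function style, where the Pub/Sub message body arrives base64-encoded in `event["data"]`. The payload shape (`{"target": ...}`), the names `crawl` and `main`, and the stubbed-out crawl are assumptions for illustration; the actual deployment is not shown in the slides.

```python
import base64
import json


def crawl(target):
    """Hypothetical stand-in for the requests-html crawl; returns a summary dict."""
    return {"target": target, "status": "done"}


def main(event, context):
    """Entry point for a Pub/Sub-triggered function.

    event["data"] carries the Pub/Sub message body, base64-encoded.
    """
    raw = event.get("data")
    payload = json.loads(base64.b64decode(raw).decode("utf-8")) if raw else {}
    target = payload.get("target", "default")
    result = crawl(target)
    print(json.dumps(result))  # stdout would show up in the function's logs
    return result


# Local smoke test: build the same envelope Pub/Sub would deliver.
message = base64.b64encode(
    json.dumps({"target": "team_batting"}).encode("utf-8")
).decode("ascii")
outcome = main({"data": message}, context=None)
```

Keeping the crawl behind a plain function makes it easy to run locally without any cloud setup, and the scheduler message can carry which site or table to collect.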
I actually run this in production
"Key points for using GCP in small product development - how I launched a personal product in three days"
https://shinyorke.hatenablog.com/entry/gcp-slack-taida
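Wrap-up
• For crawler development in Python today, I recommend requests-html
• It cannot do everything the way Scrapy can, but the initial cost of adopting it is low, so it is a good first choice.
• It seems well suited to running casually on Google Cloud Functions or (though I have not used it) AWS Lambda. Concrete examples will go on the blog eventually.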
Game set!