Upgrade to Pro — share decks privately, control downloads, hide ads and more …

特徴量エンジニアリングと野球選手の成績予測 - 野球ではじめる機械学習 / Baseball ...

特徴量エンジニアリングと野球選手の成績予測 - 野球ではじめる機械学習 / Baseball Player Performance Prediction Using Feature Engineering with Machine Learning and Python

PyCon JP 2020 8/28 「スポーツデータを用いた特徴量エンジニアリングと野球選手の成績予測 - PythonとRを行ったり来たり」登壇資料

https://pycon.jp/2020/timetable/?id=203110

#Baseball #SABRmetrics #Python #MachineLearning #Datascience #PyConJP

Avatar for Shinichi Nakagawa

Shinichi Nakagawa

August 28, 2020
Tweet

More Decks by Shinichi Nakagawa

Other Decks in Science

Transcript

  1. Baseball Player Performance Prediction Using Feature Engineering with Python ⁶

    R Shinichi Nakagawa(@shinyorke) PyCon JP 2020 Online 8/28 εϙʔπσʔλΛ༻͍ͨಛ௃ྔΤϯδχΞϦϯάͱ໺ٿબखͷ੒੷༧ଌ - PythonͱRΛߦͬͨΓདྷͨΓ
  2. Who am I ?ʢ͓લ୭Αʣ • Shinichi Nakagawaʢத઒ ৳Ұʣ • େ఍ͷSNSͰʮshinyorkeʢ͠ΜΑʔ͘ʣʯͱ໊৐͍ͬͯ·͢

    • JX Press Corporation Senior Engineer ʢJX௨৴ࣾ γχΞɾΤϯδχΞʣ • Baseball Engineer, Data Scientist ʢ໺ੜͷ໺ٿΤϯδχΞɾσʔλαΠΤϯςΟετʣ • #Python #DataScience #Baseball⚾ #SABRmetrics #σʔλج൫ #ٕज़ސ໰
  3. JX௨৴ࣾͱPython • αʔόʔαΠυ, ػցֶश, SREͳͲͳͲPythonΛੲ͔Β࢖ͬͯ·͢ • PyCon JP͸εϙϯαʔΛԿ౓͔΍ͬͯ·͢. 2016, 2017,

    2019, 2020(New!) • ࠓ೥͸͞ΒʹεϐʔΧʔ͕ೋਓʂ(@YAMITZKY, @shinyorke) • Techϒϩάؤுͬͯ·͢, ಡΜͰͶ https://tech.jxpress.net/
  4. Python΋͘΋ࣗ͘शࣨ #jisyupy • ʮٕज़ͱΠΠΰϋϯʯΛָ͠Έͳ͕Βʮֶࣗࣗशʯ͢Δձ. • 2017೥ʹ #rettypy ͱͯ͠ελʔτ, ͣͬͱΦʔΨφΠβʔͯ͠·͢. •

    2020೥͔ΒΦʔΨφΠθʔγϣϯมߋͱ͔Ͱͪΐͬͱ͚ͩຯม. • ݱ࣌఺Ͱ͸ΦϯϥΠϯɾෆఆظ։࠵, ͍ͣΕϦΞϧ΋΍Γ͍ͨ. • ࣍ճ͸9/19 https://jisyupy.connpass.com/event/186611/
  5. ಛ௃ྔΤϯδχΞϦϯάͱ⚾ • ਺஋ -> ਺஋ • ͦͷ··࢖͑ΔϞϊ͕ଟ͍. ྫ͑͹҆ଧ, ࢛ٿ, ࡾৼͳͲ.

    • ਖ਼نԽɾεέʔϦϯά͢Δ. RC, wRAA, wOBAͳͲͷηΠόʔϝτϦΫεࢦඪ. • ਺஋Ҏ֎ͷσʔλ -> ਺஋ • ར͖࿹, ଧ੮ͷࠨӈ, etc… • બखͷಛ௃ͱͳΔ༗ޮͳσʔλΛԿ͔͠ΒͷܗͰ਺஋Խ.
  6. ⚾ʹ͓͚Δಛ௃ྔͷߟ͑ํ - DIPS, RC, LWTS • DIPS: ౤ଧͷϓϨʔΛʮࣗ੹ʯʮଞ੹ʯʹ෼ྨ͠ѻ͏ʢԼਤΛࢀরʣ • RC:

    ಘ఺ೳྗΛʮʢग़ྥೳྗ + ਐྥೳྗʣ / ग़৔ػձʯͷϞσϧͰઆ໌͢Δ΍Γํ • LWTS: ϓϨʔͷҰͭҰͭΛʮಘ఺ʯʹ׵ࢉ͠, ౤ଧͷϓϨʔΛධՁ͢Δ • ʮηΠόʔϝτϦΫεʯͱ͍͏໺ٿͷ౷ܭϞσϧతͳߟ͑ํͰ͢&ৄ͘͠͸ࢲͷϒϩάʹͯ https://shinyorke.hatenablog.com/entry/sabr-metrics-batting-stats ಛ௃ ओͳࢦඪ ࣗ੹ ݸਓͷೳྗʹґଘ ύϫʔ εϐʔυ બٿ؟FUDʜ ຊྥଧ ࡾৼ ࢛ࢮٿ ଞ੹ આ໌ม਺͕ଟ͍ νʔϜ ٿ৔ ৹൑FUDʜ ࣗ੹఺ʢ๷ޚ཰ʣ ࣦࡦ
  7. ಛ௃ྔΫοΩϯάΤϯδχΞϦϯάૣݟද • Python, R, SQLͰ΍ΕΔ͜ͱʹେࠩφγʢಘҙෆಘҙ͸͋Δʣ • هड़ྔ, ؔ਺ͷ໨త, ॲཧ଎౓ʢ෼ࢄॲཧͷ༗ແʣͰ࢖͍෼͚ ࢀߟɿ

    https://shinyorke.hatenablog.com/entry/r-to-python ൺֱ߲໨ 1ZUIPO 3 42- هड़ྔ ʢಉ͡ࣄΛͨ͠ͱͯ͠ʣ Մ΋ͳ͘ෆՄ΋ͳ͠ σʔλΛѻ͏఺Ͱ͸ ൺֱతγϯϓϧ 1ZUIPO 3ͱൺ΂ ΍΍৑௕ ؔ਺ ʢܭࢉॲཧʣ 001తͳΞϓϩʔνଞ ϓϩάϥϚϒϧͰ͋Δ ਺ࣜతͳϞσϧΛ ਺ࣜͷ··Ͱ͖Δ ؔ਺ͰϐλΰϥεΠον ࢖͍ํΛؒҧ͑Δͱ஍ࠈ ॲཧ଎౓ ෼ࢄॲཧ ฒྻԽɾ෼ࢄॲཧͰ ૯߹తͳνϡʔχϯάڧ͍ ͋͘·Ͱܭࢉɾ౷ܭπʔϧ ॲཧ଎౓ɾ෼ࢄ͸΄Ͳ΄Ͳ %#ΤϯδϯʹΑͬͯ ଎౓ɾ෼ࢄͷߟ͕͑ҟͳΔ
  8. ໺ٿͰ͸͡ΊΔػցֶश⚾ 1. Planning - ௐࠪɾاը 2. Data Engineering - σʔλऔಘ

    3. Feature Engineering - ಛ௃ྔநग़ 4. Clustering - ΫϥελϦϯάʹΑΔ෼ྨ 5. Predict - ੒੷༧ଌ ࣅͨλΠτϧͷຊ͕͋Δͩͱ? ؾͷ͍ͤͰ͢Αؾͷ͍ͤʢখ੠ʣ
  9. Analyzing Baseball Data with R • ηΠόʔϝτϦΫεΛ࢖ͬͨ໺ٿσʔλ෼ੳʹ͓͚Δఆ൪ຊ • ΞϝϦΧൃͰ2018೥ʹSecond Editionൃද,

    ΋ͪΖΜӳޠ • ໊લͷ௨Γ, ⚾σʔλ෼ੳͷຊͰ, ίʔυ͸͢΂ͯR த਎Λཧղ͢ΔͨΊ, RͷίʔυΛಡΈͳ͕ΒPythonʹࣸܦ
  10. R -> Pythonʹࣸܦͨ݁͠Ռ • RͰ΍Ζ͏͕PythonͰ΍Ζ͏͕݁Ռ͸มΘΒͳ͍, ͱཧղ • ࠓޙ࢖͏ϥΠϒϥϦʢ&ࣗ෼ͷशख़౓ʣߟ͑ͨΒPython • ͔͠͠,

    RͰ਺ࣜͱ͔ΊͪΌཧղͰ͖ͨͷͰRʹײँ ͪͳΈʹ౰࣌ͷ࡞ۀϩάʢ2019/11ʹ࣮ࢪʣ͸ϒϩάʹͯ͠·͢ https://shinyorke.hatenablog.com/entry/r-to-python
  11. ⚾Data is Ͳ͜& • Lahman’s Baseball Database • MLBશબखͷ௨ࢉ੒੷ɾग़਎ͷσʔλ. CSV੡.

    • http://www.seanlahman.com/baseball-archive/statistics/ • https://github.com/chadwickbureau/baseballdatabank • Retrosheet • ࢼ߹৘ใΛଧ੮୯ҐͰه࿥͍ͯ͠Δσʔληοτ. • https://www.retrosheet.org/ • https://github.com/chadwickbureau/retrosheet ݩʑ, shinyorke͕PyCon JP 2014Ҏདྷ͓ੈ࿩ʹͳ͍ͬͯͨσʔλͰ͢&࠷ۙ͸GitHubʹ΋͋ͬͯศརʂ
  12. ʲࢀߟʳBigQueryͷίετपΓ͸& • ͋͘·Ͱࢲͷܦݧ্Ͱ͕͢, ݸਓ։ൃͰ࢖͏ఔ౓ͷσʔλྔͩͬͨΒແྉ࿮ͷൣғ಺Ͱ࢖͑·͢. ※਺10GBఔ౓, 1ΫΤϦ͋ͨΓ਺100MB͙Β͍ͷར༻ • جຊΛकΕ͹اۀϨϕϧͰ΋ޮ཰త͔ͭ௒ϥΫʹ࢖͑·͢. GCPެࣜ&৭Μͳਓ͕ݴٴ͍ͯ͠·͢. •

    ແବͳྻΛऔಘ͠ͳ͍, σʔλ͸Ұׅૠೖ • partition key׆༻Ͱޮ཰తͳΞΫηε • ίετ؂ࢹ&ͳΜ͔͋ͬͨΒSlack౳Ͱ௨ใ • JX௨৴ࣾͰ΋BigQueryΊͬͪΌ׆༻͍ͯ͠·͢ https://tech.jxpress.net/entry/kowakunai-bigquery
  13. JupyterLab͔ΒBigQueryͰ͍͍ײ͡ʹ • JupyterLabΛϕʔεͱͨ͠؀ڥ • Pandas • scikit-learn • plotly •

    BigQuery Client͔Βͦͷ·· Dataframeʹ͍͍ͯ͠ײ͡ʹॲཧ • ͜ͷޙͷΫϥελϦϯάͱ͔͸શ෦͜Ε
  14. ANNʢۙࣅ࠷ۙ๣୳ࡧʣͰڑ཭ΛٻΊΔ • ώοτ, ຊྥଧ, etc…౳ͷ୅දతͳ੒੷. ৄࡉ͸ൿີ' • ग़৔ࢼ߹਺ͱ͔஍ຯͳ੒੷. ͜Ε΋ൿີ' •

    ্هΛಛ௃ྔͱͯ͠ANNʢۙࣅ࠷ۙ๣୳ࡧʣΛ͔ͭͬͯ ϢʔΫϦουڑ཭Λࢉग़͠, ͍ۙબखΛूΊΔ͜ͱʹ. • ଧऀ༧ଌͱ͸ผωλͰࢼ͠, ݁Ռ্ʑͩͬͨͷͰͦͷ··࠾༻ https://shinyorke.hatenablog.com/entry/feature-faridyu-san • ࣮૷͸Annoyͱ͍͏௒ศརͳϥΠϒϥϦΛ࢖͍·ͨ͠. • ࣮ݧίʔυΛ৮ͬͯյΕΔͱΞϨͳͷͰGitHub ActionsͰAuto Test
  15. ࣮ࡍͲ͏΍ͬͯ΍͔ͬͨ? 1. બखXʹࣅ͍ͯΔબखYʢෳ਺ਓʣͷ25ࡀ࣌఺ͷ੒੷ΛूΊΔ. 2. 1.ͷ੒੷σʔλΛݩʹ, ʮڧ͍ɾී௨ɾऑ͍ʯతͳlabelΛ෇͚Δ. 3. 1.Λtraining data, 2.Λlabelͱͨ͠෼ྨλεΫΛ࣮ࢪ

    4. બखXͷ25ࡀ੒੷σʔλΛ࢖ͬͯ༧ଌ. 5. ฦ͖ͬͯͨlabelͱಉ͡label͕෇͍ͨબखͷ੒੷Λݩʹ༧ଌ੒੷Λ࡞੒. ෼ྨ͸φΠʔϒϕΠζ, ࣮૷͸scikit-learnͰΤΠοͱ΍ͬͨʢίʔυ͸ׂѪʣ
  16. • ౷ܭɾػցֶशεΩϧΛຏ͘, ΤϯδχΞϦϯάɾεΩϧΛ ৳͹͢໨తͰʮݸਓ։ൃͰػցֶशʯΛڧ͓͘͢͢Ί͠·͢ʂ • ͜ͷൃද͸ۓٸࣄଶએݴதͷࣗॗظؒʹ΄΅΍Γ͖Γ·ͨ͠. ʮDone is better than

    perfectʯΛStay Homeظؒͷ͓͔͛Ͱ΍Εͨ. • ͪͳΈʹ, ݸਓ։ൃΛ࠳ંͤͣଓ͚Δ࿩͸ͪΐͬͱલʹॻ͍ͨ https://shinyorke.hatenablog.com/entry/botti-development σʔλαΠΤϯςΟετͦ͜ݸਓ։ൃΛ