Sports Data Analysis Basics for Python Users: With PySpark and Major League Data on the Side #PyConJP 2022

Shinichi Nakagawa

October 15, 2022

Transcript

  1. No Baseball, No Engineering!
    High Performance Data Platform
    Knowledge of PySpark, Cloud and ⚾
    Sports Data Analysis Basics for Python Users: With PySpark and Major League Data on the Side
    Shinichi Nakagawa@shinyorke 2022/10/15 PyConJP 2022 Talk Session

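  2. Onboarding (about this session)
    • This talk is about comfortably processing several GB or more of data with Python, Spark (PySpark), and a public cloud (Google Cloud).
    • The content is aimed at intermediate to advanced users, though I hope it also works as a guidepost for beginners.
      (Roughly: treat anything you don't know or understand yet as your own room to grow.)
    • The subject data is Major League Baseball ⚾, with a little on sports data in general.
    • I hope those who aren't interested in (or don't like) baseball can enjoy it too.
      I'll do my best to make this talk a reason to get interested in baseball.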

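  3. Prerequisite knowledge and motivation I expect from you
    • [Must] You have done hands-on data processing and analysis with Pandas or SQL.
    • [Must] You have used Python on a public cloud such as Google Cloud (GCP), AWS, or Azure.
      Note: any service counts (EC2, App Engine, etc.).
    • Development experience in a fully managed serverless environment (having touched one is enough).
      AWS Lambda, AWS App Runner, App Engine, Cloud Run, and the like qualify.
    • (Like it or not) you know the rules of baseball and who Ohtani-san is.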

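  4. Who am I?
    (Who are you, anyway?)
    • Shinichi Nakagawa@shinyorke
    • Manager at a major foreign-owned IT consulting firm
      (formerly a full-cycle engineer at an operating company)
    • Manager of a team that handles cloud infrastructure
    • Doing personal development as both a hobby and a practical benefit
      (mainly for baseball and physical care)
    • Love baseball, and programming over a drink.
    • Favorites: Tsuyoshi Shinjo, Chusei Mannami, Kenta Tanigawara (his strong arm)
    #Python #Serverless #GoogleCloud #Baseball
    #DataScience #SABRmetrics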

  5. Today's starting lineup
    • Playing with Major League big data
    • A nice serverless data platform built with Python and Google Cloud
    • Serverless data processing with PySpark + Dataproc
    • The "insane ○○" backed by baseball big data

  6. Playing with Major League big data

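  7. Major League big data
    • Major League Baseball records all kinds of data with a system called "Statcast".
      Note: recorded by cameras and radar; some statistics are recorded manually.
    • For example, play-by-play calls like these all draw on the "Statcast" big data:
      • Ohtani-san! Home run No. ○! Exit velocity 180 km/h, distance 130 m
      • Ohtani-san! Strikeout looking on a 162 km/h fastball!!!
    • Every move in the game, all pitch and batted-ball data, is recorded.
    • A regular season (30 teams, 162 games) comes to roughly 700,000 to 800,000 pitches. Postseason and spring training data exist too.
    • Each record consists of 91 fields (!?); a regular season amounts to roughly 400 MB to 600 MB of data.
    • Anyone can browse and download it (in CSV format) at baseballsavant.mlb.com.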

  8. The (official) data specification is here:
    https://baseballsavant.mlb.com/csv-docs
    My commentary and translation is here:
    https://shinyorke.hatenablog.com/entry/statcast-csv-docs-ja
    I'll give you a quick look at each data field.

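  9. ???: "It hurts... because I can't make sense of the fields and what they mean."
    All 91 fields, along with their units and measurement criteria, are brutal for first-timers (and congrats to Arai-san on becoming Hiroshima's manager).

  10. Ohtani-san's 2022, looked back on through Statcast data
    Let's walk through Statcast data using this as the example.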

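  11. https://github.com/Shinichi-Nakagawa/baseball-savant-shohei-ohtani2022
    I'll explain using the Statcast sample above; the code is public, so please play with it.
    Note: the original data is in mph and feet, but it has been converted to km/h and m in advance (beware that the converted values are not in the original data).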
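
The unit conversion mentioned in the note can be sketched in a few lines of plain Python. `release_speed` (mph) and `hit_distance_sc` (feet) are Statcast CSV fields; the output column names and helper functions below are assumptions for illustration, not the names used in the author's repository.

```python
MPH_TO_KMH = 1.609344  # 1 mile = 1.609344 km (exact)
FEET_TO_M = 0.3048     # 1 foot = 0.3048 m (exact)


def mph_to_kmh(mph: float) -> float:
    """Convert miles per hour to kilometres per hour."""
    return mph * MPH_TO_KMH


def feet_to_m(feet: float) -> float:
    """Convert feet to metres."""
    return feet * FEET_TO_M


def add_metric_columns(row: dict) -> dict:
    """Return a copy of one Statcast row with hypothetical metric columns added."""
    out = dict(row)
    out["release_speed_kmh"] = round(mph_to_kmh(float(row["release_speed"])), 1)
    out["hit_distance_m"] = round(feet_to_m(float(row["hit_distance_sc"])), 1)
    return out
```

Keeping the original columns and adding new ones, rather than overwriting, makes it easy to cross-check values against Baseball Savant.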

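  12. Ohtani-san in 2022 becomes a slider, two-seamer, and cutter specialist
    • This year's Ohtani-san throws a ton of sliders
    • Did you notice? In the second half, his two-seamers (Sinker in the data) increased!?
    • He has recast himself as a guy who throws quirky breaking balls: slider, two-seamer, cutter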

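  13. One of Ohtani-san's starts (2022/9/29: 8 innings, 10 strikeouts, no runs allowed)
    The picture: nearly half sliders, then pressing with two-seamers and cutters
    Figure panels: pitch locations (catcher's view), release points (catcher's view), pitch type share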
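
A pitch-type breakdown like the one on this slide can be computed from the `pitch_type` column of a Statcast CSV. The sketch below uses only the standard library; the sample rows are made up for illustration (they are not Ohtani's actual pitches), and `pitch_mix` is a hypothetical helper name.

```python
import csv
import io
from collections import Counter


def pitch_mix(rows) -> dict:
    """Return each pitch type's share (0..1) among rows that have a pitch_type."""
    counts = Counter(r["pitch_type"] for r in rows if r.get("pitch_type"))
    total = sum(counts.values())
    if total == 0:
        return {}
    return {pitch: n / total for pitch, n in counts.most_common()}


# Tiny made-up sample in Statcast CSV shape (SL = slider, SI = sinker, FC = cutter).
SAMPLE = """pitch_type,release_speed
SL,85.1
SL,84.7
SI,97.0
FC,90.3
"""

mix = pitch_mix(csv.DictReader(io.StringIO(SAMPLE)))
```

Grouping the same calculation by `game_date` would show how the mix shifts over the season, which is the kind of change the slides point out.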

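  14. Viewing Statcast data with Jupyter Lab + Plotly
    • There is a wide variety of data, so quite a lot can be learned from it.
    • It is time-series data, so changes in performance show up too.
      • Velocity differs from before, two-seamers suddenly increased, and so on.
    • Ball velocity and coordinate data are all there.
      • With some effort on the coordinate math, even 3D rendering is possible.
        (Translation: I had no time for that effort this time, so I didn't do it.)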

  15. Me: "I want a setup for checking every game, every day"
    Partly because https://baseballsavant.mlb.com/ is a little awkward to use... lol
    One day the idea hit me: just turn it into an easy-to-use data platform!

  16. So, I built a little something.

  17. A nice serverless data platform built with Python and Google Cloud (baseball edition)

  18. Architecture overview

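  19. Architecture notes (≈ the points I cared about)
    • To check and update the data every day without hassle, everything is built and operated on "fully managed, serverless cloud services".
    • What is a "fully managed, serverless cloud service"?
      • It comes up after a few clicks or commands in the CLI or console
      • No infrastructure or server maintenance (the cloud provider does it, not you)
      • More concretely, you don't have to stand up a K8s cluster or VMs yourself (network and similar settings are still needed)
      • It can be wired into a CI/CD pipeline such as GitHub Actions for deploying and scaling
      • Billing is basically pay-as-you-go, so it's easy on the wallet

  20. Use cases and the services used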

  21. Dashboard app
    • The app itself is hosted on Cloud Run and reaches the backend (Cloud Functions) through API Gateway
    • Firestore is the main DB, with MemoryStore (Redis) in place as a cache
    • Spark (PySpark) does not appear here

  22. Data collection & saving to BigQuery
    • A crawler (Cloud Functions) periodically collects data from the source site (Baseball Savant)
    • The results are saved as CSV to Google Cloud Storage (GCS). This is the source data (data lake)
    • A PySpark script running on Dataproc Serverless summarizes the CSVs on GCS and saves them to BigQuery

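  23. Loading into Firestore (moving data to the database)
    • A PySpark script running on Dataproc Serverless converts the BigQuery data into the dashboard's data format (JSON)
    • A Python script loads the results (saved on GCS as JSON) into Firestore
    • Both are run manually for now (reasons and workarounds come later)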

  24. Serverless data processing with PySpark + Dataproc
    Note: this is the main topic of this talk.

  25. Scope of this part

  26. Spark and PySpark, (sort of) understood in 33.4 seconds

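  27. Spark and PySpark
    • A framework for "processing large data in a nicely distributed way"
    • Spark itself is implemented mainly in Scala and runs on the JVM, but it is most often used through its Python interface, PySpark (R is among the other supported languages).
    • Run it as a batch data-processing program, or interactively in a notebook with Jupyter Lab or Zeppelin.
    • It offers a DataFrame interface that feels very familiar to Python users.
      • Spark's own DataFrame, which can be converted to a Pandas DataFrame
      • Pandas API (Pandas functionality is available from Spark 3.2 onward, with some restrictions)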

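  28. Where to build and operate Spark
    Columns: environment and approach / setup effort / ease of operation / notes
    • Build and operate everything yourself on-premises
      Setup: everything must be configured yourself
      Operation: you have to look after everything yourself
      Notes: the hardest pattern; tough work even for professional infrastructure engineers
    • Build and operate it yourself on cloud VMs or K8s
      Setup: everything must be configured yourself
      Operation: you get some of the benefits of cloud services
      Notes: building a Spark environment yourself is fairly difficult
    • Use a managed service offered by a cloud provider (the most recommended approach)
      Setup: click through the GUI, or run it via CLI/API
      Operation: monitor resources such as CPU and do maintenance as the situation requires
      Notes: the easiest and smartest way; AWS, Google Cloud, and other vendors offer such services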

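  29. Options for operating Spark on Google Cloud
    Columns: environment and approach / setup / operation / available features / notes
    • Build and operate an environment on GCE or GKE
      Setup: build it yourself, then install Spark
      Operation: entirely on you; you must look after it
      Features: all features
      Notes: not recommended, since in the end Dataproc can do the same thing
    • Dataproc
      Setup: via the gcloud command, the API, or the console
      Operation: monitor and operate the GKE or GCE environment that Dataproc creates
      Features: all features
      Notes: the most standard setup
    • Dataproc Serverless
      Setup: via the gcloud command, the API, or the console
      Operation: monitoring during execution only; the environment is deleted automatically after processing
      Features: batch processing only; notebooks are not available
      Notes: the best choice for periodic batch processing
    Note: Spark in BigQuery, a feature for running Spark as BigQuery stored procedures, is planned (per Google Cloud Next '22).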

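  30. Dataproc and Dataproc Serverless
    • Google Cloud has a managed Spark (Hadoop) service called Dataproc.
    • Until now it could only be operated on the premise that a host or cluster exists, on GCE or GKE (K8s), but a Serverless option arrived very recently.
    • For batch operation such as "once a day" or "every 30 minutes", Serverless works!
      Note that notebook execution (Jupyter, etc.) is not supported, so it cannot be used ad hoc.
    • Serverless is pay-as-you-go, so it's easy on the wallet.
    • The version is Spark 3.2, so the Pandas API is available from PySpark (though I didn't use it this time).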

  31. Tasks I did with PySpark
    • Data collection & loading into BigQuery
    • Loading into the dashboard app's DB (Firestore)

  32. [Recap] Data collection & saving to BigQuery
    • A crawler (Cloud Functions) periodically collects data from the source site (Baseball Savant)
    • The results are saved as CSV to Google Cloud Storage (GCS). This is the source data (data lake)
    • A PySpark script running on Dataproc Serverless summarizes the CSVs on GCS and saves them to BigQuery

  33. Data collection (not Spark)
    • Web scraping is not something to do with Spark.
    • The task is implemented with requests-html and run on Cloud Functions.
    • It runs periodically via a cron setting in Cloud Scheduler and saves to GCS.
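
The collection step could look roughly like the sketch below: fetch a CSV with requests-html and store it in GCS under a date-based name. The object layout, function names, and the `csv_url` parameter are assumptions for illustration; the slides do not show the actual crawler or which Baseball Savant URL it requests.

```python
from datetime import date


def object_name(day: date) -> str:
    """Build a date-partitioned GCS object name (hypothetical layout)."""
    return f"statcast/{day:%Y/%m}/statcast_{day:%Y%m%d}.csv"


def crawl_and_store(csv_url: str, bucket_name: str, day: date) -> str:
    """Fetch one day's CSV and store it in GCS; returns the gs:// URI."""
    # Imported lazily so the pure helper above works without these packages.
    from google.cloud import storage
    from requests_html import HTMLSession

    response = HTMLSession().get(csv_url)
    response.raise_for_status()

    blob = storage.Client().bucket(bucket_name).blob(object_name(day))
    blob.upload_from_string(response.text, content_type="text/csv")
    return f"gs://{bucket_name}/{blob.name}"
```

A Cloud Functions entry point would call `crawl_and_store` with the current date, and Cloud Scheduler would invoke that function on a cron schedule.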

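  34. Loading CSV data into BigQuery
    • A suitable scope of work for a task running on Dataproc
    • Pick up files from a GCS path, process them with Spark SQL, and send them to BigQuery
    • If you know DataFrames and SQL, you can implement and operate it comfortably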

  35. Let's use Dataproc
    • A good way in is to type out the Google Cloud docs and samples by hand as you go
      • https://cloud.google.com/dataproc
      • https://cloud.google.com/dataproc-serverless/docs
      • https://github.com/GoogleCloudDataproc/cloud-dataproc
    • For Serverless, you must create a VPC subnet in advance and specify it at run time.
    • From the next page, I'll show a few samples for doing this with PySpark.
      • Batch processing of the "read data, transform it, write it" kind, based on Spark DataFrames.
      • Implemented with type hints to make it clear which class is which.

  36. A first implementation
    1. Create a session
    • Something like a DB connection
    • Create a SparkSession object
    • Specify parameters when creating the object
    • Note that specifying the JAR is required when using BigQuery
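
A minimal sketch of the session step, assuming the publicly hosted spark-bigquery connector JAR and a hypothetical staging bucket; the code on the original slide is not in the transcript, so the function names and settings below are illustrative rather than the author's.

```python
from __future__ import annotations

# Connector JAR published by Google; pinning a specific version is safer in practice.
BIGQUERY_JAR = "gs://spark-lib/bigquery/spark-bigquery-latest_2.12.jar"


def session_conf() -> dict[str, str]:
    """Settings applied while building the session."""
    return {"spark.jars": BIGQUERY_JAR}


def create_session(app_name: str, temp_bucket: str):
    """Create (or reuse) a SparkSession that can talk to BigQuery."""
    from pyspark.sql import SparkSession  # lazy import: requires pyspark

    builder = SparkSession.builder.appName(app_name)
    for key, value in session_conf().items():
        builder = builder.config(key, value)
    spark = builder.getOrCreate()

    # Staging bucket (hypothetical name supplied by the caller) that the
    # BigQuery connector uses when writing.
    spark.conf.set("temporaryGcsBucket", temp_bucket)
    return spark
```

On Dataproc Serverless the JAR can also be supplied with `--jars` when the batch is submitted, instead of in code.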

  37. A first implementation
    2. Create a schema
    • For CSV, create a schema
    • It is essential for giving the resulting DataFrame its types
    • This time I gritted my teeth and wrote the schema for all 91 fields (tears)
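
The schema step might be sketched as follows. Only the first ten Statcast columns are listed, and the chosen Spark types are assumptions; a real schema has to cover all 91 columns in file order, because Spark applies a CSV schema by position.

```python
from __future__ import annotations

# First ten Statcast CSV columns with assumed types (the real file has 91).
STATCAST_FIELDS: list[tuple[str, str]] = [
    ("pitch_type", "string"),
    ("game_date", "date"),
    ("release_speed", "double"),
    ("release_pos_x", "double"),
    ("release_pos_z", "double"),
    ("player_name", "string"),
    ("batter", "integer"),
    ("pitcher", "integer"),
    ("events", "string"),
    ("description", "string"),
]


def build_schema(fields=STATCAST_FIELDS):
    """Turn (name, type) pairs into a nullable Spark StructType."""
    from pyspark.sql.types import (  # lazy import: requires pyspark
        DateType, DoubleType, IntegerType, StringType, StructField, StructType,
    )

    type_map = {
        "string": StringType,
        "date": DateType,
        "double": DoubleType,
        "integer": IntegerType,
    }
    return StructType(
        [StructField(name, type_map[kind](), True) for name, kind in fields]
    )
```

Keeping the field list as plain data makes a 91-entry schema easier to review than 91 hand-written `StructField` calls.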

  38. A first implementation
    3. Read the CSV
    • Use the Spark session's read, specifying CSV as the format
    • Specify the schema from the previous step for the header
    • Specify the full GCS path
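
Reading the CSV could be sketched like this, assuming a `spark` session and a `schema` object built beforehand; the function name and the path shape in the comment are illustrative.

```python
from __future__ import annotations


def read_statcast_csv(spark, path: str, schema):
    """Load CSV files from a GCS path into a Spark DataFrame with an explicit schema."""
    return (
        spark.read.format("csv")
        .option("header", "true")  # first line holds column names, not data
        .schema(schema)            # explicit types instead of all-string columns
        .load(path)                # e.g. "gs://<bucket>/<prefix>/*.csv"
    )
```

Passing a glob as `path` reads every matching file into one DataFrame, which suits a data lake that gains one CSV per crawl.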

  39. A first implementation
    4. Save to BigQuery
    • Use the DataFrame's write function
    • Specify bigquery
    • The example appends to an existing table
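
An append to an existing BigQuery table, sketched with the spark-bigquery connector's DataFrame writer. The table ID and bucket are supplied by the caller and are hypothetical; the function name is an assumption.

```python
def append_to_bigquery(df, table: str, temp_bucket: str) -> None:
    """Append a Spark DataFrame to an existing BigQuery table."""
    (
        df.write.format("bigquery")
        .option("temporaryGcsBucket", temp_bucket)  # staging area for the load
        .mode("append")                             # keep existing rows, add new ones
        .save(table)                                # e.g. "dataset.table"
    )
```

With `mode("append")`, re-running the same batch loads the same rows twice, so a daily job needs some guard against duplicate runs.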

  40. Running it with Dataproc Serverless
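
The command on this slide is not in the transcript. As a rough sketch, a Dataproc Serverless PySpark batch is submitted with `gcloud dataproc batches submit pyspark`; the helper below only assembles that command line, and every argument value is a placeholder chosen by the caller.

```python
from __future__ import annotations


def batch_submit_command(
    main_py: str, region: str, subnet: str, jars: list[str], deps_bucket: str
) -> list[str]:
    """Build the argv for submitting a PySpark batch to Dataproc Serverless."""
    return [
        "gcloud", "dataproc", "batches", "submit", "pyspark", main_py,
        f"--region={region}",
        f"--subnet={subnet}",            # the VPC subnet created in advance
        f"--jars={','.join(jars)}",      # e.g. the BigQuery connector JAR
        f"--deps-bucket={deps_bucket}",  # bucket for uploading local dependencies
    ]
```

The resulting list can be handed to `subprocess.run(cmd, check=True)` or joined into a shell command for a CI step.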

  41. Writing files from BigQuery to GCS, for Dataproc
    • Load BigQuery data into a Spark DataFrame
    • Write the Spark DataFrame out to files
    The way to run it (gcloud CLI) is unchanged, so I'll skip it.

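  42. [Recap] Loading into Firestore (moving data to the database)
    • A PySpark script running on Dataproc Serverless converts the BigQuery data into the dashboard's data format (JSON)
    • A Python script loads the results (saved on GCS as JSON) into Firestore
    • Both are run manually for now (reasons and workarounds come later)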

  44. A first implementation
    5. Read from BigQuery
    • As with saving, specify the BigQuery JAR
    • Specify BigQuery in spark read
    • Reading from a BigQuery view requires extra options
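
The extra options for views are not spelled out in the transcript. With the spark-bigquery connector they are typically `viewsEnabled` and `materializationDataset`; the sketch below assumes that is what the slide refers to, and the function and argument names are illustrative.

```python
from __future__ import annotations


def read_bigquery(spark, table: str, materialization_dataset: str | None = None):
    """Read a BigQuery table, or a view when a materialization dataset is given."""
    reader = spark.read.format("bigquery")
    if materialization_dataset is not None:
        # Views must be enabled explicitly; the connector materializes the
        # view's result into this dataset before reading it.
        reader = reader.option("viewsEnabled", "true").option(
            "materializationDataset", materialization_dataset
        )
    return reader.load(table)
```

Plain tables need neither option, so callers pass the dataset only when the source is a view.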

  45. A first implementation
    6. Save to GCS
    • Use the DataFrame's write function
    • Specify json
    • Specify the final path
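
Writing the DataFrame out as JSON is a one-liner with the DataFrame writer. The overwrite mode and the function name below are assumptions; the slide only states that json and the final path are specified.

```python
def write_json(df, path: str) -> None:
    """Write a Spark DataFrame to a GCS path as JSON."""
    # Spark writes a directory of part files, one JSON object per line.
    df.write.mode("overwrite").json(path)
```

Because the output is a directory of part files rather than a single document, the script that loads the data into Firestore has to read every part file under `path`.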

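  46. PySpark and Dataproc Serverless
    • It enables the use case of "using Spark only when you want to".
      This is the biggest reason to use a serverless service.
    • At this application's data size (under 1 GB per year) the benefit doesn't show, but for use cases like "quick batch processing of a few GB per day" it seems quite handy (preprocessing, cleansing, and so on).
    • The code is lightweight, of the "run only when processing" kind, so it pairs very well with PySpark.
    • Automating the processing has some quirks: it requires Cloud Composer (Airflow).
      Note: see the Appendix of this deck for details.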

  47. The technical part ends here.
    From here on...

  48. It's baseball tiiiiime ⚾

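  49. Pro baseball 2022: the five moments that got to me (BEST 5)
    1. Fighters: athletic youngsters such as Mannami, Kiyomiya, and Tamiya emerge
    2. FIGHTERS GIRL 2022: the razor-sharp Kitsune Dance becomes a big hit,
       beating out the many ceremonial first-pitch videos in view counts on Pacific League TV
    3. The retirement of Yoshio Itoi and Kosuke Fukudome, who defined an era with power, strong arms, speed, and defense
    4. A wild bird that settled in at the ballpark loses to Chiba Lotte outfielder Kakunaka, who waved his bat at it
    5. Roki Sasaki's perfect game, and Munetaka Murakami's Triple Crown plus home run record. Insane, right?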

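  51. The "insane outfielders" backed by Statcast (and me)
    • Athletic freaks, razor-sharp
    • Power, a strong arm, good defense, and speed are their selling points
    • A wild streak in the way they swing the bat
    • I'll introduce three favorites, picked by exit velocity
    • Wonderful, out-of-this-world outfielders in the mold of Tsuyoshi Shinjo in his playing days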

  52. The insane outfielders introduced today
    • Judge, Aaron
    • Rodríguez, Julio
    • Buxton, Byron
    Three players: outfielders with 300 or more plate appearances, high exit velocity and plenty of extra-base hits, who as a rule play center field.

  53. Aaron Judge (2022 home run champion)
    • A Yankees slugger and Ohtani-san's rival
    • The strongest active home run hitter
    • Not just power: outfield defense that makes use of his 2 m height, and the mobility to play center field

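  54. Julio Rodríguez (Seattle's rising star)
    • A rising star who burst onto the scene with the Mariners; a rookie this year, by the way
    • Numbers like BIG BOSS in his younger days; his athletic play is the draw
    • If his launch angle rises and his barrels increase, might he inch closer to Ichiro?
      Hoping he lives up to his 10-year contract!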

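  55. Byron Buxton (Minnesota's Mannami)
    • The Minnesota Twins' fixture in center field
    • Outrageous speed and arm strength that make you think he could play other sports too, and yet his launch angle is that of a power hitter
    • With his rough edges and great physique, he resembles Chusei Mannami (Fighters).
      Manchu, become the Buxton of the north!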

  56. He didn't play the outfield this year,
    but this man was an insane batter too

  57. Ohtani-san!! Be still my heart
    He ranked 2nd in maximum exit velocity among players with 300 or more plate appearances (by my own count)

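  58. [Recap] Today's starting lineup
    • Playing with Major League big data
    • A nice serverless data platform built with Python and Google Cloud
    • Serverless data processing with PySpark + Dataproc
    • The "insane athletic outfielders" backed by big data
    Did you enjoy it? There was a lot of information, so it may take a while to digest.
    The slides will be published, so please read them as a review.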

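  59. To summarize today's talk...
    • Baseball is a fun subject for sports data analysis!
      The tracking data at Baseball Savant is a good place to start.
    • PySpark is a good choice for comfortable data processing in Python.
    • PySpark runs in the cloud; today I introduced Dataproc.
    • Being able to use the cloud serverlessly makes many things easier (with some restrictions).
    • MLB has insane outfielders, but Ohtani-san, the slider-and-two-seamer monster, is strong.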

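  60. For those thinking of using this as a reference at work
    • The approach and architecture shown here are not the definitive answer or a best practice.
      For example, there are definitely situations where you should, and should not, go serverless.
    • This talk is a culmination of what I (shinyorke) wanted to do, packed with things I think are good (and things I wanted to try); it is only one way of reaching an answer.
    • More to the point, I built it as a prototype to see how far serverless and PySpark can take you, and I actually plan to drop Spark going forward (details in the Appendix).
    • Copying it as-is (with only a half understanding of the context) will blow up in your face.
      Get hands-on first, and use this as a reference while you learn, run things, and find what works.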

  61. [Continued] Appendix: a slightly more detailed look
    • Running Dataproc Serverless automatically
    • The state of Spark services on AWS and other clouds, 2022
    • Basics of largish data processing without Spark, for Google Cloud
    • Building a nice data visualization app with Dash + Cloud Run
    If you're curious, read on in the slides; if you're at the venue, let's talk in the Q&A!

  62. Thank you for listening ⚾
    Shinichi Nakagawa@shinyorke

  63. Sports Data Analysis Basics for Python Users: With PySpark and Major League Data on the Side
    Bonus: "Tips and references not covered in the main talk, all at once"

  64. Appendix: a slightly more detailed look
    • Running Dataproc Serverless automatically
    • The state of Spark services on AWS and other clouds, 2022
    • Basics of largish data processing without Spark, for Google Cloud
    • Building a nice data visualization app with Dash + Cloud Run
    • References

  65. Running Dataproc Serverless automatically

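  66. Automating Dataproc Serverless
    • There are three conceivable approaches.
      1. Build a Docker image that runs the job via the API,
         and run it as a container somehow (K8s, etc.)
      2. It can be run from the CLI (gcloud command), so build a Docker image
         with the gcloud command (the rest is the same as 1.)
      3. Run Dataproc Serverless with an Airflow Operator
    • 1 and 2 are painful and may defeat the purpose of serverless (and they say almost the same thing).
      Running them on Cloud Run or similar would be fine, but I suspect risks in both setup and operation.
    • The model answer, closest to a best practice, is 3: run Dataproc Serverless with an Airflow Operator.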

  67. [Model answer] Run it through an Airflow Operator
    Google Cloud's managed service "Cloud Composer" looks like a good fit

  68. Automating Dataproc Serverless processing
    • For automating cron-like jobs on Google Cloud, there is an established best practice of using Pub/Sub + Scheduler (or Cloud Tasks).
    • However, as of October 2022, Dataproc Serverless has no way of being triggered through Pub/Sub,
      so unfortunately this approach is unavailable.
    • So the smartest way is to run it with Airflow's Dataproc operators,
      standing up and operating an Airflow cluster with Cloud Composer.
      • https://cloud.google.com/composer/docs/composer-2/run-dataproc-workloads
    • Note that Cloud Composer is not serverless (though it is fully managed),
      and it stands up a K8s (GKE) cluster, so watch the cost (fine for work, but expensive for personal use).
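
A DAG along these lines could use `DataprocCreateBatchOperator` from the Airflow Google provider, assuming Airflow 2.x as shipped with Cloud Composer 2. The DAG ID, schedule, batch ID pattern, and all URIs are placeholders; this is a sketch of the approach, not the author's DAG.

```python
from __future__ import annotations

from datetime import datetime


def batch_config(main_uri: str, subnet_uri: str, jar_uris: list[str]) -> dict:
    """Dataproc Serverless batch definition (keys follow the Batch resource)."""
    return {
        "pyspark_batch": {
            "main_python_file_uri": main_uri,
            "jar_file_uris": jar_uris,
        },
        "environment_config": {
            "execution_config": {"subnetwork_uri": subnet_uri},
        },
    }


def build_dag(project_id: str, region: str, batch: dict):
    """Create a daily DAG that submits one Dataproc Serverless batch per run."""
    # Lazy imports: these require Airflow and its Google provider package.
    from airflow import DAG
    from airflow.providers.google.cloud.operators.dataproc import (
        DataprocCreateBatchOperator,
    )

    with DAG(
        dag_id="statcast_daily_batch",      # hypothetical name
        start_date=datetime(2022, 10, 1),
        schedule_interval="0 9 * * *",      # cron-style schedule (Airflow 2.x)
        catchup=False,
    ) as dag:
        DataprocCreateBatchOperator(
            task_id="run_pyspark_batch",
            project_id=project_id,
            region=region,
            batch=batch,
            batch_id="statcast-{{ ds_nodash }}",  # must be unique per run
        )
    return dag
```

In a DAG file, the result of `build_dag(...)` would be assigned to a module-level variable so the Airflow scheduler can discover it.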

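  69. Using Spark in the cloud
    Outside Google Cloud

  70. Spark service options outside Google Cloud
    • AWS, Azure, and Databricks (in a sense the original home of Spark) are the candidates.
    • For use cases that treat public clouds as infrastructure,
      Databricks becomes the leading candidate (for example, when you want to go multi-cloud).
    • AWS is actually well stocked in this area; choosing between EMR and Glue
      to suit the use case seems like a good idea.
    • I've never touched Azure, so I can't say...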

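  71. Spark service options outside Google Cloud
    Columns: cloud service (not an exhaustive list) / URL / summary
    • Databricks
      URL: https://www.databricks.com/jp
      Summary: an option when multi-cloud is in view; developed and offered by the creators of Spark
    • AWS EMR
      URL: https://aws.amazon.com/jp/emr/
      Summary: AWS's managed Spark/Hadoop; the one to pick when using it as Spark
    • AWS Glue
      URL: https://aws.amazon.com/jp/glue/
      Summary: when using Spark for ETL, Glue is a better choice than EMR
    • Azure HDInsight
      URL: https://azure.microsoft.com/ja-jp/services/hdinsight/#overview
      Summary: the option on Azure (though I have never touched it...)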

  72. Nice data processing without Spark (Dataproc),
    for Google Cloud

  73. Nice data processing for Google Cloud
    • Dataflow
    • DataFusion
    • Dataprep
    • Cloud Run
    • Cloud Functions

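  74. Pick the right one for the job!
    Columns: Google Cloud service / URL / summary
    • Dataflow
      URL: https://cloud.google.com/dataflow?hl=ja
      Summary: based on Apache Beam; the one for streaming processing
    • DataFusion
      URL: https://cloud.google.com/data-fusion/docs?hl=ja
      Summary: an ETL-style service for ingesting existing data, including on-premises data
    • Dataprep
      URL: https://cloud.google.com/dataprep?hl=ja
      Summary: centered on data preprocessing and cleansing; leans low-code
    • Cloud Run
      URL: https://cloud.google.com/run?hl=ja
      Summary: the one for building with your favorite language and framework; triggered via Pub/Sub and the like
    • Cloud Functions
      URL: https://cloud.google.com/functions?hl=ja
      Summary: more constrained than Cloud Run, but good for building and running something quickly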

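  75. Realistic options and rules of thumb
    • For real-time processing, Dataflow is the leading option.
    • For integrating or consolidating existing data, DataFusion.
    • For data preprocessing for machine learning and the like, Dataprep.
    • For building and running something yourself, in Python or otherwise, Cloud Run.
    • For something on the level of "use Pandas with BigQuery and GCS", Cloud Functions
      gets it done quickly (isn't this use case actually quite common?).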

  76. A data visualization dashboard operated with Dash + Cloud Run
    Note: Spark and Dataproc do not appear here.

  77. Dashboard app (the part skipped in the main talk)
    • The app itself is hosted on Cloud Run and reaches the backend (Cloud Functions) through API Gateway
    • Firestore is the main DB, with MemoryStore (Redis) in place as a cache
    • Spark (PySpark) does not appear here

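  79. Hosting with Dash + Cloud Run
    • Dash is built on Flask, so it can be hosted the usual way, by running it under gunicorn.
    • This is serverless too, so it only runs for the time and resources you use;
      worth a try if you want to build your own visualization app.
    • On AWS, I think the same approach works with App Runner (though I haven't tried it).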

  80. The CI/CD workflow looks like this.
    • A push to the GitHub repository fires GitHub Actions: test -> Docker build -> deploy to Cloud Run
    • Tests run pytest, flake8, and mypy on GitHub Actions (the idea is to cover unit and integration tests)
    • The Docker build follows the standard Cloud Run approach.
      • Build on Cloud Build
      • Push to Artifact Registry
    • Deployment to Cloud Run uses the official GitHub Action.

  81. Spark / PySpark references
    • PySpark Documents
      https://spark.apache.org/docs/latest/api/python/
    • 入門PySpark (Introduction to PySpark). Note: a somewhat dated book, so take care with the content.
      https://www.oreilly.co.jp/books/9784873118185/
    • Large-scale data processing in Python! Basics of data processing and analysis with PySpark (PyCon JP 2017)
      https://speakerdeck.com/chie8842/pythondeda-liang-detachu-li-pysparkwoyong-itadetachu-li-tofen-xi-falsekihon

  82. Google Cloud (Dataproc)
    • Official documentation
      https://cloud.google.com/dataproc/docs?hl=ja
    • Official PySpark samples (typing these out by hand is an easy way to start)
      https://github.com/googleapis/python-dataproc
    • Official samples, part 2 (more practical)
      https://github.com/GoogleCloudDataproc/cloud-dataproc

  83. Google Cloud (for beginners and those who want to start using it)
    • Official documentation
      https://cloud.google.com/docs?hl=ja
    • Certifications
      https://cloud.google.com/certification?hl=ja
    • エンタープライズのためのGoogle Cloud (Google Cloud for the Enterprise; a book I recommend)
      https://www.shoeisha.co.jp/book/detail/9784798175256

  84. My own blog posts (PySpark and data related)
    • Making baseball big data nicely usable with GCP and PySpark
      https://shinyorke.hatenablog.com/entry/dataproc-baseball
    • How to use Spark without managing servers
      https://shinyorke.hatenablog.com/entry/dataproc-serverless
    • Quickly getting an environment for using Spark on Google Cloud
      https://shinyorke.hatenablog.com/entry/dataproc-terraform
    • Practices for quickly standing up a web app and a data platform
      https://shinyorke.hatenablog.com/entry/cloud-arch-serverless

  85. Baseball-related reference blogs and code
    • An introduction to Statcast data for baseball lovers and data lovers
      https://shinyorke.hatenablog.com/entry/statcast-csv-docs-ja
    • Visualizing where batted balls land with Statcast data and Plotly
      https://shinyorke.hatenablog.com/entry/statcast-visualization-for-batting
    • A sample for viewing Ohtani-san's data from Baseball Savant
      https://github.com/Shinichi-Nakagawa/baseball-savant-shohei-ohtani2022
    • Introduction to Sabermetrics with R
      https://gihyo.jp/book/2020/978-4-297-11684-2

  86. Done.
    Thank you for reading to the end.