Distributed Machine Learning with Dask Distributed
Sinhrks
June 28, 2017
@PyData Tokyo #13 Lightning Talk
https://pydatatokyo.connpass.com/event/58954/
Transcript
Distributed Machine Learning with Dask Distributed
Masaaki Horikoshi @ ARISE analytics
About me
• OSS activities:
• GitHub: https://github.com/sinhrks
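What is Dask?
• A flexible framework for parallel and out-of-core processing
• Provides data structures compatible with (a subset of) NumPy and pandas
• Tasks are represented as a dynamic computation graph and executed in parallel by a scheduler
• Packages that use Dask (partial list): Airflow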
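Dask DataFrame
• Composed of multiple pandas DataFrames
• Processing is parallelized over each of the row-wise partitioned DataFrames
[Figure: a pandas DataFrame next to a Dask DataFrame, with partitions and divisions labeled]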
# Create a pandas DataFrame with 10 rows and 3 columns
import numpy as np
import pandas as pd
df = pd.DataFrame({'X': np.arange(10),
                   'Y': np.arange(10, 20),
                   'Z': np.arange(20, 30)},
                  index=list('abcdefghij'))
df
# Convert it to a Dask DataFrame with 2 partitions
import dask.dataframe as dd
ddf = dd.from_pandas(df, 2)
ddf
[Figure: the resulting Dask DataFrame, with 2 partitions and 3 divisions labeled]
Blocked Algorithm (sum)
ddf.sum().compute()
[Figure: Sum is computed for each partition, the partial results are joined with Concat, and a final Sum gives the overall total]
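The Sum, Concat, Sum flow in the figure can be reproduced by hand with plain pandas. This is only a sketch of the idea, not Dask's actual implementation; the manual two-way split and the names `partitions`, `partial_sums`, `combined` and `total` are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'X': np.arange(10),
                   'Y': np.arange(10, 20),
                   'Z': np.arange(20, 30)},
                  index=list('abcdefghij'))

# Split the frame row-wise into two "partitions" (done by hand here).
partitions = [df.iloc[:5], df.iloc[5:]]

# Step 1 (Sum): column sums of each partition, independently of the others.
partial_sums = [part.sum() for part in partitions]

# Step 2 (Concat): join the partial results into one Series.
combined = pd.concat(partial_sums)

# Step 3 (Sum): add up the partial results that belong to the same column.
total = combined.groupby(level=0).sum()
print(total)
```

Step 1 is the only part that touches the full data, and it needs no communication between partitions, so it is the part a scheduler can run in parallel.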
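Dask Distributed
• Computation run by the scheduler can be distributed across multiple nodes
• Low latency: the overhead per task is about 1 ms
• Data sharing between Workers: data is transferred directly between Workers
• Complex scheduling: arbitrary computation graphs can be executed
• Locality: data transfer between Workers is avoided as far as possible
[Figure: Distributed Client, Distributed Scheduler, and two Distributed Workers]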
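A minimal sketch of the Client, Scheduler and Worker roles follows. It assumes a throwaway in-process cluster (`processes=False`) rather than a real multi-node deployment, disables the dashboard, and uses a hypothetical `square` function as the task.

```python
from dask.distributed import Client


def square(x):
    """Toy task; stands in for any function to run on a worker."""
    return x ** 2


# Assumption: a local, in-process scheduler and worker instead of a real
# cluster. For a real cluster one would pass the scheduler address instead.
client = Client(processes=False, dashboard_address=None)

# Submit one task per input; each call returns a future immediately.
futures = client.map(square, range(10))

# Futures can be passed to further tasks, so the intermediate results
# stay on the workers instead of travelling back to the client.
total = client.submit(sum, futures)
result = total.result()
print(result)

client.close()
```

Passing `futures` straight into `client.submit` is what the locality bullet is about: only the final value is pulled back to the client.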
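Parallel processing in Scikit-Learn
• The "n_jobs" argument specifies the number of parallel jobs
• Uses joblib internally
• Developed mainly by Scikit-Learn committers
• Single-node parallelism (threading, multiprocessing)
# pipe and param_grid are assumed to be defined elsewhere
from sklearn.model_selection import GridSearchCV
grid = GridSearchCV(pipe, cv=3, n_jobs=12, param_grid=param_grid)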
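The snippet above leaves `pipe` and `param_grid` undefined. The sketch below fills them in so that the grid search runs end to end; the PCA plus logistic regression pipeline, the parameter values, and `n_jobs=2` are assumptions chosen for illustration and are not taken from the talk. The digits dataset is used because a later slide fits on `digits.data`.

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

digits = load_digits()

# Hypothetical pipeline and grid (assumptions, not the author's).
pipe = Pipeline([('pca', PCA()),
                 ('logistic', LogisticRegression(max_iter=1000))])
param_grid = {'pca__n_components': [20, 40],
              'logistic__C': [0.1, 1.0]}

# n_jobs controls how many candidate fits joblib runs at once on this machine.
grid = GridSearchCV(pipe, param_grid=param_grid, cv=3, n_jobs=2)
grid.fit(digits.data, digits.target)

print(grid.best_params_)
print(grid.best_score_)
```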
Distributed joblib
• Pluggable API (0.10.0-)
• The default backend of joblib.Parallel can be changed inside a with block
• Caveats
• Use the joblib bundled with scikit-learn (sklearn.externals.joblib)
• Some cases cannot be distributed
• Those where threading / multiprocessing is explicitly specified as the backend
# Importing distributed.joblib registers the 'dask.distributed' backend
import distributed.joblib
from sklearn.externals.joblib import parallel_backend
with parallel_backend('dask.distributed', scheduler_host='scheduler-addr:8786'):
    grid.fit(digits.data, digits.target)
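The code above reflects the 2017 API. In more recent releases, to the best of my knowledge, `sklearn.externals.joblib` is no longer available and the backend is selected with `joblib.parallel_backend('dask')` after creating a `Client`. The sketch below assumes that newer API, a local in-process cluster, and a hypothetical `square` task standing in for the model fits.

```python
import joblib
from dask.distributed import Client


def square(x):
    """Toy task; stands in for one model fit."""
    return x * x


# Assumption: a local in-process cluster; a real deployment would pass
# the scheduler address to Client instead.
client = Client(processes=False, dashboard_address=None)

# Inside the with block, joblib.Parallel sends its tasks to the Dask cluster
# instead of using local threads or processes.
with joblib.parallel_backend('dask'):
    results = joblib.Parallel()(
        joblib.delayed(square)(i) for i in range(8)
    )

print(results)
client.close()
```

Because scikit-learn calls `joblib.Parallel` internally, wrapping `grid.fit(...)` in the same with block should have the same effect, subject to the caveats listed on the slide.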
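dask-searchcv
• A Dask-compatible version of Scikit-Learn's hyperparameter search:
• Supports GridSearchCV and RandomizedSearchCV
• The API is shared with Scikit-Learn
• Dask Arrays and DataFrames can be passed as input
• Avoids repeatedly fitting estimators that have the same data and the same parameters
• Useful for Pipeline processing
Note: this is part of a package previously released as dklearn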
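Since the API is shared with Scikit-Learn, switching implementations is mostly a matter of changing the import. The sketch below assumes the dask-searchcv functionality as it later appeared in the dask-ml package (`dask_ml.model_selection`), and falls back to scikit-learn's own class when dask-ml is not installed. The pipeline and parameter grid are hypothetical.

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Assumption: dask-ml provides the Dask-aware search classes. The fallback
# works because both classes expose the same interface.
try:
    from dask_ml.model_selection import GridSearchCV
except ImportError:
    from sklearn.model_selection import GridSearchCV

digits = load_digits()

# Hypothetical pipeline and grid, for illustration only.
pipe = Pipeline([('pca', PCA()),
                 ('logistic', LogisticRegression(max_iter=1000))])
param_grid = {'pca__n_components': [20, 40],
              'logistic__C': [0.1, 1.0]}

search = GridSearchCV(pipe, param_grid=param_grid, cv=3)
search.fit(digits.data, digits.target)
print(search.best_params_)
```

With this grid, a plain grid search refits PCA for every value of `logistic__C`, although the PCA step only depends on `pca__n_components` and the fold. Avoiding that repeated work is the benefit the slide describes for Pipeline processing.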