Upgrade to Pro
— share decks privately, control downloads, hide ads and more …
Speaker Deck
Features
Speaker Deck
PRO
Sign in
Sign up for free
Search
Search
Dask Distributedによる分散機械学習
Search
Sinhrks
June 28, 2017
4
1.5k
Dask Distributedによる分散機械学習
@PyData Tokyo #13 Lightning Talk
https://pydatatokyo.connpass.com/event/58954/
Sinhrks
June 28, 2017
Tweet
Share
More Decks by Sinhrks
See All by Sinhrks
daskperiment: Reproducibility for Humans
sinhrks
1
410
PythonとApache Arrow
sinhrks
6
1.9k
大規模データの機械学習におけるDaskの活用
sinhrks
10
3.2k
機械学習と解釈可能性
sinhrks
7
5.7k
LIME
sinhrks
2
1.4k
データ分析言語R 1年の振り返り
sinhrks
5
2.5k
pandasでのOSS活動事例と最初の一歩
sinhrks
2
19k
Data processing using pandas and Dask
sinhrks
1
270
pandasでのOSS活動事例
sinhrks
0
790
Featured
See All Featured
個人開発の失敗を避けるイケてる考え方 / tips for indie hackers
panda_program
118
20k
Optimising Largest Contentful Paint
csswizardry
37
3.5k
Building a Scalable Design System with Sketch
lauravandoore
463
34k
What’s in a name? Adding method to the madness
productmarketing
PRO
24
3.8k
Product Roadmaps are Hard
iamctodd
PRO
55
12k
For a Future-Friendly Web
brad_frost
180
10k
Creating an realtime collaboration tool: Agile Flush - .NET Oxford
marcduiker
34
2.3k
Fantastic passwords and where to find them - at NoRuKo
philnash
52
3.5k
Thoughts on Productivity
jonyablonski
73
4.9k
Into the Great Unknown - MozCon
thekraken
40
2.2k
How To Stay Up To Date on Web Technology
chriscoyier
791
250k
Rebuilding a faster, lazier Slack
samanthasiow
84
9.3k
Transcript
Dask DistributedʹΑΔ ࢄػցֶश Masaaki Horikoshi @ ARISE analytics
ࣗݾհ • OSS׆ಈ: • GitHub: https://github.com/sinhrks
Daskͱ • ॊೈͳฒྻɾOut of CoreॲཧϑϨʔϜϫʔΫ • NumPy, pandasޓ(αϒηοτ)ͷσʔλߏΛఏڙ • λεΫಈతͳܭࢉάϥϑͱͯ͠දݱ͞Εɺεέδϡʔ
ϥʹΑͬͯฒྻ࣮ߦ • DaskΛར༻͢Δύοέʔδ(Ұ෦): Airflow
Dask DataFrame • ෳͷpandas DataFramesʹΑΓߏ • ॎʹׂ͞ΕͨDataFrame͝ͱʹॲཧΛฒྻԽ QBOEBT%BUB'SBNF %BTL%BUB'SBNF QBSUJUJPO
EJWJTJPO EJWJTJPO
import pandas as pd df = pd.DataFrame({'X': np.arange(10), 'Y': np.arange(10,
20), 'Z': np.arange(20, 30)}, index=list('abcdefghij')) df import dask.dataframe as dd ddf = dd.from_pandas(df, 2) ddf ߦྻͷ QBOEBT%BUB'SBNFΛ࡞ Dask DataFrame QBSUJUJPO QBSUJUJPO EJWJTJPO EJWJTJPO EJWJTJPO
Blocked Algorithm (߹ܭ) ddf.sum().compute() 4VN 4VN $PODBU 4VN ߹ܭ શମ
࿈݁ ߹ܭ QBSUJUJPO͝ͱ
Dask Distributed • εέδϡʔϥͰͷܭࢉ࣮ߦΛෳϊʔυͰࢄͰ͖Δ • ϨΠςϯγ: λεΫຖͷΦʔόʔϔου1msఔ • WorkerؒͰͷσʔλڞ༗: σʔλసૹWorkerؒͰ࣮ࢪ
• ෳࡶͳεέδϡʔϦϯά: ҙͷܭࢉάϥϑΛ࣮ߦՄ • ہॴੑ: WorkerؒͷσʔλసૹΛͳΔ͘ߦΘͳ͍ %JTUSJCVUFE 8PSLFS %JTUSJCVUFE 8PSLFS %JTUSJCVUFE 4DIFEVMFS %JTUSJCVUFE $MJFOU
Scikit-Learnͷฒྻॲཧ • “n_jobs” ҾͰฒྻ࣮ߦΛࢦఆ • ෦తʹjoblibΛར༻ • Scikit-Learnίϛολத৺ʹ։ൃ • ϊʔυฒྻ
(threading, multiprocessing) from sklearn.model_selection import GridSearchCV grid = GridSearchCV(pipe, cv=3, n_jobs=12, param_grid=param_grid)
Distributed joblib • ϓϥΨϒϧAPI (0.10.0-) • with ϒϩοΫͰ joblib.Parallel ͷطఆόοΫΤϯυΛมߋՄ
• ҙ • scikit-learnʹόϯυϧ͞Ε͍ͯΔjoblibΛ͏ (sklearn.externals.joblib) • ࢄͰ͖ͳ͍߹͋Δ • backendͱͯ͠threading / multiprocessing͕໌ࣔ͞Ε͍ͯΔͷ import distributed.joblib from sklearn.externals.joblib import parallel_backend with parallel_backend('dask.distributed', scheduler_host=‘scheduler-addr:8786’): grid.fit(digits.data, digits.target)
dask-searchcv • Scikit-LearnͷϋΠύʔύϥϝʔλαʔνΛ Dask ޓʹͨ͠ͷ: • GridSearchCVͱRandomizedSearchCVΛαϙʔτ • APIScikit-Learnͱڞ௨ •
Dask Array DataFrameΛೖྗͱͯͤ͠Δ • ಉҰɺಉύϥϝʔλͷֶशثͷ܁Γฦ࣮͠ߦΛආ͚Δ • PipelineॲཧͰ༗༻ ※աڈʹ dklearn ͱͯ͠ެ։͞Ε͍ͯͨύοέʔδͷҰ෦