Distributed Machine Learning with Dask Distributed (Dask Distributedによる分散機械学習)
Sinhrks
June 28, 2017
@PyData Tokyo #13 Lightning Talk
https://pydatatokyo.connpass.com/event/58954/
Transcript
Distributed Machine Learning with Dask Distributed
Masaaki Horikoshi @ ARISE analytics
About me
• OSS activities:
• GitHub: https://github.com/sinhrks
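What is Dask?
• A flexible framework for parallel and out-of-core processing
• Provides data structures compatible with (a subset of) NumPy and pandas
• Tasks are represented as a dynamic computation graph and executed in parallel by a scheduler
• Packages that use Dask (partial list): Airflow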
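
The idea of representing tasks as a graph and letting a scheduler run the independent ones in parallel can be sketched in plain Python. This is only a toy illustration, not Dask's scheduler: the graph layout, the key names, and the `execute` helper are all assumptions made for this example.

```python
from concurrent.futures import ThreadPoolExecutor
from operator import add, mul

# Hypothetical toy graph: each key maps to a literal value or to a
# (function, *argument_keys) tuple. Illustrative only, not Dask's scheduler.
graph = {
    "x": 1,
    "y": 2,
    "sum": (add, "x", "y"),        # depends on x and y
    "prod": (mul, "x", "y"),       # independent of "sum", can run alongside it
    "total": (add, "sum", "prod"), # depends on both intermediate results
}


def execute(graph, pool):
    """Run each task as soon as all of its inputs are available."""
    results = {k: v for k, v in graph.items() if not isinstance(v, tuple)}
    pending = {k: v for k, v in graph.items() if isinstance(v, tuple)}
    while pending:
        # A task is ready when every argument key has a computed result.
        ready = [k for k, (_, *deps) in pending.items()
                 if all(d in results for d in deps)]
        if not ready:
            raise ValueError("cycle or missing key in graph")
        # Submit all ready tasks together so they can run in parallel.
        futures = {
            k: pool.submit(pending[k][0], *(results[d] for d in pending[k][1:]))
            for k in ready
        }
        for k, fut in futures.items():
            results[k] = fut.result()
            del pending[k]
    return results


with ThreadPoolExecutor(max_workers=2) as pool:
    results = execute(graph, pool)

print(results["total"])
```

Here "sum" and "prod" have no dependency on each other, so they are submitted in the same round, while "total" waits for both.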
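Dask DataFrame
• Composed of multiple pandas DataFrames
• Processing is parallelized over each of the row-wise (vertically split) DataFrames
[Figure: a Dask DataFrame split into pandas DataFrame partitions, with divisions at the partition boundaries]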
# Create a pandas DataFrame with 10 rows and 3 columns
import numpy as np
import pandas as pd
df = pd.DataFrame({'X': np.arange(10), 'Y': np.arange(10, 20), 'Z': np.arange(20, 30)}, index=list('abcdefghij'))
df
# Dask DataFrame
import dask.dataframe as dd
ddf = dd.from_pandas(df, npartitions=2)  # split into 2 partitions
ddf
[Figure: the resulting Dask DataFrame, with 2 partitions and 3 divisions]
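
To make the relationship between partitions and divisions concrete, here is a plain-Python sketch that uses the same row labels as the slide. It is an assumption-laden illustration of the concept, not Dask's code: the `partition_of` helper and the equal-size splitting rule are hypothetical.

```python
from bisect import bisect_right

index = list("abcdefghij")   # same row labels as the slide's DataFrame
npartitions = 2

# Split the sorted index into contiguous, roughly equal partitions.
size = -(-len(index) // npartitions)   # ceiling division
partitions = [index[i:i + size] for i in range(0, len(index), size)]

# Divisions: the first label of each partition plus the last label overall,
# so n partitions give n + 1 division values.
divisions = [p[0] for p in partitions] + [partitions[-1][-1]]


def partition_of(label):
    """Return the position of the partition that holds `label` (hypothetical helper)."""
    if not divisions[0] <= label <= divisions[-1]:
        raise KeyError(label)
    # The last division is inclusive, so clamp to the final partition.
    return min(bisect_right(divisions, label) - 1, len(partitions) - 1)


print(partitions)
print(divisions)
print(partition_of("c"), partition_of("f"), partition_of("j"))
```

Because the index is sorted and the divisions are known, a lookup by label only needs to touch one partition.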
Blocked Algorithm (sum)
ddf.sum().compute()
[Figure: Sum is computed for each partition, the partial results are concatenated (Concat), and a final Sum gives the overall total]
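
The three steps in the figure (sum per partition, concatenate, sum again) can be sketched without Dask. The following is a minimal stand-in that stores the slide's table as plain column lists; `split_rows` and `partial_sum` are hypothetical helpers, and a thread pool takes the place of a real scheduler.

```python
from concurrent.futures import ThreadPoolExecutor

# Same values as the slide's DataFrame, stored as plain column lists.
columns = {
    "X": list(range(10)),
    "Y": list(range(10, 20)),
    "Z": list(range(20, 30)),
}


def split_rows(columns, npartitions):
    """Split the table row-wise into contiguous partitions."""
    nrows = len(next(iter(columns.values())))
    size = -(-nrows // npartitions)  # ceiling division
    return [
        {name: values[start:start + size] for name, values in columns.items()}
        for start in range(0, nrows, size)
    ]


def partial_sum(partition):
    """Step 1: column sums for a single partition."""
    return {name: sum(values) for name, values in partition.items()}


partitions = split_rows(columns, 2)

# The per-partition sums are independent, so they can run in parallel.
with ThreadPoolExecutor() as pool:
    partials = list(pool.map(partial_sum, partitions))

# Step 2: concatenate the per-partition results, column by column.
concatenated = {name: [p[name] for p in partials] for name in columns}

# Step 3: sum the concatenated partial results to get the overall total.
total = {name: sum(values) for name, values in concatenated.items()}
print(total)
```

Only the small per-partition results are combined at the end, which is what lets the full dataset stay split across partitions.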
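Dask Distributed
• Computation run by the scheduler can be distributed across multiple nodes
• Low latency: per-task overhead is about 1 ms
• Data sharing between Workers: data is transferred directly between Workers
• Complex scheduling: arbitrary computation graphs can be executed
• Locality: data transfer between Workers is avoided where possible
[Figure: a Distributed Client, a Distributed Scheduler, and two Distributed Workers]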
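Parallel processing in Scikit-Learn
• Parallel execution is specified with the "n_jobs" argument
• Uses joblib internally
• Developed mainly by Scikit-Learn committers
• Single-node parallelism (threading, multiprocessing)
from sklearn.model_selection import GridSearchCV
# pipe and param_grid are defined elsewhere (not shown on the slide)
grid = GridSearchCV(pipe, cv=3, n_jobs=12, param_grid=param_grid)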
Distributed joblib
• Pluggable API (0.10.0 and later)
• The default backend of joblib.Parallel can be changed inside a with block
• Caveats
• Use the joblib bundled with scikit-learn (sklearn.externals.joblib)
• Some cases cannot be distributed
• Code that explicitly specifies threading / multiprocessing as the backend
import distributed.joblib
from sklearn.externals.joblib import parallel_backend
with parallel_backend('dask.distributed', scheduler_host='scheduler-addr:8786'):
    grid.fit(digits.data, digits.target)
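dask-searchcv
• Scikit-Learn's hyperparameter search made Dask-compatible:
• Supports GridSearchCV and RandomizedSearchCV
• The API is the same as Scikit-Learn's
• Dask Array and DataFrame can be passed as input
• Avoids repeatedly fitting estimators on the same data with the same parameters
• Useful for Pipeline processing
Note: part of the package previously published as dklearn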
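
Why avoiding repeated fits matters for a Pipeline can be shown with a small counting experiment. This sketch is not dask-searchcv's implementation: the two stage functions, the parameter names `factor` and `power`, and the cache are hypothetical stand-ins that only demonstrate the effect of reusing a first stage fitted on the same data with the same parameters.

```python
from itertools import product

fit_calls = {"scale": 0, "model": 0}


def fit_scale(data, factor):
    """Hypothetical first pipeline stage: rescale the data."""
    fit_calls["scale"] += 1
    return tuple(x * factor for x in data)


def fit_model(scaled, power):
    """Hypothetical second pipeline stage: a stand-in 'model score'."""
    fit_calls["model"] += 1
    return sum(x ** power for x in scaled)


data = (1, 2, 3)
param_grid = {"factor": [1, 2], "power": [1, 2, 3]}

scale_cache = {}   # keyed by (data, parameters) of the first stage
scores = {}
for factor, power in product(param_grid["factor"], param_grid["power"]):
    key = (data, factor)
    if key not in scale_cache:   # fit each distinct first stage only once
        scale_cache[key] = fit_scale(data, factor)
    scores[(factor, power)] = fit_model(scale_cache[key], power)

print(fit_calls)
```

The grid has six parameter combinations, but the first stage is fitted only twice because it depends on `factor` alone; a naive search would fit it six times.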