Distributed Machine Learning with Dask Distributed

Sinhrks
June 28, 2017
@PyData Tokyo #13 Lightning Talk
https://pydatatokyo.connpass.com/event/58954/

Transcript

  1. Dask DataFrame
     • Create a pandas DataFrame (10 rows, 3 columns):

         import numpy as np
         import pandas as pd

         df = pd.DataFrame({'X': np.arange(10),
                            'Y': np.arange(10, 20),
                            'Z': np.arange(20, 30)},
                           index=list('abcdefghij'))
         df

     • Convert it to a Dask DataFrame with two partitions:

         import dask.dataframe as dd

         ddf = dd.from_pandas(df, 2)
         ddf

     [Figure: a Dask DataFrame made of two partitions bounded by three divisions]
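The relationship between partitions and divisions on slide 1 can be sketched in plain Python. This is a minimal illustration, not Dask's actual partitioning code: `split_by_divisions` is a hypothetical helper, and equal-sized contiguous chunks over a sorted index are an assumption.

```python
import math


def split_by_divisions(index, npartitions):
    """Split a sorted index into contiguous chunks (hypothetical helper).

    Returns (divisions, partitions), where divisions holds the first label
    of every partition plus the last label of the final partition.
    """
    chunk = math.ceil(len(index) / npartitions)
    partitions = [index[i:i + chunk] for i in range(0, len(index), chunk)]
    divisions = tuple(p[0] for p in partitions) + (partitions[-1][-1],)
    return divisions, partitions


# Same index as the slide: ten labels split into two partitions.
divisions, partitions = split_by_divisions(list("abcdefghij"), 2)
print(divisions)
print(partitions)
```

With N partitions there are N + 1 division values, which matches the two partitions and three divisions shown in the figure.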
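  2. Dask Distributed
     • Lets the scheduler distribute computation across multiple nodes
     • Low latency: the overhead per task is about 1 ms
     • Data sharing between workers: data is transferred directly between workers
     • Complex scheduling: arbitrary computation graphs can be executed
     • Locality: data transfer between workers is avoided as much as possible

     [Figure: a Distributed Client, a Distributed Scheduler, and two Distributed Workers]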
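To make "arbitrary computation graphs" on slide 2 concrete, here is a minimal single-process sketch of evaluating a graph written as a dict of `(function, *arguments)` tuples, the general shape Dask uses for task graphs. The evaluator `get` below is a hypothetical toy, not the distributed scheduler: it does no parallelism, networking, or locality-aware placement.

```python
from operator import add, mul


def get(graph, key, cache=None):
    """Recursively evaluate `key` in a dict-based task graph (toy evaluator)."""
    cache = {} if cache is None else cache
    if key in cache:
        return cache[key]
    task = graph[key]
    if isinstance(task, tuple) and callable(task[0]):
        func, *args = task
        # String arguments that name another key are dependencies; resolve them first.
        resolved = [
            get(graph, a, cache) if isinstance(a, str) and a in graph else a
            for a in args
        ]
        result = func(*resolved)
    else:
        result = task  # plain data stored directly in the graph
    cache[key] = result
    return result


graph = {
    "x": 1,
    "y": 2,
    "sum": (add, "x", "y"),
    "out": (mul, "sum", 10),
}
result = get(graph, "out")
print(result)
```

A real scheduler walks the same kind of dependency structure but assigns each task to a worker, preferring workers that already hold the inputs.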
  3. Parallel processing in Scikit-Learn
     • The "n_jobs" argument sets the number of parallel jobs
     • joblib is used internally
     • Developed mainly by Scikit-Learn committers
     • Parallelism within a single node (threading, multiprocessing)

         from sklearn.model_selection import GridSearchCV

         grid = GridSearchCV(pipe, cv=3, n_jobs=12, param_grid=param_grid)
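The idea behind "n_jobs" on slide 3 (evaluate independent parameter candidates concurrently inside one node) can be sketched with the standard library alone. This is an assumption-laden illustration, not how Scikit-Learn or joblib are implemented: `score` is a hypothetical stand-in for fitting and cross-validating one candidate, and `grid_search` is a hypothetical helper.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import product


def score(params):
    # Hypothetical stand-in for "fit and cross-validate one candidate".
    return -(params["C"] - 1.0) ** 2 - (params["gamma"] - 0.1) ** 2


def grid_search(param_grid, n_jobs):
    """Score every combination in param_grid using n_jobs worker threads."""
    keys = sorted(param_grid)
    candidates = [
        dict(zip(keys, values))
        for values in product(*(param_grid[k] for k in keys))
    ]
    with ThreadPoolExecutor(max_workers=n_jobs) as pool:
        scores = list(pool.map(score, candidates))  # order matches candidates
    best = max(range(len(candidates)), key=scores.__getitem__)
    return candidates[best], scores[best]


best_params, best_score = grid_search(
    {"C": [0.1, 1.0, 10.0], "gamma": [0.01, 0.1]}, n_jobs=4
)
print(best_params, best_score)
```

Because every candidate is independent, the same loop can in principle be handed to workers on other machines, which is what the following slides address.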
  4. Distributed joblib
     • Pluggable API (0.10.0-)
     • A with block can change the default backend of joblib.Parallel
     • Caveats
       • Use the joblib bundled with scikit-learn (sklearn.externals.joblib)
       • Some computations cannot be distributed
       • Namely, those where threading / multiprocessing is explicitly specified as the backend

         import distributed.joblib
         from sklearn.externals.joblib import parallel_backend

         with parallel_backend('dask.distributed', scheduler_host='scheduler-addr:8786'):
             grid.fit(digits.data, digits.target)
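  5. dask-searchcv
     • Scikit-Learn's hyperparameter search made Dask-compatible:
     • Supports GridSearchCV and RandomizedSearchCV
     • The API is the same as Scikit-Learn's
     • Dask Arrays and DataFrames can be passed as input
     • Avoids repeatedly fitting the same estimator with the same parameters
     • Useful for Pipeline processing

     Note: part of a package previously published as dklearn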
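Why avoiding repeated fits matters for a Pipeline (slide 5) can be shown with a small counting sketch. It assumes a two-step pipeline and caches each step by its name, parameter, and upstream result. `fit_step` and the step names are hypothetical, and this is only an illustration of the idea, not dask-searchcv's graph-merging code.

```python
from itertools import product

fit_calls = {"scale": 0, "clf": 0}  # how many real fits each step performed
cache = {}


def fit_step(name, param, upstream):
    """Fit a step once per unique (name, param, upstream) combination."""
    key = (name, param, upstream)
    if key not in cache:
        fit_calls[name] += 1  # stands in for an expensive fit
        cache[key] = f"{name}({param})<-{upstream}"
    return cache[key]


# Grid: 2 settings for the first step x 3 settings for the second step.
for scale_param, clf_param in product([1, 2], [0.1, 1.0, 10.0]):
    scaled = fit_step("scale", scale_param, "data")
    fit_step("clf", clf_param, scaled)

print(fit_calls)
```

A naive search would fit the first step once per candidate (six times here). With caching it is fitted only twice, while the final step still runs for all six combinations.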