Distributed Machine Learning - Challenges & Opportunities

Talk presented at Fifth Elephant 2017.

Anand Chitipothu

July 27, 2017

Transcript

  1. Who is Speaking? Anand Chitipothu @anandology
     • Building a data science platform at @rorodata
     • Advanced programming courses at @pipalacademy
     • Worked at Strand Life Sciences and Internet Archive
  2. Motivation
     • Training ML models often takes a long time
     • The distributed approach is very scalable and effective
     • The existing tools for distributed training are not simple to use
  3. Machine Learning - Traditional Workflow
     Typical workflow for building an ML model:
     • Data Preparation
     • Feature Extraction
     • Model Training
     • Hyperparameter optimization / Grid Search
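     A minimal sketch of this single-machine workflow with scikit-learn (illustrative only; the synthetic dataset and the choice of model below are assumptions, not from the talk):

         from sklearn.datasets import make_regression
         from sklearn.ensemble import GradientBoostingRegressor
         from sklearn.model_selection import train_test_split
         from sklearn.pipeline import make_pipeline
         from sklearn.preprocessing import StandardScaler

         # Data preparation: a synthetic dataset stands in for real data.
         X, y = make_regression(n_samples=1000, n_features=20, noise=0.1, random_state=0)
         X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

         # Feature extraction (here just scaling) and model training;
         # hyperparameter optimization is the subject of the next slides.
         model = make_pipeline(StandardScaler(), GradientBoostingRegressor())
         model.fit(X_train, y_train)
         print(model.score(X_test, y_test))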
  4. Opportunities
     Grid search is one of the most time-consuming parts and has the potential to be parallelized.
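     To see why, count the fits: the grid is the Cartesian product of the parameter values, and every combination is trained independently of the others, so the work parallelizes naturally. (The parameter values below are an assumed example, not from the talk.)

         from sklearn.model_selection import ParameterGrid

         parameters = {
             "n_estimators": [100, 200, 400, 800],
             "max_depth": [2, 4, 6],
             "learning_rate": [0.01, 0.1],
         }

         # 4 * 3 * 2 = 24 candidate models; with 5-fold cross-validation
         # that is 120 independent fits, none of which depend on each other.
         print(len(ParameterGrid(parameters)) * 5)   # 120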
  5. Data Parallelism - Examples
     • GPU computation
     • OpenMP, MPI
     • Spark ML algorithms
     • Map-Reduce
  6. How to Parallelize Grid Search?
     The scikit-learn library for Python has an out-of-the-box solution to parallelize this:
     grid_search = GridSearchCV(model, parameters, n_jobs=4)
     But it is limited to a single computer! Can we run this on multiple computers?
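     Filled out with the model and parameter grid assumed earlier, the single-machine version looks roughly like this; n_jobs=4 fans the independent fits out over four local processes (X_train and y_train come from the workflow sketch above):

         from sklearn.ensemble import GradientBoostingRegressor
         from sklearn.model_selection import GridSearchCV

         parameters = {
             "n_estimators": [100, 200, 400, 800],
             "max_depth": [2, 4, 6],
             "learning_rate": [0.01, 0.1],
         }

         # n_jobs=4 runs up to 4 fits in parallel, but only on this machine.
         grid_search = GridSearchCV(GradientBoostingRegressor(), parameters, n_jobs=4, cv=5)
         grid_search.fit(X_train, y_train)
         print(grid_search.best_params_)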
  7. Challenges
     • Requires setting up and managing a cluster of computers
     • A non-trivial task for a data scientist to manage
     • How to start on demand and shut down when unused
     Is it possible to have a simple interface that a data scientist can manage on his/her own?
  8. Compute Platform
     We've built a compute platform for running jobs in the cloud.
     $ run-job python model_training.py
     created new job 9845a3bd4.
  9. Behind the Scenes
     • Picks an available instance in the cloud (or starts a new one)
     • Runs a docker container with the appropriate image
     • Exposes the required ports and sets up a URL endpoint to access it
     • Manages a shared disk across all the jobs
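     A rough sketch of what launching one such job on an instance might involve, written with the Docker SDK for Python; this is a guess at the mechanics rather than the platform's actual code, and the image name, port, and paths are made up:

         import docker

         client = docker.from_env()

         # Run the job in a container built from an image with the project's
         # dependencies, mount the shared disk, and expose a port so a URL
         # endpoint can be pointed at it.
         container = client.containers.run(
             "rorodata/ml-runtime",          # hypothetical image name
             "python model_training.py",
             detach=True,
             ports={"8080/tcp": 8080},       # exposed port behind the URL endpoint
             volumes={"/mnt/shared": {"bind": "/data", "mode": "rw"}},  # shared disk
         )
         print(container.id)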
  10. The Magic
      Running on a 16-core instance is just a flag away.
      $ run-job -i C16 python model_training.py
      created new job 8f40f02f.
  12. Distributed Machine Learning
      We've implemented a multiprocessing.Pool-like interface that runs on top of our compute platform.
      pool = DistributedPool(n=5)
      results = pool.map(square, range(100))
      pool.close()
      Starts 5 distributed jobs to share the work.
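      The interface mirrors the standard library's multiprocessing.Pool, which looks like this on a single machine (only the stdlib version is shown; DistributedPool itself is the platform's own class):

          from multiprocessing import Pool

          def square(x):
              return x * x

          if __name__ == "__main__":
              # Same map/close pattern, but here the workers are local processes
              # rather than jobs running on separate cloud instances.
              with Pool(processes=5) as pool:
                  results = pool.map(square, range(100))
              print(results[:5])   # [0, 1, 4, 9, 16]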
  13. Scikit-learn Integration
      Extended the distributed interface to support scikit-learn.
      from distributed_scikit import GridSearchCV
      grid_search = GridSearchCV(GradientBoostingRegressor(), parameters, n_jobs=16)
      A distributed pool with n_jobs workers will be created to distribute the tasks.
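      Assuming the class keeps scikit-learn's fit/best_params_ API (the talk only shows its construction), the rest of the usage would be unchanged, just backed by 16 platform jobs instead of local processes:

          # Assumed usage, mirroring scikit-learn's GridSearchCV interface.
          grid_search.fit(X_train, y_train)   # each candidate fit runs as a platform job
          print(grid_search.best_params_)
          print(grid_search.best_score_)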
  14. Advantages
      • Simplicity
      • No manual setup required
      • Works from the familiar notebook interface
      • Option to run on spot instances (without any additional setup)
  15. Summary
      • With ever-increasing datasets, distributed training will be more effective than single-node approaches
      • Abstracting away the complexity of distributed learning can improve time-to-market