Distributed Machine Learning - Challenges & Opportunities

Talk presented at Fifth Elephant 2017.

Anand Chitipothu

July 27, 2017

Transcript

  1. Who is Speaking?
     Anand Chitipothu @anandology
     • Building a data science platform at @rorodata
     • Advanced programming courses at @pipalacademy
     • Worked at Strand Life Sciences and Internet Archive
  2. Motivation
     • Training ML models often takes a long time
     • A distributed approach is scalable and effective
     • The existing tools for distributed training are not simple to use
  3. Machine Learning - Traditional Workflow
     Typical workflow for building an ML model:
     • Data Preparation
     • Feature Extraction
     • Model Training
     • Hyperparameter Optimization / Grid Search
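
     A minimal sketch of this workflow in scikit-learn, for concreteness;
     the dataset and parameter grid below are illustrative assumptions,
     not details from the talk:

         # Sketch of the traditional workflow: prepare data, scale
         # features, train a model, and tune hyperparameters via grid search.
         from sklearn.datasets import load_diabetes
         from sklearn.model_selection import train_test_split, GridSearchCV
         from sklearn.preprocessing import StandardScaler
         from sklearn.pipeline import make_pipeline
         from sklearn.ensemble import GradientBoostingRegressor

         # Data preparation
         X, y = load_diabetes(return_X_y=True)
         X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

         # Feature scaling and model training combined in one pipeline
         pipeline = make_pipeline(StandardScaler(), GradientBoostingRegressor())

         # Hyperparameter optimization / grid search
         parameters = {"gradientboostingregressor__n_estimators": [50, 100, 200]}
         grid_search = GridSearchCV(pipeline, parameters)
         grid_search.fit(X_train, y_train)
         print(grid_search.best_params_, grid_search.score(X_test, y_test))
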
  4. Opportunities
     Grid search is one of the most time-consuming steps and has the
     potential to be parallelized: each parameter combination can be
     evaluated independently.
  5. Data Parallelism - Examples
     • GPU computation
     • OpenMP, MPI
     • Spark ML algorithms
     • Map-Reduce
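
     To make the data-parallel idea concrete, a minimal map-reduce style
     sketch in plain Python; the chunking scheme and worker count are
     assumptions for illustration:

         # Data parallelism in miniature: split the input across workers,
         # compute partial results in parallel, then combine them.
         from concurrent.futures import ProcessPoolExecutor

         def partial_sum(chunk):
             # Each worker handles its share independently (the "map" step).
             return sum(x * x for x in chunk)

         if __name__ == "__main__":
             data = list(range(1000000))
             chunks = [data[i::4] for i in range(4)]  # split across 4 workers
             with ProcessPoolExecutor(max_workers=4) as executor:
                 total = sum(executor.map(partial_sum, chunks))  # "reduce" step
             print(total)
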
  6. How to Parallelize Grid Search?
     Python's scikit-learn library has an out-of-the-box solution to
     parallelize this:

     grid_search = GridSearchCV(model, parameters, n_jobs=4)

     But it is limited to one computer! Can we run it on multiple
     computers?
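
     The n_jobs=4 argument tells scikit-learn to evaluate candidate
     parameter settings in up to four local worker processes. A runnable
     version of the call, with an assumed model and parameter grid:

         # Grid search parallelized across 4 local processes via n_jobs.
         from sklearn.datasets import load_digits
         from sklearn.model_selection import GridSearchCV
         from sklearn.svm import SVC

         X, y = load_digits(return_X_y=True)
         model = SVC()
         parameters = {"C": [0.1, 1, 10], "gamma": [0.001, 0.01]}

         # Candidate settings are independent, so they run concurrently.
         grid_search = GridSearchCV(model, parameters, n_jobs=4)
         grid_search.fit(X, y)
         print(grid_search.best_params_)
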
  7. Challenges
     • Requires setting up and managing a cluster of computers
     • A non-trivial task for a data scientist
     • How to start it on demand and shut it down when unused?
     Is it possible to have a simple interface that a data scientist can
     manage on his/her own?
  8. Compute Platform
     We've built a compute platform for running jobs in the cloud.

     $ run-job python model_training.py
     created new job 9845a3bd4.
  9. Behind the Scenes
     • Picks an available instance in the cloud (or starts a new one)
     • Runs a Docker container with an appropriate image
     • Exposes the required ports and sets up a URL endpoint to access it
     • Manages a shared disk across all the jobs
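
     A rough sketch of what such a platform might do per job, using the
     docker SDK for Python; the image name, port, and shared-disk path
     are hypothetical assumptions, not details from the talk:

         # Hypothetical sketch: launch one job as a Docker container,
         # expose a port, and mount the shared disk (pip install docker).
         import docker

         client = docker.from_env()
         container = client.containers.run(
             "rorodata/ml-runtime",        # appropriate image (assumed name)
             "python model_training.py",   # the job's command
             ports={"8080/tcp": 8080},     # expose the required port
             volumes={"/mnt/shared": {"bind": "/data", "mode": "rw"}},
             detach=True,
         )
         print("started job in container", container.short_id)
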
  10. The Magic
      Running on a 16-core instance is just a flag away.

      $ run-job -i C16 python model_training.py
      created new job 8f40f02f.
  11. (image-only slide)

  12. Distributed Machine Learning
      We've implemented a multiprocessing.Pool-like interface that runs
      on top of our compute platform:

      pool = DistributedPool(n=5)
      results = pool.map(square, range(100))
      pool.close()

      This starts 5 distributed jobs to share the work.
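
      For comparison, the same example with the standard library's
      multiprocessing.Pool, which the interface mirrors; square is
      defined here since the slide assumes it:

          # Local equivalent of the DistributedPool example, using the
          # standard-library Pool the interface is modeled on.
          from multiprocessing import Pool

          def square(x):
              return x * x

          if __name__ == "__main__":
              pool = Pool(5)                          # 5 local worker processes
              results = pool.map(square, range(100))  # same map-style API
              pool.close()
              pool.join()
              print(results[:5])
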
  13. Scikit-learn Integration
      Extended the distributed interface to support scikit-learn:

      from distributed_scikit import GridSearchCV

      grid_search = GridSearchCV(
          GradientBoostingRegressor(),
          parameters, n_jobs=16)

      A distributed pool of n_jobs workers is created to distribute the
      tasks.
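
      The slide leaves parameters undefined; a plausible grid for
      GradientBoostingRegressor might look like the following (the
      values are illustrative assumptions, not from the talk):

          # Illustrative parameter grid for the example above.
          parameters = {
              "n_estimators": [100, 200, 500],
              "max_depth": [3, 5, 7],
              "learning_rate": [0.01, 0.1],
          }
          # 3 * 3 * 2 = 18 candidate models, fanned out across 16 workers.
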
  14. Advantages
      • Simplicity
      • No manual setup required
      • Works from the familiar notebook interface
      • Option to run on spot instances (without any additional setup)
  15. Summary
      • With ever-increasing datasets, distributed training will be more
        effective than single-node approaches
      • Abstracting away the complexity of distributed learning can
        improve time-to-market