
A Random Walk in Data Science and Machine Learn...

szilard
May 08, 2019

A Random Walk in Data Science and Machine Learning in Practice - Use Cases Seminar, MS Biz Analytics, CEU - Budapest, May 2019


Transcript

  1. A Random Walk in Data Science and Machine Learning in Practice. Szilard Pafka, PhD, Chief Scientist, Epoch (USA). CEU, Business Analytics Masters, Budapest, May 2019
  2. Disclaimer: I am not representing my employer (Epoch) in this talk. I cannot confirm nor deny if Epoch is using any of the methods, tools, results etc. mentioned in this talk.
  3. Best Practices for Using Machine Learning in Businesses in 2018. Szilárd Pafka, PhD, Chief Scientist, Epoch (USA). Budapest BI Forum Conference, November 2018
  4. Disclaimer: I am not representing my employer (Epoch) in this talk. I cannot confirm nor deny if Epoch is using any of the methods, tools, results etc. mentioned in this talk.
  5. *

  6. 10x

  7. ML training: lots of CPU cores, lots of RAM, limited time. ML scoring: separated servers.
  8. “people that know what they’re doing just use open source [...] the same open source tools that the MLaaS services offer” - Bradford Cross
  9. already pre-processed data; less domain knowledge (or deliberately hidden); AUC 0.0001 increases "relevant"; no business metric; no actual deployment; models too complex; no online evaluation; no monitoring; data leakage
  10. Aggregation: 100M rows, 1M groups. Join: 100M rows x 1M rows. [Two charts, axis: time [s]] “Motherfucka!”
  11. AI?

  12. Better than Deep Learning: Gradient Boosting Machines (GBM). Szilard Pafka, PhD, Chief Scientist, Epoch (USA). DataWorks Summit, Barcelona, Spain, March 2019
  13. Disclaimer: I am not representing my employer (Epoch) in this talk. I cannot confirm nor deny if Epoch is using any of the methods, tools, results etc. mentioned in this talk.
  14. ...

  15. I usually use other people’s code [...] I can find open source code for what I want to do, and my time is much better spent doing research and feature engineering -- Owen Zhang
  16. 10x

  17. 10x

  18. Machine Learning Software in Practice: Quo Vadis? Szilárd Pafka, PhD, Chief Scientist, Epoch. KDD Conference - Applied Data Science Track, Invited Talk, August 2017, Halifax, Canada
  19. Machine Learning Software in Practice: Quo Vadis? Szilárd Pafka, PhD, Chief Scientist, Epoch. KDD Conference - Applied Data Science Track, Invited Talk, August 2017, Halifax, Canada. SOME OF
  20. ML Tools Mismatch: - What practitioners wish for - What they truly need - What’s available - What’s advertised - What developers/researchers focus on
  21. Warning: This talk is a series of rants/observations with the aim to provoke/encourage thinking and constructive discussions about topics of impact on our industry.
  22. Warning: This talk is a series of rants/observations with the aim to provoke/encourage thinking and constructive discussions about topics of impact on our industry. Rantometer:
  23. I usually use other people’s code [...] I can find open source code for what I want to do, and my time is much better spent doing research and feature engineering -- Owen Zhang
  24. EC2

  25. n = 10K, 100K, 1M, 10M, 100M. Training time, RAM usage, AUC, CPU % by core. Read data, pre-process, score test data.
  26. n = 10K, 100K, 1M, 10M, 100M. Training time, RAM usage, AUC, CPU % by core. Read data, pre-process, score test data.
  27. 10x

  28. learn_rate = 0.1, max_depth = 6, n_trees = 300; learn_rate = 0.01, max_depth = 16, n_trees = 1000
  29. Wishlist: - more datasets (10-100, structure, size) - automation: upgrading tools, re-running ($$) - more algos, more tools (OS/commercial?) - (even) more tuning of parameters
  30. Wishlist: - more datasets (10-100, structure, size) - automation: upgrading tools, re-running ($$) - more algos, more tools (OS/commercial?) - (even) more tuning of parameters - BaaS? crowdsourcing (data, tools/tuning)? - other ML problems (recsys, NLP…)
  31. Supervised Learning: y = f(x). train: “learn” f from data X (n*p), y (n). score: f(x’). algos: k-NN, LR, NB, RF, GBM, SVM, NN, DL… goal: max accuracy measure (on new data). f ∈ F(θ): min_θ ( L(y, f(x,θ)) + R(θ) ) on train set; evaluate on separate test set / cross validation.
  32. Model selection: need to vary λ and get the model with best accuracy on the validation set. Evaluate the final model on the test set / cross validation.
  33. Disclaimer: I’m not affiliated with H2O.ai. It’s just that in my opinion H2O is a machine learning tool with several advantages. There are many other good tools (and many more awful ones).
  34. - high-performance implementation of best algos (RF, GBM, NN etc.) - R, Python etc. interfaces, easy to use API
  35. - high-performance implementation of best algos (RF, GBM, NN etc.) - R, Python etc. interfaces, easy to use API - open source - advisors: Hastie, Tibshirani
  36. - high-performance implementation of best algos (RF, GBM, NN etc.) - R, Python etc. interfaces, easy to use API - open source - advisors: Hastie, Tibshirani - Java, but C-style memalloc, by Java gurus - distributed, “big data”
  37. - high-performance implementation of best algos (RF, GBM, NN etc.) - R, Python etc. interfaces, easy to use API - open source - advisors: Hastie, Tibshirani - Java, but C-style memalloc, by Java gurus - distributed, “big data” - many knobs/tuning, model evaluation, cross validation, model selection (hyperparameter search)
  38. - high-performance implementation of best algos (RF, GBM, NN etc.) - R, Python etc. interfaces, easy to use API - open source - advisors: Hastie, Tibshirani - Java, but C-style memalloc, by Java gurus - distributed, “big data” - many knobs/tuning, model evaluation, cross validation, model selection (hyperparameter search)
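
The recipe on slides 31 and 32 (fit f by minimizing loss plus penalty on the train set, vary λ, pick the best model on a validation set, report on a test set) can be sketched in plain Python. This is an illustration only, not code from the talk. Everything specific here is an assumption: the synthetic data, logistic regression standing in for f, an L2 penalty λ·‖w‖² standing in for R(θ), AUC as the accuracy measure, the λ grid, and all function names.

```python
import math
import random


def make_data(n, rng):
    """Synthetic binary data (assumption): y depends on x[0]; x[1] is noise."""
    X, y = [], []
    for _ in range(n):
        x = [rng.gauss(0, 1), rng.gauss(0, 1)]
        p = 1 / (1 + math.exp(-2.0 * x[0]))
        X.append(x)
        y.append(1 if rng.random() < p else 0)
    return X, y


def predict(w, b, x):
    """f(x, theta): logistic regression score with theta = (w, b)."""
    z = b + sum(wj * xj for wj, xj in zip(w, x))
    return 1 / (1 + math.exp(-z))


def fit(X, y, lam, steps=200, lr=0.2):
    """Gradient descent on mean log-loss L plus penalty R = lam * ||w||^2."""
    n, p = len(X), len(X[0])
    w, b = [0.0] * p, 0.0
    for _ in range(steps):
        gw, gb = [0.0] * p, 0.0
        for x, yi in zip(X, y):
            err = predict(w, b, x) - yi  # gradient of log-loss w.r.t. z
            gb += err
            for j in range(p):
                gw[j] += err * x[j]
        # loss gradient (averaged) plus penalty gradient 2 * lam * w
        w = [wj - lr * (gj / n + 2 * lam * wj) for wj, gj in zip(w, gw)]
        b -= lr * gb / n
    return w, b


def auc(y_true, scores):
    """Chance that a random positive outscores a random negative (ties = 1/2)."""
    pos = [s for s, t in zip(scores, y_true) if t == 1]
    neg = [s for s, t in zip(scores, y_true) if t == 0]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


rng = random.Random(0)
X, y = make_data(600, rng)
X_train, y_train = X[:300], y[:300]
X_valid, y_valid = X[300:450], y[300:450]
X_test, y_test = X[450:], y[450:]

# Model selection: vary lambda, keep the model with the best validation AUC.
grid = [0.0, 0.01, 0.1, 1.0]
valid_auc = {}
for lam in grid:
    w, b = fit(X_train, y_train, lam)
    valid_auc[lam] = auc(y_valid, [predict(w, b, x) for x in X_valid])
best_lam = max(valid_auc, key=valid_auc.get)

# Final evaluation: the test set is touched only once, after selection.
w, b = fit(X_train, y_train, best_lam)
test_auc = auc(y_test, [predict(w, b, x) for x in X_test])
print("validation AUC by lambda:", valid_auc)
print("chosen lambda:", best_lam, "test AUC:", round(test_auc, 3))
```

The three-way split is the point of the sketch: the validation set is used up by choosing λ, so only the untouched test set gives an honest estimate for new data. In practice one would use a library implementation rather than a hand-written loop.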