
Practical Data Science. An Introduction to Supervised Machine Learning and Pattern Classification: The Big Picture @ NextGen Bioinformatics Michigan State

Practical Data Science. Slides of a 1-hour introductory talk about predictive modeling using machine learning, with a focus on supervised learning.

Sebastian Raschka

February 11, 2015

Transcript

  1. Practical Data Science An Introduction to Supervised Machine Learning and

    Pattern Classification: The Big Picture Michigan State University NextGen Bioinformatics Seminars - 2015 Sebastian Raschka Feb. 11, 2015
  2. A Little Bit About Myself ...
    PhD candidate in Dr. L. Kuhn’s Lab, developing software & methods for:
    - Protein-ligand docking
    - Large-scale drug/inhibitor discovery
    and some other machine learning side-projects …
  3. What is Machine Learning?
    “Field of study that gives computers the ability to learn without being
    explicitly programmed.” (Arthur Samuel, 1959)
    http://drewconway.com/zia/2013/3/26/the-data-science-venn-diagram
    [Image by Phillip Taylor, CC BY 2.0]
  4. Examples of Machine Learning
    Self-driving cars [photo by Steve Jurvetson, CC BY 2.0], photo search
    (http://googleresearch.blogspot.com/2014/11/a-picture-is-worth-thousand-coherent.html),
    recommendation systems [http://commons.wikimedia.org/wiki/File:Netflix_logo.svg,
    public domain], and many, many more ...
  5. Learning
    Supervised: labeled data, direct feedback, predict outcome/future.
    Unsupervised: no labels, no feedback, “find hidden structure.”
    Reinforcement: decision process, reward system, learn a series of actions.
  6. Today’s topic: Supervised Learning (vs. Unsupervised Learning)
    Unsupervised learning, Clustering: [DBSCAN on a toy dataset]
    Supervised learning, Classification: [SVM on 2 classes of the Wine dataset]
    Supervised learning, Regression: [Soccer Fantasy Score prediction]
  7. Nomenclature: instances (samples, observations), features (attributes,
    dimensions), classes (targets).
    IRIS dataset (https://archive.ics.uci.edu/ml/datasets/Iris):

          sepal_length  sepal_width  petal_length  petal_width  class
    1         5.1          3.5           1.4           0.2      setosa
    2         4.9          3.0           1.4           0.2      setosa
    …         …            …             …             …        …
    50        6.4          3.2           4.5           1.5      versicolor
    …         …            …             …             …        …
    150       5.9          3.0           5.1           1.8      virginica
  8. Supervised Learning workflow (Sebastian Raschka 2014; this work is licensed
    under a Creative Commons Attribution 4.0 International License):
    Raw Data Collection → Pre-Processing (Feature Extraction, Feature Selection,
    Dimensionality Reduction, Feature Scaling, Missing Data) → Sampling → Split
    into Training Dataset and Test Dataset → Learning Algorithm → Training (with
    Cross Validation, Performance Metrics, Model Selection, Hyperparameter
    Optimization) → Final Model Evaluation → Final Classification/Regression
    Model → New Data → Pre-Processing → Prediction → Post-Processing / Refinement
  9. [Same supervised learning workflow diagram as slide 8.]
  10. A Few Common Classifiers
    Decision Tree, Perceptron, Naive Bayes, Ensemble Methods (Random Forest,
    Bagging, AdaBoost), Support Vector Machine, K-Nearest Neighbor, Logistic
    Regression, Artificial Neural Network / Deep Learning
  11. Generative Algorithms:
    • Model a more general problem: how the data was generated, i.e., the
      distribution of the class; the joint probability distribution p(x, y).
    • Examples: Naive Bayes, Bayesian Belief Network classifier, Restricted
      Boltzmann Machine, …
    Discriminative Algorithms:
    • Map x → y directly; e.g., distinguish between people speaking different
      languages without learning the languages.
    • Examples: Logistic Regression, SVM, Neural Networks, …
  12. Examples of Discriminative Classifiers: Perceptron
    Net input: ŷ = wᵀx = w0 + w1x1 + w2x2, with output
    ŷi = 1 if wᵀxi ≥ θ, −1 otherwise (here θ = 0), y ∈ {−1, 1}.
    Update rule: wj(t+1) = wj(t) + η(yi − ŷi)xij, repeated until
    t+1 = max iterations or error = 0.
    Notation: wj = weight, xi = training sample, yi = desired output,
    ŷi = actual output, t = iteration step, η = learning rate, θ = threshold.
    F. Rosenblatt. The perceptron, a perceiving and recognizing automaton.
    Project Para, Cornell Aeronautical Laboratory, 1957.
  13. Discriminative Classifiers: Perceptron
    - Binary classifier (one vs. all, OVA)
    - Convergence problems (set n iterations)
    - Modification: stochastic gradient descent
    - “Modern” perceptron: Support Vector Machine (maximize margin)
    - Multilayer perceptron (MLP)
    F. Rosenblatt. The perceptron, a perceiving and recognizing automaton.
    Project Para, Cornell Aeronautical Laboratory, 1957.
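The perceptron update rule from slides 12–13 can be sketched in a few lines of plain Python. This is a minimal illustration, not the talk's own code; the toy dataset and the early-stopping check are assumptions (convergence is only guaranteed for linearly separable data, as slide 13 warns).

```python
def perceptron_train(X, y, eta=0.1, max_iter=10):
    """Learn weights w (w[0] is the bias/threshold term) with the slide's rule:
    w_j(t+1) = w_j(t) + eta * (y_i - y_hat_i) * x_ij."""
    w = [0.0] * (len(X[0]) + 1)
    for _ in range(max_iter):
        errors = 0
        for xi, target in zip(X, y):
            net = w[0] + sum(wj * xj for wj, xj in zip(w[1:], xi))
            y_hat = 1 if net >= 0 else -1          # threshold theta = 0
            update = eta * (target - y_hat)
            if update != 0:
                errors += 1
                w[0] += update
                for j, xj in enumerate(xi):
                    w[j + 1] += update * xj
        if errors == 0:   # stop once the training set is classified correctly
            break
    return w

def perceptron_predict(w, xi):
    net = w[0] + sum(wj * xj for wj, xj in zip(w[1:], xi))
    return 1 if net >= 0 else -1

# Toy linearly separable data (an assumption made for convergence)
X = [[2.0, 1.0], [3.0, 4.0], [-1.0, -1.5], [-2.0, -1.0]]
y = [1, 1, -1, -1]
w = perceptron_train(X, y)
```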
  14. Generative Classifiers: Naive Bayes
    Bayes’ theorem: P(ωj | xi) = P(xi | ωj) P(ωj) / P(xi)
    Posterior probability = (Likelihood × Prior probability) / Evidence
    Iris example: P(“Setosa” | xi), xi = [4.5 cm, 7.4 cm]
  15. Generative Classifiers: Naive Bayes
    Bayes’ theorem: P(ωj | xi) = P(xi | ωj) P(ωj) / P(xi)
    Decision rule: predicted class label = argmax_j P(ωj | xi), i = 1, …, m;
    e.g., j ∈ {Setosa, Versicolor, Virginica}
  16. Generative Classifiers: Naive Bayes
    P(ωj | xi) = P(xi | ωj) P(ωj) / P(xi)
    Prior probability (class frequency): P(ωj) = Nωj / Nc
    Class-conditional probability (here a Gaussian kernel):
    P(xik | ωj) = 1/√(2π σωj²) · exp(−(xik − μωj)² / (2σωj²)),
    with P(xi | ωj) = ∏k P(xik | ωj)
    Evidence: P(xi) (cancels out in the decision rule)
  17. Generative Classifiers: Naive Bayes
    - The naive conditional independence assumption is typically violated
    - Works well for small datasets
    - The multinomial model is still quite popular for text classification
      (e.g., spam filters)
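The pieces from slides 14–16 (class-frequency priors, per-feature Gaussian class-conditionals, argmax over the unnormalized posterior, with the evidence cancelling out) fit together as a short Gaussian naive Bayes sketch. The toy data and the small variance floor are assumptions added for illustration.

```python
import math

def fit_gnb(X, y):
    """Per class: prior P(w_j) = N_wj / N, plus per-feature mean and variance."""
    params = {}
    for label in set(y):
        rows = [xi for xi, t in zip(X, y) if t == label]
        n = len(rows)
        means = [sum(col) / n for col in zip(*rows)]
        # sample variance; a tiny floor avoids division by zero
        variances = [max(sum((v - m) ** 2 for v in col) / (n - 1), 1e-9)
                     for col, m in zip(zip(*rows), means)]
        params[label] = (n / len(X), means, variances)
    return params

def predict_gnb(params, xi):
    """argmax_j log P(w_j) + sum_k log N(x_ik; mu_jk, var_jk); evidence omitted."""
    def log_posterior(label):
        prior, means, variances = params[label]
        lp = math.log(prior)
        for x, mu, var in zip(xi, means, variances):
            lp += -0.5 * math.log(2 * math.pi * var) - (x - mu) ** 2 / (2 * var)
        return lp
    return max(params, key=log_posterior)

# Toy two-class data standing in for two Iris species (an assumption)
X = [[1.0, 2.0], [1.2, 1.9], [0.8, 2.1], [5.0, 6.0], [5.2, 5.8], [4.9, 6.1]]
y = ["setosa", "setosa", "setosa", "virginica", "virginica", "virginica"]
params = fit_gnb(X, y)
```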
  18. Non-Parametric Classifiers: K-Nearest Neighbor (e.g., k=1, k=3)
    - Simple!
    - Lazy learner
    - Very susceptible to the curse of dimensionality
  19. Decision Tree (e.g., depth = 2 vs. depth = 4)
    petal length ≤ 2.45? Yes → Setosa; No → petal length ≤ 4.75?
    Yes → Versicolor; No → Virginica.
    Entropy = ∑i −pi logk(pi); e.g., 2 · (−0.5 log2(0.5)) = 1
    Information Gain = entropy(parent) − [avg entropy(children)]
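The entropy and information-gain formulas on the slide translate directly into code. This sketch uses base-2 logarithms and interprets "avg entropy(children)" as the usual size-weighted average (an assumption consistent with standard decision-tree practice):

```python
import math

def entropy(labels):
    """entropy = sum_i -p_i * log2(p_i) over the class proportions p_i."""
    n = len(labels)
    probs = [labels.count(c) / n for c in set(labels)]
    return -sum(p * math.log2(p) for p in probs)

def information_gain(parent, children):
    """gain = entropy(parent) - size-weighted average entropy of the children."""
    n = len(parent)
    weighted = sum(len(c) / n * entropy(c) for c in children)
    return entropy(parent) - weighted

# The slide's example: two equally likely classes give 2 * (-0.5 * log2(0.5)) = 1
mixed = ["setosa", "setosa", "virginica", "virginica"]
gain = information_gain(mixed, [["setosa", "setosa"], ["virginica", "virginica"]])
```

A perfect split of the 50/50 parent into two pure children removes all uncertainty, so the gain equals the parent entropy of 1 bit.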
  20. "No Free Lunch" :(
    Roughly speaking: “No one model works best for all possible situations.”
    Our model is a simplification of reality; the simplification is based on
    assumptions (model bias); assumptions fail in certain situations.
    D. H. Wolpert. The supervised learning no-free-lunch theorems. In Soft
    Computing and Industry, pages 25–42. Springer, 2002.
  21. Which Algorithm?
    • What is the size and dimensionality of my training set?
    • Is the data linearly separable?
    • How much do I care about computational efficiency?
      - Model building vs. real-time prediction time
      - Eager vs. lazy learning / online vs. batch learning
      - Prediction performance vs. speed
    • Do I care about interpretability, or should it "just work well"?
    • ...
  22. [Same supervised learning workflow diagram as slide 8, now focusing on
    pre-processing: missing data, sampling, and feature scaling.]
  23. Missing Values:
    - Remove features (columns) or samples (rows)
    - Imputation (mean, nearest neighbor, …)
    Sampling:
    - Random split into training and validation sets, typically 60/40, 70/30, or 80/20
    - Don’t use the validation set until the very end! (overfitting)
    Feature Scaling, e.g., standardization z = (xik − μk) / σk:
    - Faster convergence (gradient descent)
    - Distances on the same scale (k-NN with Euclidean distance)
    - Mean centering for free
    - Normally distributed data
    - Numerical stability by avoiding small weights
    (Use the same parameters for the test/new data!)
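The slide's standardization formula, including its warning to reuse the training parameters on test data, can be sketched like this (the toy column of numbers is an assumption):

```python
import math

def fit_standardizer(column):
    """Estimate mu and sigma on the *training* column only."""
    n = len(column)
    mu = sum(column) / n
    sigma = math.sqrt(sum((x - mu) ** 2 for x in column) / n)
    return mu, sigma

def standardize(column, mu, sigma):
    """z = (x - mu) / sigma, applied with previously fitted parameters."""
    return [(x - mu) / sigma for x in column]

train = [1.0, 2.0, 3.0, 4.0, 5.0]
mu, sigma = fit_standardizer(train)      # mu = 3.0
z_train = standardize(train, mu, sigma)  # mean centering for free: sums to 0
z_test = standardize([6.0], mu, sigma)   # same parameters as for training!
```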
  24. Categorical Variables

         color  size  prize  class label
    0    green  M     10.1   class1
    1    red    L     13.5   class2
    2    blue   XL    15.3   class1

    Size is ordinal: M → 1, L → 2, XL → 3.
    Color is nominal, so one-hot encode it:
    green → (1,0,0), red → (0,1,0), blue → (0,0,1):

         color=blue  color=green  color=red  size  prize  class label
    0    0           1            0          1     10.1   0
    1    0           0            1          2     13.5   1
    2    1           0            0          3     15.3   0
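Both encodings from the slide (ordinal integers for size, one-hot vectors for the nominal color) can be done with a few lines of plain Python; the dictionary-of-rows layout here is an illustrative assumption:

```python
# Ordinal mapping for sizes, as on the slide
SIZE_ORDER = {"M": 1, "L": 2, "XL": 3}

def one_hot(value, categories):
    """Nominal value -> 0/1 indicator vector over the known categories."""
    return [1 if value == c else 0 for c in categories]

rows = [
    {"color": "green", "size": "M",  "prize": 10.1},
    {"color": "red",   "size": "L",  "prize": 13.5},
    {"color": "blue",  "size": "XL", "prize": 15.3},
]
colors = ["blue", "green", "red"]   # column order: color=blue, color=green, color=red
encoded = [
    one_hot(r["color"], colors) + [SIZE_ORDER[r["size"]], r["prize"]]
    for r in rows
]
# encoded[0] -> [0, 1, 0, 1, 10.1]
```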
  25. [Same supervised learning workflow diagram as slide 8, now focusing on
    training: cross validation, performance metrics, and model selection.]
  26. Error Metrics: Confusion Matrix (TP, TN, FP, FN)
    [Linear SVM on sepal/petal lengths]; here: “setosa” = “positive”
  27. Error Metrics [Linear SVM on sepal/petal lengths]; here: “setosa” = “positive”
    Accuracy = (TP + TN) / (FP + FN + TP + TN) = 1 − Error
    False Positive Rate = FP / N
    True Positive Rate (Recall) = TP / P
    Precision = TP / (TP + FP)
    (“micro” and “macro” averaging for multi-class problems)
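The metrics on the slide all derive from the four confusion-matrix counts; a small sketch (the example counts are made up for illustration):

```python
def metrics(tp, tn, fp, fn):
    """Error metrics from confusion-matrix counts.
    P = actual positives (tp + fn), N = actual negatives (fp + tn)."""
    p, n = tp + fn, fp + tn
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    return {
        "accuracy": accuracy,
        "error": 1 - accuracy,
        "fpr": fp / n,              # false positive rate
        "tpr": tp / p,              # true positive rate (recall)
        "precision": tp / (tp + fp),
    }

# Hypothetical counts: 50 actual positives, 50 actual negatives
m = metrics(tp=40, tn=45, fp=5, fn=10)
```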
  28. Model Selection: k-fold cross-validation (k=4)
    Split the complete dataset into a training dataset and a test dataset, and
    split the training dataset into folds 1–4. In the 1st through 4th
    iterations, each fold in turn serves as the test set while the remaining
    folds are used for training; calculate the error in each iteration, then
    calculate the average error.
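The fold rotation described above can be sketched as index bookkeeping; `error_fn` is a hypothetical callback standing in for "train a model and measure its error":

```python
def k_fold_indices(n, k):
    """Partition indices 0..n-1 into k contiguous folds of near-equal size."""
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in fold_sizes:
        folds.append(list(range(start, start + size)))
        start += size
    return folds

def cross_validate(n, k, error_fn):
    """Each fold is the validation set once; return the average error."""
    folds = k_fold_indices(n, k)
    errors = []
    for i, test_idx in enumerate(folds):
        train_idx = [j for f in folds[:i] + folds[i + 1:] for j in f]
        errors.append(error_fn(train_idx, test_idx))
    return sum(errors) / k

folds = k_fold_indices(8, 4)   # [[0, 1], [2, 3], [4, 5], [6, 7]]
```

In practice the folds would be shuffled (or stratified by class) before splitting; contiguous folds keep the sketch short.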
  29. Feature Selection: IMPORTANT! (Noise, overfitting, curse of
    dimensionality, efficiency)
    - Domain knowledge
    - Variance threshold
    - Exhaustive search
    - Decision trees
    - …
    Simplest example: Greedy Backward Selection.
    Start: X = [x1, x2, x3, x4] → X = [x1, x3, x4] → X = [x1, x3]; stop if d = k.
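Greedy backward selection drops one feature per round, keeping whichever subset scores best, until d features remain. Here `score` is a hypothetical evaluation function (in practice, cross-validated model performance); the per-feature utility table is a made-up stand-in:

```python
def backward_selection(features, d, score):
    """Repeatedly remove the feature whose removal hurts score() the least."""
    selected = list(features)
    while len(selected) > d:
        # try removing each remaining feature, keep the best-scoring subset
        candidates = [[f for f in selected if f != drop] for drop in selected]
        selected = max(candidates, key=score)
    return selected

# Toy score: an assumed additive per-feature utility
utility = {"x1": 4, "x2": 1, "x3": 3, "x4": 2}
best = backward_selection(["x1", "x2", "x3", "x4"], 2,
                          score=lambda fs: sum(utility[f] for f in fs))
# with this toy score, the two highest-utility features remain
```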
  30. Dimensionality Reduction • Transformation onto a new feature subspace •

    e.g., Principal Component Analysis (PCA) • Find directions of maximum variance • Retain most of the information
  31. PCA in 3 Steps
    0. Standardize the data: z = (xik − μk) / σk
    1. Compute the covariance matrix:
       σjk = 1/(n−1) ∑i (xij − μj)(xik − μk)

       Σ = | σ1²  σ12  σ13  σ14 |
           | σ21  σ2²  σ23  σ24 |
           | σ31  σ32  σ3²  σ34 |
           | σ41  σ42  σ43  σ4² |
  32. PCA in 3 Steps
    2. Eigendecomposition of the covariance matrix, Σv = λv, and sorting the
       eigenvalues from high to low:
       Eigenvectors
       [[ 0.52237162 -0.37231836 -0.72101681  0.26199559]
        [-0.26335492 -0.92555649  0.24203288 -0.12413481]
        [ 0.58125401 -0.02109478  0.14089226 -0.80115427]
        [ 0.56561105 -0.06541577  0.6338014   0.52354627]]
       Eigenvalues
       [ 2.93035378  0.92740362  0.14834223  0.02074601]
  33. PCA in 3 Steps
    3. Select the top k eigenvectors (eigenvectors/eigenvalues as on slide 32)
       and transform the data. [First 2 PCs of Iris]
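The three PCA steps can be sketched with NumPy; random data of the Iris shape (150 samples, 4 features) stands in for the real dataset:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(150, 4))          # stand-in for the 150 x 4 Iris features

# 0./1. standardize, then compute the covariance matrix
X_std = (X - X.mean(axis=0)) / X.std(axis=0)
cov = np.cov(X_std, rowvar=False)      # sigma_jk = 1/(n-1) sum (x_ij-mu_j)(x_ik-mu_k)

# 2. eigendecomposition, sorted from high to low eigenvalue
eigvals, eigvecs = np.linalg.eigh(cov) # eigh: covariance matrices are symmetric
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# 3. keep the top k=2 eigenvectors and transform the data
W = eigvecs[:, :2]
X_pca = X_std @ W                      # shape (150, 2): the first 2 PCs
```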
  34. [Same supervised learning workflow diagram as slide 8, now focusing on
    final model evaluation and prediction on new data.]
  35. Inspiring Literature
    P. N. Klein. Coding the Matrix: Linear Algebra Through Computer Science
    Applications. Newtonian Press, 2013.
    R. Schutt and C. O’Neil. Doing Data Science: Straight Talk from the
    Frontline. O’Reilly Media, 2013.
    S. Gutierrez. Data Scientists at Work. Apress, 2014.
    R. O. Duda, P. E. Hart, and D. G. Stork. Pattern Classification. 2nd
    Edition. Wiley, New York, 2001.