Advanced Machine Learning with Scikit-Learn

Andreas Mueller

PyCon Amsterdam, April 14, 2016

Transcript

  1. Classification · Regression · Clustering · Semi-Supervised Learning · Feature Selection · Feature Extraction · Manifold Learning · Dimensionality Reduction · Kernel Approximation · Hyperparameter Optimization · Evaluation Metrics · Out-of-core Learning · …

  3. Overview
     • Reminder: basic scikit-learn concepts
     • Working with text data
     • Model building and evaluation: pipelines, randomized parameter search, scoring interface
     • Out-of-core learning: feature hashing, kernel approximation
     • New stuff in 0.17 and 0.18-dev: overview, calibration
  4. Representing Data

     X =  1.1  2.2  3.4  5.6  1.0        y =  1.6
          6.7  0.5  0.4  2.6  1.6             2.7
          2.4  9.3  7.3  6.4  2.8             4.4
          1.5  0.0  4.3  8.3  3.4             0.5
          0.5  3.5  8.1  3.6  4.6             0.2
          5.1  9.7  3.5  7.9  5.1             5.6
          3.7  7.8  2.6  3.2  6.3             6.7

     Each row of X is one sample, each column is one feature; y holds one output / label per sample.
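     As a quick illustration, the same data as NumPy arrays (the shape convention is what scikit-learn expects; the numbers are copied from the slide):

         import numpy as np

         # X: 7 samples (rows) x 5 features (columns)
         X = np.array([[1.1, 2.2, 3.4, 5.6, 1.0],
                       [6.7, 0.5, 0.4, 2.6, 1.6],
                       [2.4, 9.3, 7.3, 6.4, 2.8],
                       [1.5, 0.0, 4.3, 8.3, 3.4],
                       [0.5, 3.5, 8.1, 3.6, 4.6],
                       [5.1, 9.7, 3.5, 7.9, 5.1],
                       [3.7, 7.8, 2.6, 3.2, 6.3]])
         # y: one output / label per sample
         y = np.array([1.6, 2.7, 4.4, 0.5, 0.2, 5.6, 6.7])

         print(X.shape)  # (7, 5) -> (n_samples, n_features)
         print(y.shape)  # (7,)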
  8. Supervised Machine Learning

     Training data and training labels go into the model; the model then makes predictions on test data, and those predictions are evaluated against the test labels.

         clf = RandomForestClassifier()
         clf.fit(X_train, y_train)
         y_pred = clf.predict(X_test)
         clf.score(X_test, y_test)
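     A self-contained version of this workflow; load_iris and train_test_split are my additions for illustration, not part of the slide:

         from sklearn.datasets import load_iris
         from sklearn.ensemble import RandomForestClassifier
         from sklearn.model_selection import train_test_split  # sklearn.cross_validation before 0.18

         X, y = load_iris(return_X_y=True)
         X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

         clf = RandomForestClassifier()
         clf.fit(X_train, y_train)          # learn from training data and labels
         y_pred = clf.predict(X_test)       # predict labels for unseen data
         print(clf.score(X_test, y_test))   # mean accuracy on the test set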
  11. Unsupervised Transformations

      The model is fit on training data, then transforms new data:

          from sklearn.decomposition import PCA

          pca = PCA(n_components=3)
          pca.fit(X_train)
          X_new = pca.transform(X_test)
  12. Basic API

      estimator.fit(X, [y])

      estimator.predict        estimator.transform
      -----------------        -------------------
      Classification           Preprocessing
      Regression               Dimensionality reduction
      Clustering               Feature selection
                               Feature extraction
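      To show how uniform this API is, here is a sketch applying the same fit-based interface to a transformer and a clusterer (StandardScaler and KMeans are my choices, not the slide's):

          from sklearn.cluster import KMeans
          from sklearn.datasets import load_iris
          from sklearn.preprocessing import StandardScaler

          X, _ = load_iris(return_X_y=True)

          scaler = StandardScaler()          # preprocessing: fit + transform
          scaler.fit(X)
          X_scaled = scaler.transform(X)

          km = KMeans(n_clusters=3)          # clustering: fit + predict
          km.fit(X_scaled)
          print(km.predict(X_scaled)[:10])   # cluster assignments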
  13. Cross-Validation

      # sklearn.model_selection as of 0.18; LeaveOneLabelOut is now LeaveOneGroupOut
      from sklearn.cross_validation import (cross_val_score, ShuffleSplit,
                                            LeaveOneLabelOut)
      from sklearn.svm import SVC

      scores = cross_val_score(SVC(), X, y, cv=5)
      print(scores)
      >> [ 0.92  1.  1.  1.  1. ]

      cv_ss = ShuffleSplit(len(X_train), test_size=.3, n_iter=10)
      scores_shuffle_split = cross_val_score(SVC(), X, y, cv=cv_ss)

      cv_labels = LeaveOneLabelOut(labels)
      scores_pout = cross_val_score(SVC(), X, y, cv=cv_labels)
  15. Cross-validation splits: all data is divided into training data and held-out test data; the training data is split into 5 folds, and in each of the 5 splits a different fold serves as the validation fold.

  16. The same picture with parameter search: the cross-validation splits on the training data are used for finding parameters, and the held-out test data is used only for the final evaluation.
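      A small sketch that prints which samples land in the test fold of each split, mirroring the diagram. It uses KFold from sklearn.model_selection, its post-0.18 home (at the time of the talk the class lived in sklearn.cross_validation with a slightly different signature):

          import numpy as np
          from sklearn.model_selection import KFold

          X = np.arange(20).reshape(10, 2)
          for i, (train_idx, test_idx) in enumerate(KFold(n_splits=5).split(X)):
              print("split %d: test fold = %s" % (i + 1, test_idx))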
  17. Cross-Validated Grid Search

      from sklearn.grid_search import GridSearchCV           # sklearn.model_selection as of 0.18
      from sklearn.cross_validation import train_test_split

      X_train, X_test, y_train, y_test = train_test_split(X, y)

      param_grid = {'C': 10. ** np.arange(-3, 3),
                    'gamma': 10. ** np.arange(-3, 3)}

      grid = GridSearchCV(SVC(), param_grid=param_grid)
      grid.fit(X_train, y_train)
      grid.predict(X_test)
      grid.score(X_test, y_test)
  18. IMDB Movie Reviews Data

      Review: "One of the worst movies I've ever rented. Sorry it had one of my favorite actors on it (Travolta) in a nonsense role. In fact, anything made sense in this movie. Who can say there was true love between Eddy and Maureen? Don't you remember the beginning of the movie? Is she so lovely? Ask her daughters. I don't think so."
      Label: negative

      Training data: 12,500 positive and 12,500 negative reviews.
  19. Bag-of-Word Representations (CountVectorizer / TfidfVectorizer)

      "This is how you get ants."
      → tokenizer → ['this', 'is', 'how', 'you', 'get', 'ants']
      → build a vocabulary over all documents → ['aardvak', 'amsterdam', 'ants', ..., 'you', 'your', 'zyxst']
      → sparse matrix encoding → [0, …, 0, 1, 0, …, 0, 1, 0, …, 0, 1, 0, …, 0]
        (one column per vocabulary word; ones at the columns for 'ants', 'get', 'you')
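      A minimal runnable sketch of this pipeline with CountVectorizer (the two toy documents are mine; get_feature_names_out is the post-1.0 spelling of get_feature_names):

          from sklearn.feature_extraction.text import CountVectorizer

          docs = ["This is how you get ants.",
                  "Ants get you."]

          vect = CountVectorizer()
          X = vect.fit_transform(docs)         # fit: build vocabulary; transform: sparse counts
          print(vect.get_feature_names_out())  # the learned vocabulary
          print(X.toarray())                   # dense view of the sparse count matrix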
  23. N-grams (unigrams and bigrams, CountVectorizer / TfidfVectorizer)

      "This is how you get ants."
      → unigram tokenizer → ['this', 'is', 'how', 'you', 'get', 'ants']
      → bigram tokenizer  → ['this is', 'is how', 'how you', 'you get', 'get ants']
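      A short sketch of the ngram_range parameter that selects this behavior:

          from sklearn.feature_extraction.text import CountVectorizer

          vect = CountVectorizer(ngram_range=(1, 2))  # unigrams and bigrams
          vect.fit(["This is how you get ants."])
          print(vect.get_feature_names_out())
          # ['ants' 'get' 'get ants' 'how' 'how you' 'is' 'is how'
          #  'this' 'this is' 'you' 'you get']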
  26. Pipelines

      pipe = make_pipeline(T1(), T2(), Classifier())

      pipe.fit(X, y):
          T1.fit(X, y);   X1 = T1.transform(X)
          T2.fit(X1, y);  X2 = T2.transform(X1)
          Classifier.fit(X2, y)

      pipe.predict(X'):
          X'1 = T1.transform(X')
          X'2 = T2.transform(X'1)
          y' = Classifier.predict(X'2)
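      A runnable sketch of such a pipeline; StandardScaler, PCA and SVC stand in for T1, T2 and Classifier (my choices for illustration):

          from sklearn.datasets import load_iris
          from sklearn.decomposition import PCA
          from sklearn.model_selection import train_test_split
          from sklearn.pipeline import make_pipeline
          from sklearn.preprocessing import StandardScaler
          from sklearn.svm import SVC

          X_train, X_test, y_train, y_test = train_test_split(
              *load_iris(return_X_y=True), random_state=0)

          pipe = make_pipeline(StandardScaler(), PCA(n_components=2), SVC())
          pipe.fit(X_train, y_train)         # fit + transform each step, then fit the classifier
          print(pipe.score(X_test, y_test))  # transforms are re-applied before predicting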
  27. Randomized Parameter Search (source: Bergstra and Bengio)

      • Step-size free for continuous parameters
      • Decouples runtime from search-space size
      • Robust against irrelevant parameters
  28. Randomized Parameter Search

      params = {'featureunion__countvectorizer-1__ngram_range': [(1, 3), (1, 5), (2, 5)],
                'featureunion__countvectorizer-2__ngram_range': [(1, 1), (1, 2), (2, 2)],
                'linearsvc__C': 10. ** np.arange(-3, 3)}
  29. The same search space with a continuous distribution for C instead of a grid, and the RandomizedSearchCV call:

      params = {'featureunion__countvectorizer-1__ngram_range': [(1, 3), (1, 5), (2, 5)],
                'featureunion__countvectorizer-2__ngram_range': [(1, 1), (1, 2), (2, 2)],
                'linearsvc__C': expon()}

      rs = RandomizedSearchCV(text_pipe, param_distributions=params, n_iter=50)
  31. Randomized Parameter Search

      • Always use distributions for continuous variables.
      • Don't use it for low-dimensional search spaces.
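      A minimal end-to-end sketch; since the talk's text_pipe is not shown, a plain SVC on the digits data stands in, and the distributions' scales are illustrative guesses:

          from scipy.stats import expon
          from sklearn.datasets import load_digits
          from sklearn.model_selection import RandomizedSearchCV  # sklearn.grid_search before 0.18
          from sklearn.svm import SVC

          X, y = load_digits(return_X_y=True)

          # continuous parameters get distributions, not grids
          param_distributions = {'C': expon(scale=10), 'gamma': expon(scale=0.001)}
          rs = RandomizedSearchCV(SVC(), param_distributions=param_distributions,
                                  n_iter=20, random_state=0)
          rs.fit(X, y)
          print(rs.best_params_)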
  32. rfe = RFE(LogisticRegression())
      param_grid = {'n_features_to_select': range(1, n_features)}
      gridsearch = GridSearchCV(rfe, param_grid)
      gridsearch.fit(X, y)

      # or let RFECV pick the number of features via cross-validation:
      rfecv = RFECV(LogisticRegression())
      rfecv.fit(X, y)
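      A self-contained sketch of RFECV on synthetic data (the make_classification setup is mine):

          from sklearn.datasets import make_classification
          from sklearn.feature_selection import RFECV
          from sklearn.linear_model import LogisticRegression

          # only 3 of the 10 features are informative by construction
          X, y = make_classification(n_samples=200, n_features=10,
                                     n_informative=3, random_state=0)

          rfecv = RFECV(LogisticRegression())  # picks n_features_to_select via cross-validation
          rfecv.fit(X, y)
          print(rfecv.n_features_)   # number of features kept
          print(rfecv.support_)      # mask of selected features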
  34. 64

  35. Linear Models                Feature Selection   Tree-Based Models [possible]
      -------------                -----------------   ----------------------------
      LogisticRegressionCV [new]   RFECV               [DecisionTreeCV]
      RidgeCV                                          [RandomForestClassifierCV]
      RidgeClassifierCV                                [GradientBoostingClassifierCV]
      LarsCV
      ElasticNetCV
      ...
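      As one example from this list, a sketch of LogisticRegressionCV; the breast_cancer data and the scaling step are my additions:

          from sklearn.datasets import load_breast_cancer
          from sklearn.linear_model import LogisticRegressionCV
          from sklearn.preprocessing import StandardScaler

          X, y = load_breast_cancer(return_X_y=True)
          X = StandardScaler().fit_transform(X)

          # fits the whole regularization path under cross-validation in one go,
          # instead of refitting from scratch for every grid point
          clf = LogisticRegressionCV(Cs=10)
          clf.fit(X, y)
          print(clf.C_)  # the C chosen by the internal cross-validation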
  36. • Large scale – "Out of core: fits on a hard disk but not in RAM"
      • Non-linear – because real-world problems are not linear
      • Single CPU – because parallelization is hard (and often unnecessary)
  38. Think twice!

      • Old laptop: 4 GB RAM → 1,073,741,824 float32 values, i.e. ~1 million samples with 1000 features
      • EC2: 256 GB RAM → 68,719,476,736 float32 values, i.e. ~68 million samples with 1000 features
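      The arithmetic behind these numbers, assuming 4 bytes per float32 value:

          print(4 * 2**30 // 4)            # 1073741824 values fit in 4 GB
          print(256 * 2**30 // 4)          # 68719476736 values fit in 256 GB
          print(256 * 2**30 // 4 // 1000)  # ~68 million samples with 1000 features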

  40. Supported Algorithms

      • All SGDClassifier derivatives
      • Naive Bayes
      • MiniBatchKMeans
      • IncrementalPCA
      • MiniBatchDictionaryLearning
      • MultilayerPerceptron (dev branch)
      • Scalers
  41. Out-of-Core Learning

      sgd = SGDClassifier()
      for i in range(9):
          X_batch, y_batch = cPickle.load(open("batch_%02d" % i, "rb"))
          sgd.partial_fit(X_batch, y_batch, classes=range(10))

      Possibly go over the data multiple times.
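      A self-contained variant of this loop; synthetic data split into nine in-memory batches stands in for the pickled files on disk:

          import numpy as np
          from sklearn.datasets import make_classification
          from sklearn.linear_model import SGDClassifier

          # stand-in for data living on disk: one big array, split into 9 batches
          X, y = make_classification(n_samples=4500, n_features=20, n_classes=10,
                                     n_informative=10, random_state=0)

          sgd = SGDClassifier()
          for X_batch, y_batch in zip(np.array_split(X, 9), np.array_split(y, 9)):
              # classes must be passed so every call knows the full label set
              sgd.partial_fit(X_batch, y_batch, classes=np.arange(10))
          print(sgd.score(X, y))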
  42. Text Classification: Bag of Words

      As before: "This is how you get ants." → tokenizer → build a vocabulary over all documents → sparse matrix encoding. Building the vocabulary requires a full pass over all documents before anything can be encoded.
  43. Text Classification: Hashing Trick

      "This is how you get ants."
      → tokenizer → ['this', 'is', 'how', 'you', 'get', 'ants']
      → hashing   → [hash('this'), hash('is'), hash('how'), hash('you'), hash('get'), hash('ants')]
                  = [832412, 223788, 366226, 81185, 835749, 173092]
      → sparse matrix encoding (the hash values are the column indices, so no vocabulary is needed)
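      A sketch with HashingVectorizer, which implements this trick in scikit-learn; transform needs no prior fit because there is no vocabulary to learn:

          from sklearn.feature_extraction.text import HashingVectorizer

          # token -> column index via a hash function, so the vectorizer is
          # stateless and works batch by batch
          vect = HashingVectorizer(n_features=2**20)
          X = vect.transform(["This is how you get ants."])
          print(X.shape)  # (1, 1048576), sparse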
  44. Complexity

      • Solving a kernelized SVM: ~O(n_samples³)
      • Solving a linear (primal) SVM: ~O(n_samples × n_features)

      n_samples large? Go primal!
  45. Usage

      sgd = SGDClassifier()
      kernel_approximation = RBFSampler(gamma=.001, n_components=400)
      for i in range(9):
          X_batch, y_batch = cPickle.load(open("batch_%02d" % i, "rb"))
          if i == 0:
              kernel_approximation.fit(X_batch)  # fit the approximation once, on the first batch
          X_transformed = kernel_approximation.transform(X_batch)
          sgd.partial_fit(X_transformed, y_batch, classes=range(10))
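      The same idea as a self-contained sketch on the digits data (dataset and split are my additions):

          from sklearn.datasets import load_digits
          from sklearn.kernel_approximation import RBFSampler
          from sklearn.linear_model import SGDClassifier
          from sklearn.model_selection import train_test_split

          X_train, X_test, y_train, y_test = train_test_split(
              *load_digits(return_X_y=True), random_state=0)

          # approximate an RBF kernel with random features, then go primal
          rbf = RBFSampler(gamma=0.001, n_components=400, random_state=0)
          X_train_t = rbf.fit_transform(X_train)

          sgd = SGDClassifier()
          sgd.fit(X_train_t, y_train)
          print(sgd.score(rbf.transform(X_test), y_test))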
  46. How

      • Implement a fit method
      • Provide set_params and get_params (or inherit them from BaseEstimator)
      • Run check_estimator

      See the "build your own estimator" docs!
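      A minimal sketch of a custom transformer along these lines; the MeanCenterer class is a made-up toy example, and the check_estimator call is left commented out because some of its stricter checks can fail for deliberately minimal estimators:

          import numpy as np
          from sklearn.base import BaseEstimator, TransformerMixin
          from sklearn.utils.validation import check_array

          class MeanCenterer(BaseEstimator, TransformerMixin):
              """Toy transformer: subtracts the per-feature mean learned in fit."""

              def fit(self, X, y=None):
                  X = check_array(X)
                  self.mean_ = X.mean(axis=0)
                  return self                 # fit must return self

              def transform(self, X):
                  X = check_array(X)
                  return X - self.mean_

          # get_params / set_params come for free by inheriting from BaseEstimator
          X = np.array([[1.0, 2.0], [3.0, 4.0]])
          print(MeanCenterer().fit_transform(X))  # columns now have zero mean

          # API conformance tests:
          # from sklearn.utils.estimator_checks import check_estimator
          # check_estimator(MeanCenterer())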
  47. Latent Dirichlet Allocation using online variational inference
      By Chyi-Kwei Yau, based on code by Matt Hoffman

      Topic #0: government people mr law gun state president states public use right rights national new control american security encryption health united
      Topic #1: drive card disk bit scsi use mac memory thanks pc does video hard speed apple problem used data monitor software
      Topic #2: said people armenian armenians turkish did saw went came women killed children turkey told dead didn left started greek war
      Topic #3: year good just time game car team years like think don got new play games ago did season better ll
      Topic #4: 10 00 15 25 12 11 20 14 17 16 db 13 18 24 30 19 27 50 21 40
      Topic #5: windows window program version file dos use files available display server using application set edu motif package code ms software
      Topic #6: edu file space com information mail data send available program ftp email entry info list output nasa address anonymous internet
      Topic #7: ax max b8f g9v a86 pl 145 1d9 0t 34u 1t 3t giz bhj wm 2di 75u 2tm bxn 7ey
      Topic #8: god people jesus believe does say think israel christian true life jews did bible don just know world way church
      Topic #9: don know like just think ve want does use good people key time way make problem really work say need
  48. Coordinate Descent Solver for Non-Negative Matrix Factorization
      By Tom Dupre la Tour and Mathieu Blondel

      Topics in NMF model:
      Topic #0: don people just like think know time good right ve make say want did really way new use going said
      Topic #1: windows file dos files window program use running using version ms problem server pc screen ftp run application os software
      Topic #2: god jesus bible christ faith believe christians christian heaven sin hell life church truth lord say belief does existence man
      Topic #3: geb dsl n3jxp chastity cadre shameful pitt intellect skepticism surrender gordon banks soon edu lyme blood weight patients medical probably
      Topic #4: key chip encryption clipper keys escrow government algorithm secure security encrypted public des nsa enforcement bit privacy law secret use
      Topic #5: drive scsi ide drives disk hard controller floppy hd cd mac boot rom cable internal tape bus seagate bios quantum
      Topic #6: game team games players year hockey season play win league teams nhl baseball player detroit toronto runs pitching best playoffs
      Topic #7: thanks mail does know advance hi info looking anybody address appreciated help email information send ftp post interested list appreciate
      Topic #8: card video monitor vga bus drivers cards color driver ram ati mode memory isa graphics vesa pc vlb diamond bit
      Topic #9: 00 sale 50 shipping 20 10 price 15 new 25 30 dos offer condition 40 cover asking 75 interested 01
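      A sketch of how such topics are produced; the 20 newsgroups subset, vectorizer settings, and topic count are my choices, so the resulting topics will only roughly resemble the ones above (get_feature_names_out is the post-1.0 spelling of get_feature_names):

          from sklearn.datasets import fetch_20newsgroups
          from sklearn.decomposition import NMF
          from sklearn.feature_extraction.text import TfidfVectorizer

          # small subset to keep the example quick; downloads on first use
          docs = fetch_20newsgroups(remove=('headers', 'footers', 'quotes')).data[:2000]

          vect = TfidfVectorizer(max_features=5000, stop_words='english')
          X = vect.fit_transform(docs)

          nmf = NMF(n_components=10, random_state=0)
          nmf.fit(X)

          words = vect.get_feature_names_out()
          for k, comp in enumerate(nmf.components_):
              top = comp.argsort()[:-11:-1]  # 10 highest-weighted words
              print("Topic #%d:" % k, " ".join(words[i] for i in top))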
  49. VotingClassifier

      clf1 = LogisticRegression()
      clf2 = RandomForestClassifier()
      clf3 = GaussianNB()
      eclf = VotingClassifier(
          estimators=[('lr', clf1), ('rf', clf2), ('gnb', clf3)],
          voting='hard')
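      A runnable usage sketch on iris (dataset and scoring are my additions):

          from sklearn.datasets import load_iris
          from sklearn.ensemble import RandomForestClassifier, VotingClassifier
          from sklearn.linear_model import LogisticRegression
          from sklearn.model_selection import cross_val_score
          from sklearn.naive_bayes import GaussianNB

          X, y = load_iris(return_X_y=True)
          eclf = VotingClassifier(
              estimators=[('lr', LogisticRegression(max_iter=1000)),
                          ('rf', RandomForestClassifier()),
                          ('gnb', GaussianNB())],
              voting='hard')  # majority vote over the predicted labels
          print(cross_val_score(eclf, X, y).mean())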
  50. Gaussian Process Rewrite
      By Jan Hendrik Metzen

      34.4**2 * RBF(length_scale=41.8)
      + 3.27**2 * RBF(length_scale=180) * ExpSineSquared(length_scale=1.44, periodicity=1)
      + 0.446**2 * RationalQuadratic(alpha=17.7, length_scale=0.957)
      + 0.197**2 * RBF(length_scale=0.138)
      + WhiteKernel(noise_level=0.0336)
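      A sketch reconstructing this kernel with sklearn.gaussian_process.kernels, as introduced by the 0.18 rewrite; passing it to GaussianProcessRegressor and fitting would re-optimize the hyperparameters:

          from sklearn.gaussian_process import GaussianProcessRegressor
          from sklearn.gaussian_process.kernels import (RBF, ExpSineSquared,
                                                        RationalQuadratic, WhiteKernel)

          # long-term trend + periodic seasonal component
          # + medium-term irregularities + short-term variation + noise
          kernel = (34.4**2 * RBF(length_scale=41.8)
                    + 3.27**2 * RBF(length_scale=180)
                      * ExpSineSquared(length_scale=1.44, periodicity=1)
                    + 0.446**2 * RationalQuadratic(alpha=17.7, length_scale=0.957)
                    + 0.197**2 * RBF(length_scale=0.138)
                    + WhiteKernel(noise_level=0.0336))

          gp = GaussianProcessRegressor(kernel=kernel)
          print(kernel)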