Upgrade to Pro
— share decks privately, control downloads, hide ads and more …
Speaker Deck
Features
Speaker Deck
PRO
Sign in
Sign up for free
Search
Search
How to use scikit-learn to solve machine learni...
Search
Sponsored
·
Your Podcast. Everywhere. Effortlessly.
Share. Educate. Inspire. Entertain. You do you. We'll handle the rest.
→
Olivier Grisel
April 22, 2015
Technology
1.1k
0
Share
How to use scikit-learn to solve machine learning problems
AutoML Hackathon - Paris - April 2015
Olivier Grisel
April 22, 2015
More Decks by Olivier Grisel
See All by Olivier Grisel
Intro to scikit-learn
ogrisel
5
730
An Intro to Deep Learning
ogrisel
1
320
Predictive Modeling and Deep Learning
ogrisel
2
380
Intro to scikit-learn and what's new in 0.17
ogrisel
1
400
Big Data, Predictive Modeling and tools
ogrisel
2
310
Recent Developments in Deep Learning
ogrisel
3
710
Documentation
ogrisel
2
270
Build and test wheel packages on Linux, OSX and Windows
ogrisel
2
360
Big Data and Predictive Modeling
ogrisel
3
250
Other Decks in Technology
See All in Technology
Physical AI on AWS リファレンスアーキテクチャ / Physical AI on AWS Reference Architecture
aws_shota
1
330
「決め方」の渡し方 / How to hand over the "decision-making process"
pauli
4
450
GitHub Advanced Security × Defender for Cloudで開発とSecOpsのサイロを超える: コードとクラウドをつなぐ、開発プラットフォームのセキュリティ
yuriemori
1
120
"まず試す"ためのDatabricks Apps活用法 / Databricks Apps for Early Experiments and Validation
nttcom
1
100
40代からのアウトプット ― 経験は価値ある学びに変わる / 20260404 Naoki Takahashi
shift_evolve
PRO
5
770
Databricks Lakehouse Federationで 運用負荷ゼロのデータ連携
nek0128
0
110
バックオフィスPJのPjMをコーポレートITが担うとうまくいく3つの理由
yueda256
1
220
「活動」は激変する。「ベース」は変わらない ~ 4つの軸で捉える_AI時代ソフトウェア開発マネジメント
sentokun
0
140
【AWS】CloudTrail LakeとCloudWatch Logs Insightsの使い分け方針
tsurunosd
0
130
非同期・イベント駆動処理の分散トレーシングの繋げ方
ichikawaken
1
260
TUNA Camp 2026 京都Stage ヒューリスティックアルゴリズム入門
terryu16
0
670
Network Firewall Proxyで 自前プロキシを消し去ることができるのか
gusandayo
0
180
Featured
See All Featured
Build The Right Thing And Hit Your Dates
maggiecrowley
39
3.1k
Fashionably flexible responsive web design (full day workshop)
malarkey
408
66k
Designing Dashboards & Data Visualisations in Web Apps
destraynor
231
54k
Leadership Guide Workshop - DevTernity 2021
reverentgeek
1
260
The Mindset for Success: Future Career Progression
greggifford
PRO
0
290
Producing Creativity
orderedlist
PRO
348
40k
Un-Boring Meetings
codingconduct
0
250
A Tale of Four Properties
chriscoyier
163
24k
Let's Do A Bunch of Simple Stuff to Make Websites Faster
chriscoyier
508
140k
GraphQLとの向き合い方2022年版
quramy
50
14k
The browser strikes back
jonoalderson
0
870
We Are The Robots
honzajavorek
0
210
Transcript
How to use scikit-learn to solve machine learning problems AutoML
Hackathon April 2015
Outline • Machine Learning refresher • scikit-learn • Demo: interactive
predictive modeling on Census Data with IPython notebook / pandas / scikit-learn • Combining models with Pipeline and parameter search
Predictive modeling ~= machine learning • Make predictions of outcome
on new data • Extract the structure of historical data • Statistical tools to summarize the training data into a executable predictive model • Alternative to hard-coded rules written by experts
type (category) # rooms (int) surface (float m2) public trans
(boolean) Apartment 3 50 TRUE House 5 254 FALSE Duplex 4 68 TRUE Apartment 2 32 TRUE
type (category) # rooms (int) surface (float m2) public trans
(boolean) Apartment 3 50 TRUE House 5 254 FALSE Duplex 4 68 TRUE Apartment 2 32 TRUE sold (float k€) 450 430 712 234
type (category) # rooms (int) surface (float m2) public trans
(boolean) Apartment 3 50 TRUE House 5 254 FALSE Duplex 4 68 TRUE Apartment 2 32 TRUE sold (float k€) 450 430 712 234 features target samples (train)
type (category) # rooms (int) surface (float m2) public trans
(boolean) Apartment 3 50 TRUE House 5 254 FALSE Duplex 4 68 TRUE Apartment 2 32 TRUE sold (float k€) 450 430 712 234 features target samples (train) Apartment 2 33 TRUE House 4 210 TRUE samples (test) ? ?
Training text docs images sounds transactions Labels Machine Learning Algorithm
Model Predictive Modeling Data Flow Feature vectors
New text doc image sound transaction Model Expected Label Predictive
Modeling Data Flow Feature vector Training text docs images sounds transactions Labels Machine Learning Algorithm Feature vectors
Inventory forecasting & trends detection Predictive modeling in the wild
Personalized radios Fraud detection Virality and readers engagement Predictive maintenance Personality matching
• Library of Machine Learning algorithms • Focus on established
methods (e.g. ESL-II) • Open Source (BSD) • Simple fit / predict / transform API • Python / NumPy / SciPy / Cython • Model Assessment, Selection & Ensembles
Train data Train labels Model Fitted model Test data Predicted
labels Test labels Evaluation model = ModelClass(**hyperparams) model.fit(X_train, y_train)
Train data Train labels Model Fitted model Test data Predicted
labels Test labels Evaluation model = ModelClass(**hyperparams) model.fit(X_train, y_train) y_pred = model.predict(X_test)
Train data Train labels Model Fitted model Test data Predicted
labels Test labels Evaluation model = ModelClass(**hyperparams) model.fit(X_train, y_train) y_pred = model.predict(X_test) accuracy_score(y_test, y_pred)
Support Vector Machine from sklearn.svm import SVC model = SVC(kernel="rbf",
C=1.0, gamma=1e-4) model.fit(X_train, y_train) y_predicted = model.predict(X_test) from sklearn.metrics import f1_score f1_score(y_test, y_predicted)
Linear Classifier from sklearn.linear_model import SGDClassifier model = SGDClassifier(alpha=1e-4, penalty="elasticnet")
model.fit(X_train, y_train) y_predicted = model.predict(X_test) from sklearn.metrics import f1_score f1_score(y_test, y_predicted)
Random Forests from sklearn.ensemble import RandomForestClassifier model = RandomForestClassifier(n_estimators=200) model.fit(X_train,
y_train) y_predicted = model.predict(X_test) from sklearn.metrics import f1_score f1_score(y_test, y_predicted)
None
None
Demo time! http://nbviewer.ipython.org/github/ogrisel/notebooks/blob/ master/sklearn_demos/Income%20classification.ipynb https://github.com/ogrisel/notebooks
Combining Models from sklearn.preprocessing import StandardScaler from sklearn.decomposition import RandomizedPCA
from sklearn.svm import SVC scaler = StandardScaler() X_train_scaled = scaler.fit_transform(X_train) pca = RandomizedPCA(n_components=10) X_train_pca = pca.fit_transform(X_train_scaled) svm = SVC(C=0.1, gamma=1e-3) svm.fit(X_train_pca, y_train)
Pipeline from sklearn.preprocessing import StandardScaler from sklearn.decomposition import RandomizedPCA from
sklearn.svm import SVC from sklearn.pipeline import make_pipeline pipeline = make_pipeline( StandardScaler(), RandomizedPCA(n_components=10), SVC(C=0.1, gamma=1e-3), ) pipeline.fit(X_train, y_train)
Scoring manually stacked models scaler = StandardScaler() X_train_scaled = scaler.fit_transform(X_train)
pca = RandomizedPCA(n_components=10) X_train_pca = pca.fit_transform(X_train_scaled) svm = SVC(C=0.1, gamma=1e-3) svm.fit(X_train_pca, y_train) X_test_scaled = scaler.transform(X_test) X_test_pca = pca.transform(X_test_scaled) y_pred = svm.predict(X_test_pca) accuracy_score(y_test, y_pred)
Scoring a pipeline pipeline = make_pipeline( RandomizedPCA(n_components=10), SVC(C=0.1, gamma=1e-3), )
pipeline.fit(X_train, y_train) y_pred = pipeline.predict(X_test) accuracy_score(y_test, y_pred)
Parameter search import numpy as np from sklearn.grid_search import RandomizedSearchCV
params = { 'randomizedpca__n_components': [5, 10, 20], 'svc__C': np.logspace(-3, 3, 7), 'svc__gamma': np.logspace(-6, 0, 7), } search = RandomizedSearchCV(pipeline, params, n_iter=30, cv=5) search.fit(X_train, y_train) # search.best_params_, search.grid_scores_
Thank you! • http://scikit-learn.org • https://github.com/scikit-learn/scikit-learn @ogrisel