A gentle introduction to Machine Learning, consisting of Supervised & Unsupervised ML, the End-to-End Process of ML, Data Preprocessing and Feature Engineering, and also Evaluation Metrics for ML. This is a workshop session.
Working as a Data Scientist @Bukalapak, and previously a Technical Content Reviewer @Packt Publishing (remote). I am also passionate about Time Series Analytics, Immersive Computing (VR & AR), and Gamification in Business. That's it!
Know the Workshop Scope: this workshop is at an introductory level, so please don't expect to become an expert Data Scientist/ML Engineer after attending it.
Programming Language using Python: there will be a hands-on session coding in Python. If you are not familiar with Python (or don't have any programming background), expect to just follow along with the syntax.
No Repetition for Disconnection Accidents: please expect that there won't be any content repetition during the workshop. If you get disconnected, you can rewind the material later using the recorded video from Purwadhika.
High-Level: talk in various spectrums, from Technology, Business, Sociology, to Economy, to widen and enlarge the point of view and paradigm. 100% Theory, 0% Practice @ 1 hour (Module 1)
Med-Level: talk specifically in one or two domains, describe the process from upstream to downstream, and do a bit of coding. 50% Theory, 50% Practice @ 1 hour (Module 2)
Low-Level: talk fundamentally in one domain, answer the "how" question, get ready to get your hands dirty. 0% Theory, 100% Practice @ 1 hour (Module 3)
01 High-level discussion: Data Science, an unpredictable tale. How to get inspired by data science from a different perspective.
02 Med-level discussion: an end-to-end process of how machine learning works. Try to code.
03 Hands-on real industry case: get your hands dirty on data forecasting and predictive modelling.
Entry: understand the differences between Supervised, Unsupervised, and Deep Learning. Know how to determine and present the best model. Early level of EDA (Exploratory Data Analysis).
Medior: understand how to increase model accuracy, handle data problems (imbalanced data, missing values), and be proficient in model selection. Able to do Feature Engineering.
Senior: expert level of increasing model accuracy (new Deep Learning architectures), very proficient at handling data problems, able to propose new algorithms for different data cases.
From there: Managerial Route or Technical Route.
Say hello to Machine Learning: get acquainted with the Machine Learning types, such as Supervised Learning and Unsupervised Learning.
Tools and End-to-End Process: get in touch with the end-to-end process of doing Machine Learning, along with the most recent tools.
Data, Metrics, and Model Selection: understand the mechanism of data behavior and model selection through its metrics.
Commonly used terms
Data: can be numeric/alphabetic/picture/video/etc.
Machine Learning Model: a simplified program that can be taught with data (input) to predict an output.
Commonly used terms
Train Data: data used to train the ML model.
Testing Data: data used to test the ML model's accuracy.
Feature Engineering: using specific domain knowledge to produce new features based on existing features.
Traditional Programming: Input (Data) + Static Code/Syntax → Output (Data)
Machine Learning Programming: Input (Data) + Output (Data) → Train ML Model (Learn) → ML Model (Program) → Prediction (from new Input)
Traditional Programming
Input: Height = [145, 154, 177, 150, 170]
Static Program (IF logic):
IF Height < 150 then Short
IF Height >= 150 and Height <= 175 then Average
IF Height > 175 then Tall
Output: [Short, Average, Tall, Average, Average]
Machine Learning Programming
Input: Height = [145, 154, 177, 150, 170], Classification_Label = [Short, Average, Tall, Average, Average]
Machine Learning Algorithm: train the model using Height and Classification_Label
Prediction using the trained ML model: New Height = 190 → New Classification Label = Tall
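To make the contrast concrete, here is a minimal Python sketch of the same height example, assuming scikit-learn is available; the DecisionTreeClassifier is just one possible choice of learner.

```python
from sklearn.tree import DecisionTreeClassifier

heights = [145, 154, 177, 150, 170]

# Traditional programming: the rules are written by hand.
def classify_height(h):
    if h < 150:
        return "Short"
    elif h <= 175:
        return "Average"
    else:
        return "Tall"

print([classify_height(h) for h in heights])
# ['Short', 'Average', 'Tall', 'Average', 'Average']

# Machine learning programming: the rules are learned from input + output pairs.
X = [[h] for h in heights]                                # input (predictor)
y = ["Short", "Average", "Tall", "Average", "Average"]    # output (label)
model = DecisionTreeClassifier().fit(X, y)                # train the model

print(model.predict([[190]]))                             # expected: ['Tall']
```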
Supervised Machine Learning: teach the ML model using Predictor Variable X and Label Variable Y.
Unsupervised Machine Learning: teach the ML model using Predictor Variable X only, and let the model find structure (e.g., groupings) without a Label Variable Y.
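A small illustrative sketch of the difference, assuming scikit-learn; LogisticRegression and KMeans stand in for any supervised and unsupervised algorithm.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.cluster import KMeans

X = np.array([[145], [154], [177], [150], [170]])   # predictor variable X
y = np.array([0, 1, 2, 1, 1])                       # label variable Y (0=Short, 1=Average, 2=Tall)

# Supervised: the model is taught with both X and Y.
clf = LogisticRegression(max_iter=1000).fit(X, y)
print(clf.predict([[190]]))

# Unsupervised: the model sees only X and groups similar rows by itself.
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
print(labels)
```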
1. Ingestion and Analysis (50%): starts from collecting data, preprocessing, and doing exploratory data analysis.
2. Modeling and Evaluation (25%): utilizing machine learning algorithms to build an automation, and evaluating the built model's accuracy.
3. Adjustment and Deployment (25%): store the model weight matrices in a microservice and create an architecture workflow for the data pipeline, ready to deploy.
1. Research Hypothesis: conduct the research flow along with the hypotheses that might solve the problems.
2. Data Query Schema: determine which data to take, which tables, which features, etc.
3. Data Retrieval: retrieve data following the query schema; it could come from a data warehouse or from scraping the internet.
4. Data Preprocessing: clean the whole dataset, e.g., control outliers, transform or standardize, handle null values, etc. (see the sketch below).
5. Analysis and Visualization: analyze the preprocessed data and support or refute the research hypotheses by visualizing them with graphs or descriptive statistics.
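A hedged pandas sketch of step 4 (Data Preprocessing); the tiny DataFrame and column names ("price", "city") are hypothetical, not the workshop's dataset.

```python
import pandas as pd

df = pd.DataFrame({"price": [10, 12, None, 11, 500],
                   "city":  ["A", "B", "B", None, "A"]})

# Null value handling: fill numeric gaps with the median, categorical with the mode.
df["price"] = df["price"].fillna(df["price"].median())
df["city"] = df["city"].fillna(df["city"].mode()[0])

# Outlier control: clip extreme prices to the 1st-99th percentile range.
low, high = df["price"].quantile([0.01, 0.99])
df["price"] = df["price"].clip(low, high)

# Transform / standardize: z-score the numeric column.
df["price_std"] = (df["price"] - df["price"].mean()) / df["price"].std()
print(df)
```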
1. Research Methodology: choose a proper ML algorithm for the research objective.
2. Feature Engineering: produce new features from existing features.
3. Train Machine Learning Model: train the ML model on the train data.
4. Model Evaluation: evaluate model accuracy using the test data.
5. Model Selection: select the best model by highest accuracy/interpretability.
(A compact sketch of these steps follows below.)
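The sketch below walks through the steps with scikit-learn; the built-in breast cancer dataset and the two candidate models are only illustrative.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# 1. Research methodology: the objective is classification, so pick classifiers.
X, y = load_breast_cancer(return_X_y=True)

# 2. Feature engineering would add new columns here; we keep the raw features.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

candidates = {
    "logreg": LogisticRegression(max_iter=5000),
    "forest": RandomForestClassifier(random_state=42),
}

# 3 & 4. Train each ML model on the train data, evaluate on the held-out test data.
scores = {}
for name, model in candidates.items():
    model.fit(X_train, y_train)
    scores[name] = accuracy_score(y_test, model.predict(X_test))

# 5. Model selection: pick the highest accuracy.
best = max(scores, key=scores.get)
print(scores, "-> selected:", best)
```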
1. Deployment to Production: store the model weight matrices in a container that runs their requirements and dependencies.
2. Adjustment and Communication: ensure the model pipeline runs smoothly from upstream to downstream.
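One minimal way to persist a trained model so a container can load and serve it, assuming joblib is installed; the file name "model.joblib" and the iris model are just examples.

```python
import joblib
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000).fit(X, y)

joblib.dump(model, "model.joblib")    # store the fitted model (its learned weights)
loaded = joblib.load("model.joblib")  # the serving container loads it at startup
print(loaded.predict(X[:1]))
```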
Value Gap / Normalization Technique: normalizing ensures that features do not have massively different scales and variances, which would make optimization hard to converge. Two common options are the min-max scaler and standardization to a normal distribution (see below).
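A small sketch of the two scaling options named above, assuming scikit-learn.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X = np.array([[145.0], [154.0], [177.0], [150.0], [170.0]])

# Min-max scaler: squeezes every value into the [0, 1] range.
print(MinMaxScaler().fit_transform(X).ravel())

# Standard scaler: rescales to zero mean and unit variance ("normal distribution" style).
print(StandardScaler().fit_transform(X).ravel())
```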
Train Test Split
Train: the part on which your ML algorithms are actually trained to build a model (60% of your data).
Validation: used to validate your various model fits (20% of your data).
Test: used to test your model hypothesis; left untouched and unseen until the model and hyperparameters are decided (20% of your data).
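A 60/20/20 split can be built from two calls to scikit-learn's train_test_split; the toy arrays here are only for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

X, y = np.arange(100).reshape(50, 2), np.arange(50)

# First hold out 20% as the untouched test set.
X_temp, X_test, y_temp, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Then split the remaining 80% into 60% train and 20% validation (0.25 * 0.8 = 0.2).
X_train, X_val, y_train, y_val = train_test_split(X_temp, y_temp, test_size=0.25, random_state=42)

print(len(X_train), len(X_val), len(X_test))  # 30 10 10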
MAPE = (100% / n) * Σ | (actual_i − predicted_i) / actual_i |
Example, Model A evaluation:
MAPE = 7.9% ~ on average, the model is only 7.9% off when predicting Y. Distance principle: "the lower, the better".
R² = 88% ~ the given variables X explain 88% of the variance of the target variable Y. Percentage principle: it has a rule, the closer to 100% the better.
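Both metrics can be computed with scikit-learn (mean_absolute_percentage_error needs version 0.24+); the prediction values below are made up for illustration.

```python
import numpy as np
from sklearn.metrics import mean_absolute_percentage_error, r2_score

y_true = np.array([100, 120, 150, 130])
y_pred = np.array([ 92, 125, 160, 128])

mape = mean_absolute_percentage_error(y_true, y_pred)  # returns a fraction; x100 for percent
r2 = r2_score(y_true, y_pred)
print(f"MAPE = {mape:.1%}, R^2 = {r2:.2f}")
```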
False Positive (FP): predict an event when there is no event (bad).
False Negative (FN): predict no event when there is an event (bad).
True Positive (TP): predict an event when there is an event (good).
True Negative (TN): predict no event when there is no event (good).
Case: pregnancy as the event, a person (man/woman).
Logic: a man can't be pregnant; a woman can be pregnant.
FP: the ML model predicts a man is pregnant.
FN: the ML model predicts a woman is not pregnant (but in reality she is pregnant).
False Positive (FP): predicted the man is pregnant, but he is actually not pregnant.
False Negative (FN): predicted the woman is not pregnant, but she is actually pregnant.
True Positive (TP): predicted the woman is pregnant, and she is actually pregnant.
True Negative (TN): predicted the man is not pregnant, and he is actually not pregnant.
Data, Metrics, and Model Selection: metrics and their derivatives (categorical).
Precision = TP / (TP + FP) → stay aggressive
Recall = TP / (TP + FN) → stay careful
Precision: how often the events we predict are real events; it penalizes false positives (every predicted event should truly exist).
Recall: how many of the real events our model catches; it penalizes false negatives (better to flag an event than to miss one).
----------------------------------------------------------------------------------------------
Another example: a rain prediction for a man.
False Positive (FP): the man is told to bring an umbrella, but there is actually no rain the whole day.
False Negative (FN): the man is told not to bring an umbrella, but it actually rains the whole day.
If you were a businessman, which risk would you minimize first: FP or FN?
Data, Metrics, and Model Selection: metrics and their derivatives (categorical).
Precision = TP / (TP + FP) → stay aggressive
Recall = TP / (TP + FN) → stay careful
Precision: how often the events we predict are real events; every predicted event should be correct (penalizes false positives).
Recall: how many of the real events our model catches; better to flag an event than to miss one (penalizes false negatives).
----------------------------------------------------------------------------------------------
Another example: an email spam flagger.
False Positive (FP): an email is flagged as spam by the system, but it is actually not a spam message at all.
False Negative (FN): an email is not flagged as spam by the system, but it is actually spam and full of phishing links.
If you were a businessman, which risk would you minimize first: FP or FN?
Precision = TP / (TP + FP) → stay aggressive
Recall = TP / (TP + FN) → stay careful
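The same quantities can be read off a confusion matrix with scikit-learn; the labels below are a hypothetical spam-flagger run (1 = spam).

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

y_true = [1, 0, 1, 1, 0, 0, 1, 0]   # actual: spam or not
y_pred = [1, 0, 0, 1, 1, 0, 1, 0]   # what the model flagged

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print("TP:", tp, "FP:", fp, "FN:", fn, "TN:", tn)
print("precision =", precision_score(y_true, y_pred))  # TP / (TP + FP)
print("recall    =", recall_score(y_true, y_pred))     # TP / (TP + FN)
```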
Bias-Variance Tradeoff
Bias is the difference between the average prediction of our model and the correct value we are trying to predict.
Variance is the variability of the model's predictions for a given data point; it tells us how spread out the predictions are.
Underfitting
- The model is unable to capture the underlying pattern of the data
- High bias, low variance
- Usually caused by too little training data
- or by a model that is too simple and has very few parameters
Overfitting
- The model captures the noise along with the underlying pattern in the data
- Low bias, high variance
- Often caused by training heavily on a noisy dataset
- or by a model that is too complex and has too many parameters
(A small under/overfitting sketch follows below.)
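An illustrative sketch of both failure modes, assuming scikit-learn: the same noisy data fit with a too-simple, a reasonable, and a too-complex polynomial model; compare train vs test R².

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

rng = np.random.RandomState(0)
X = np.sort(rng.uniform(0, 3, 40)).reshape(-1, 1)
y = np.sin(2 * X).ravel() + rng.normal(scale=0.2, size=40)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

for degree in (1, 4, 15):  # too simple (underfit), about right, too complex (overfit)
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_train, y_train)
    print(degree,
          "train R2:", round(model.score(X_train, y_train), 2),
          "test R2:", round(model.score(X_test, y_test), 2))
```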
Closing notes for a data scientist in this era:
Take a serious focus on Explainable AI: interpretable ML is good, but explainable ML matters most (this skill set is one of the most promising fields of ML).
Know your Domain Science: Data Science is an iterative process, and anyone can become a DS by following the guided process. If you want to stand out, show your domain expertise.
Crown your impact, not your certificate(s): a long list of courses is awesome, but most importantly show the impact of your side projects/analyses, impact that can be quantified and is strong proof of a Data Scientist.