A gentle introduction to Machine Learning, consisting of Supervised & Unsupervised ML, the End-to-End Process of ML, Data Preprocessing and Feature Engineering, and also Evaluation Metrics for ML. This is a workshop session.
Working as a Data Scientist @Bukalapak, and previously a Technical Content Reviewer @Packt Publishing (remote). I am also passionate about Time Series Analytics, Immersive Computing (VR & AR), and Gamification in Business. That's it!
Know the Workshop Scope: this workshop is at an introductory level, so please don't expect to become an expert Data Scientist/ML Engineer after attending it.
Programming Language using Python: there will be a hands-on session coding in Python. If you are not familiar with Python (or don't have any programming background), expect to just follow along with the syntax.
No Repetition for Disconnection Accidents: please expect that there won't be any content repetition during the workshop. If you get disconnected, you can rewind the material later using the recorded video from Purwadhika.
High-Level: talk in various spectrums, from Technology, Business, Sociology, to Economy, to widen and enlarge the point of view and paradigm. 100% Theory, 0% Practice @ 1 hour (Module 1)
Med-Level: talk specifically in one or two domains, describe the process from upstream to downstream, and do a bit of coding. 50% Theory, 50% Practice @ 1 hour (Module 2)
Low-Level: talk fundamentally in one domain, answer the "how" question, get ready to get your hands dirty. 0% Theory, 100% Practice @ 1 hour (Module 3)
01 High-level discussion: Data Science, an unpredictable tale. How to get inspired by data science from a different perspective.
02 Med-level discussion: an end-to-end process of how machine learning works. Try to code.
03 Hands-on real industry case: get your hands dirty on data forecasting and predictive modelling.
Entry: understand the differences between Supervised, Unsupervised, and Deep Learning. Know how to determine and present the best model. Early level of EDA (Exploratory Data Analysis).
Medior: understand how to increase model accuracy, handle data problems (imbalanced data, missing values), and be proficient in model selection. Able to do Feature Engineering.
Senior: expert level of increasing model accuracy (new Deep Learning architectures), very proficient at handling data problems, able to propose new algorithms for different data cases.
From there: Managerial Route or Technical Route.
Say hello to Machine Learning: get acquainted with the Machine Learning types, such as Supervised Learning and Unsupervised Learning.
Tools and End-to-End Process: get in touch with the end-to-end process of doing Machine Learning, along with the most recent tools.
Data, Metrics, and Model Selection: understand the mechanism of data behavior and model selection through its metrics.
Commonly used terms
Data: can be numeric/alphabetic/picture/video/etc.
Machine Learning Model: a simplified program that can be taught with data (input) to predict an output.
Commonly used terms
Train Data: data used to train the ML model.
Testing Data: data used to test the ML model's accuracy.
Feature Engineering: using specific domain knowledge to produce new features based on existing features.
Traditional Programming: Input (Data) + Static Code/Syntax → Output (Data)
Machine Learning Programming: Input (Data) + Output (Data) → Train ML Model (Learn) → ML Model (Program) → Prediction (from new Input)
Traditional Programming
Input: Height = [145, 154, 177, 150, 170]
Static Program (IF logic):
IF Height < 150 then Short
IF Height >= 150 and Height <= 175 then Average
IF Height > 175 then Tall
Output: [Short, Average, Tall, Average, Average]
Machine Learning Programming
Input: Height = [145, 154, 177, 150, 170], Classification_Label = [Short, Average, Tall, Average, Average]
Machine Learning Algorithm: train the model using Height and Classification_Label
Prediction using the trained ML model: New Height = 190 → New Classification Label = Tall
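To make the contrast concrete, here is a minimal Python sketch of the same height example, assuming scikit-learn is available; the DecisionTreeClassifier is just one possible choice of learner.

```python
from sklearn.tree import DecisionTreeClassifier

heights = [145, 154, 177, 150, 170]

# Traditional programming: the rules are written by hand.
def classify_height(h):
    if h < 150:
        return "Short"
    elif h <= 175:
        return "Average"
    else:
        return "Tall"

print([classify_height(h) for h in heights])
# ['Short', 'Average', 'Tall', 'Average', 'Average']

# Machine learning programming: the rules are learned from input + output pairs.
X = [[h] for h in heights]                                # input (predictor)
y = ["Short", "Average", "Tall", "Average", "Average"]    # output (label)
model = DecisionTreeClassifier().fit(X, y)                # train the model

print(model.predict([[190]]))                             # expected: ['Tall']
```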
Supervised Machine Learning: teach the ML model using Predictor Variable X and Label Variable Y.
Unsupervised Machine Learning: teach the ML model using Predictor Variable X only, and let the model find structure (e.g., groupings) without a Label Variable Y.
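A small illustrative sketch of the difference, assuming scikit-learn; LogisticRegression and KMeans stand in for any supervised and unsupervised algorithm.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.cluster import KMeans

X = np.array([[145], [154], [177], [150], [170]])   # predictor variable X
y = np.array([0, 1, 2, 1, 1])                       # label variable Y (0=Short, 1=Average, 2=Tall)

# Supervised: the model is taught with both X and Y.
clf = LogisticRegression(max_iter=1000).fit(X, y)
print(clf.predict([[190]]))

# Unsupervised: the model sees only X and groups similar rows by itself.
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
print(labels)
```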
1. Ingestion and Analysis (50%): starts from collecting data, preprocessing, and doing exploratory data analysis.
2. Modeling and Evaluation (25%): utilizing machine learning algorithms to build an automation, and evaluating the built model's accuracy.
3. Adjustment and Deployment (25%): store the model weight matrices in a microservice and create an architecture workflow for the data pipeline, ready to deploy.
1. Research Hypothesis: conduct the research flow along with the hypotheses that might solve the problems.
2. Data Query Schema: determine which data to take, which tables, which features, etc.
3. Data Retrieval: retrieve data following the query schema; it could come from a data warehouse or from scraping the internet.
4. Data Preprocessing: clean the whole dataset, e.g., control outliers, transform or standardize, handle null values, etc. (see the sketch below).
5. Analysis and Visualization: analyze the preprocessed data and support or refute the research hypotheses by visualizing them with graphs or descriptive statistics.
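A hedged pandas sketch of step 4 (Data Preprocessing); the tiny DataFrame and column names ("price", "city") are hypothetical, not the workshop's dataset.

```python
import pandas as pd

df = pd.DataFrame({"price": [10, 12, None, 11, 500],
                   "city":  ["A", "B", "B", None, "A"]})

# Null value handling: fill numeric gaps with the median, categorical with the mode.
df["price"] = df["price"].fillna(df["price"].median())
df["city"] = df["city"].fillna(df["city"].mode()[0])

# Outlier control: clip extreme prices to the 1st-99th percentile range.
low, high = df["price"].quantile([0.01, 0.99])
df["price"] = df["price"].clip(low, high)

# Transform / standardize: z-score the numeric column.
df["price_std"] = (df["price"] - df["price"].mean()) / df["price"].std()
print(df)
```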
1. Research Methodology: choose a proper ML algorithm for the research objective.
2. Feature Engineering: produce new features from existing features.
3. Train Machine Learning Model: train the ML model on the train data.
4. Model Evaluation: evaluate model accuracy using the test data.
5. Model Selection: select the best model by highest accuracy/interpretability.
(A compact sketch of these steps follows below.)
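The sketch below walks through the steps with scikit-learn; the built-in breast cancer dataset and the two candidate models are only illustrative.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# 1. Research methodology: the objective is classification, so pick classifiers.
X, y = load_breast_cancer(return_X_y=True)

# 2. Feature engineering would add new columns here; we keep the raw features.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

candidates = {
    "logreg": LogisticRegression(max_iter=5000),
    "forest": RandomForestClassifier(random_state=42),
}

# 3 & 4. Train each ML model on the train data, evaluate on the held-out test data.
scores = {}
for name, model in candidates.items():
    model.fit(X_train, y_train)
    scores[name] = accuracy_score(y_test, model.predict(X_test))

# 5. Model selection: pick the highest accuracy.
best = max(scores, key=scores.get)
print(scores, "-> selected:", best)
```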
1. Deployment to Production: store the model weight matrices in a container that runs their requirements and dependencies.
2. Adjustment and Communication: ensure the model pipeline runs smoothly from upstream to downstream.
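One minimal way to persist a trained model so a container can load and serve it, assuming joblib is installed; the file name "model.joblib" and the iris model are just examples.

```python
import joblib
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000).fit(X, y)

joblib.dump(model, "model.joblib")    # store the fitted model (its learned weights)
loaded = joblib.load("model.joblib")  # the serving container loads it at startup
print(loaded.predict(X[:1]))
```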
Value Gap / Normalization Technique: normalizing ensures that features do not have massively different scales and variances, which would make optimization hard to converge. Two common options are the min-max scaler and standardization to a normal distribution (see below).
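A small sketch of the two scaling options named above, assuming scikit-learn.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X = np.array([[145.0], [154.0], [177.0], [150.0], [170.0]])

# Min-max scaler: squeezes every value into the [0, 1] range.
print(MinMaxScaler().fit_transform(X).ravel())

# Standard scaler: rescales to zero mean and unit variance ("normal distribution" style).
print(StandardScaler().fit_transform(X).ravel())
```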
Train Test Split
Train: the part on which your ML algorithms are actually trained to build a model (60% of your data).
Validation: used to validate your various model fits (20% of your data).
Test: used to test your model hypothesis; left untouched and unseen until the model and hyperparameters are decided (20% of your data).
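A 60/20/20 split can be built from two calls to scikit-learn's train_test_split; the toy arrays here are only for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

X, y = np.arange(100).reshape(50, 2), np.arange(50)

# First hold out 20% as the untouched test set.
X_temp, X_test, y_temp, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Then split the remaining 80% into 60% train and 20% validation (0.25 * 0.8 = 0.2).
X_train, X_val, y_train, y_val = train_test_split(X_temp, y_temp, test_size=0.25, random_state=42)

print(len(X_train), len(X_val), len(X_test))  # 30 10 10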
MAPE = (100% / n) * Σ | (actual_i − predicted_i) / actual_i |
Example, Model A evaluation:
MAPE = 7.9% ~ on average, the model is only 7.9% off when predicting Y. Distance principle: "the lower, the better".
R² = 88% ~ the given variables X explain 88% of the variance of the target variable Y. Percentage principle: it has a rule, the closer to 100% the better.
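Both metrics can be computed with scikit-learn (mean_absolute_percentage_error needs version 0.24+); the prediction values below are made up for illustration.

```python
import numpy as np
from sklearn.metrics import mean_absolute_percentage_error, r2_score

y_true = np.array([100, 120, 150, 130])
y_pred = np.array([ 92, 125, 160, 128])

mape = mean_absolute_percentage_error(y_true, y_pred)  # returns a fraction; x100 for percent
r2 = r2_score(y_true, y_pred)
print(f"MAPE = {mape:.1%}, R^2 = {r2:.2f}")
```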
False Positive (FP): predict an event when there is no event (bad).
False Negative (FN): predict no event when there is an event (bad).
True Positive (TP): predict an event when there is an event (good).
True Negative (TN): predict no event when there is no event (good).
Case: pregnancy as the event, a person (man/woman).
Logic: a man can't be pregnant; a woman can be pregnant.
FP: the ML model predicts a man is pregnant.
FN: the ML model predicts a woman is not pregnant (but in reality she is pregnant).
False Positive (FP): predicted the man is pregnant, but he is actually not pregnant.
False Negative (FN): predicted the woman is not pregnant, but she is actually pregnant.
True Positive (TP): predicted the woman is pregnant, and she is actually pregnant.
True Negative (TN): predicted the man is not pregnant, and he is actually not pregnant.
Data, Metrics, and Model Selection: metrics and their derivatives (categorical).
Precision = TP / (TP + FP) → stay aggressive
Recall = TP / (TP + FN) → stay careful
Precision: how often the events we predict are real events; it penalizes false positives (every predicted event should truly exist).
Recall: how many of the real events our model catches; it penalizes false negatives (better to flag an event than to miss one).
----------------------------------------------------------------------------------------------
Another example: a rain prediction for a man.
False Positive (FP): the man is told to bring an umbrella, but there is actually no rain the whole day.
False Negative (FN): the man is told not to bring an umbrella, but it actually rains the whole day.
If you were a businessman, which risk would you minimize first: FP or FN?
Data, Metrics, and Model Selection: metrics and their derivatives (categorical).
Precision = TP / (TP + FP) → stay aggressive
Recall = TP / (TP + FN) → stay careful
Precision: how often the events we predict are real events; every predicted event should be correct (penalizes false positives).
Recall: how many of the real events our model catches; better to flag an event than to miss one (penalizes false negatives).
----------------------------------------------------------------------------------------------
Another example: an email spam flagger.
False Positive (FP): an email is flagged as spam by the system, but it is actually not a spam message at all.
False Negative (FN): an email is not flagged as spam by the system, but it is actually spam and full of phishing links.
If you were a businessman, which risk would you minimize first: FP or FN?
Precision = TP / (TP + FP) → stay aggressive
Recall = TP / (TP + FN) → stay careful
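The same quantities can be read off a confusion matrix with scikit-learn; the labels below are a hypothetical spam-flagger run (1 = spam).

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

y_true = [1, 0, 1, 1, 0, 0, 1, 0]   # actual: spam or not
y_pred = [1, 0, 0, 1, 1, 0, 1, 0]   # what the model flagged

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print("TP:", tp, "FP:", fp, "FN:", fn, "TN:", tn)
print("precision =", precision_score(y_true, y_pred))  # TP / (TP + FP)
print("recall    =", recall_score(y_true, y_pred))     # TP / (TP + FN)
```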
Bias-Variance Tradeoff
Bias is the difference between the average prediction of our model and the correct value we are trying to predict.
Variance is the variability of the model's predictions for a given data point; it tells us how spread out the predictions are.
Underfitting
- The model is unable to capture the underlying pattern of the data
- High bias, low variance
- Usually caused by too little training data
- or by a model that is too simple and has very few parameters
Overfitting
- The model captures the noise along with the underlying pattern in the data
- Low bias, high variance
- Often caused by training heavily on a noisy dataset
- or by a model that is too complex and has too many parameters
(A small under/overfitting sketch follows below.)
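An illustrative sketch of both failure modes, assuming scikit-learn: the same noisy data fit with a too-simple, a reasonable, and a too-complex polynomial model; compare train vs test R².

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

rng = np.random.RandomState(0)
X = np.sort(rng.uniform(0, 3, 40)).reshape(-1, 1)
y = np.sin(2 * X).ravel() + rng.normal(scale=0.2, size=40)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

for degree in (1, 4, 15):  # too simple (underfit), about right, too complex (overfit)
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_train, y_train)
    print(degree,
          "train R2:", round(model.score(X_train, y_train), 2),
          "test R2:", round(model.score(X_test, y_test), 2))
```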
Closing notes for a data scientist in this era:
Take a serious focus on Explainable AI: interpretable ML is good, but explainable ML matters most (this skill set is one of the most promising fields of ML).
Know your Domain Science: Data Science is an iterative process, and anyone can become a DS by following the guided process. If you want to stand out, show your domain expertise.
Crown your impact, not your certificate(s): a long list of courses is awesome, but most importantly show the impact of your side projects/analyses, impact that can be quantified and is strong proof of a Data Scientist.