Data Science Methodology

A gentle introduction to Data Science Methodology, along with an explanation of the frameworks commonly used at current tech companies.

Fiqry Revadiansyah

June 28, 2020

Transcript

  1. Trainer Bio: Fiqry Revadiansyah

    Work Experience: Data Scientist, Bukalapak (2018 – Present); Technical Content Reviewer, Packt Publishing (2019 – Present). Teaching Experience: Guest Lecturer, MB1201 Business Statistics at SBM ITB (April 2020); Part-time Teacher, Data Science, Purwadhika (2019); Workshop speaker: Introduction to ML for DS (June 2020) and Statistics for Business Analytics (Nov 2019). Email: [email protected] | LinkedIn: https://www.linkedin.com/in/fiqryrevadiansyah
  2. So, this is not a planting class, right? What is the relation to Data Science?
  3. Analogy of Farming

    The Data Science Methodology is an iterative system of methods that guides data scientists on the ideal approach to solving problems with data science, through a prescribed sequence of steps.
  4. Data Science Methodology

    1. Define the Goal, and Choose the Road 2. Gather the Resource, and Identify Its Characteristics 3. Do the Plan, and Evaluate It 4. Market to Public, and Regather the Opinion. Stages: Preparation Stage, Data Related Stage, Drive the Solution Stage.
  5. 1. Business Understanding: What problem do you want to act on?

    As a farmer: gain more profit, live healthily. As a data scientist: increase visitor numbers, reduce fraudulent acts.
  6. 1. Business Understanding: What problem do you want to act on?

    Impression • Number of Visits • Number of Active Visitors. Budget • Cost per Product • Cost of Acquisition. Users • Number of Retained Users • Satisfaction Score.
  7. 2. Analytic Approach: Which lane do you prefer to take?

    As a farmer: choose a planting method (hydroponics or aquaponics). As a data scientist: choose an analytics method (predictive modeling with ML, or diagnostic analysis).
  8. 2. Analytic Approach: Which lane do you prefer to take?

    If the question is to determine the probability of an action • Use a predictive model. If the question is to show relationships • Use a descriptive model. If the question requires a yes/no answer • Use a classification model.
  9. 3. Data Requirements: What kind of resource do you need?

    As a farmer: prepare the tools and seeds needed to achieve the farming goal. As a data scientist: choose the kind of data that might solve the problem.
  10. 4. Data Collection: How do you collect the resource?

    As a farmer: select a shop to purchase seeds from, either local or online. As a data scientist: determine the data sources and collect the data from them.
  11. 4. Data Collection: How do you collect the resource?

    Local/Internal • User Data • Traffic History Data. Public/External • Open Data Repository. Scraping • Social Media • Website.
  12. 5. Data Understanding: What does the resource tell you?

    As a farmer: get to know what is happening in the soil, the plants, and the environment. As a data scientist: get to know what the data tells us about the problem, and visualize it.
  13. 5. Data Understanding: What does the resource tell you?

    Data Visualization: discover trends, patterns, and other relevant relationships in the data. Descriptive Statistics: decipher aggregate information such as the mean, median, missing values, etc. Funnel Analysis: uncover hidden information.
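As a minimal sketch of the descriptive-statistics step, the snippet below summarizes a small list of numbers using only the Python standard library. The `daily_visits` values and the `describe` helper are hypothetical and only illustrate the idea of checking the mean, median, and missing values before modeling.

```python
from statistics import mean, median

# Hypothetical daily visit counts; None marks a missing value.
daily_visits = [120, 135, None, 150, 110, None, 160]


def describe(values):
    """Return basic descriptive statistics, ignoring missing (None) entries."""
    present = [v for v in values if v is not None]
    return {
        "count": len(present),                   # number of usable observations
        "missing": len(values) - len(present),   # number of missing observations
        "mean": mean(present),
        "median": median(present),
    }


summary = describe(daily_visits)
print(summary)
```

In practice a data frame library would usually provide the same summary in one call; the plain-Python version is shown only to keep the example self-contained.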
  14. 6. Data Preparation: What has to be done before taking action?

    As a farmer: prepare suitable soil for the selected plants, and set up the growing medium well. As a data scientist: handle data problems such as missing values, duplicates, and others.
  15. 6. Data Preparation: What has to be done before taking action?

    Data problems to handle: Missing Values, Duplicated Data, Irregular Format, Imbalanced. Result: Clean Data.
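The sketch below shows one possible way to handle three of the problems named on this slide (missing values, duplicated data, irregular format) in plain Python. The `raw_rows` records, the `clean` helper, and the choice of mean imputation are all assumptions made for illustration; class imbalance is not covered here.

```python
# Hypothetical raw user records with typical data problems.
raw_rows = [
    {"user_id": "1", "city": " jakarta ", "age": "25"},
    {"user_id": "2", "city": "Bandung", "age": ""},       # missing age
    {"user_id": "1", "city": " jakarta ", "age": "25"},   # duplicated user
    {"user_id": "3", "city": "BANDUNG", "age": "31"},     # irregular casing
]


def clean(rows):
    """Normalize formats, drop repeated user_ids, and fill missing ages."""
    seen, cleaned = set(), []
    for row in rows:
        if row["user_id"] in seen:   # skip rows whose user_id was already seen
            continue
        seen.add(row["user_id"])
        cleaned.append({
            "user_id": int(row["user_id"]),
            "city": row["city"].strip().title(),             # fix irregular format
            "age": int(row["age"]) if row["age"] else None,  # empty string = missing
        })
    # Impute missing ages with the rounded mean of the known ages
    # (one simple choice among many).
    known = [r["age"] for r in cleaned if r["age"] is not None]
    fill = round(sum(known) / len(known))
    for r in cleaned:
        if r["age"] is None:
            r["age"] = fill
    return cleaned


clean_rows = clean(raw_rows)
for row in clean_rows:
    print(row)
```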
  16. 7. Modeling: How do you make a model from your data to solve the problem?

    As a farmer: plant the seeds, water the plants, etc. As a data scientist: model the data through a machine learning process.
  17. 7. Modeling: How do you make a model from your data to solve the problem?

    Choose ML Model: determine the model based on the expected output (classification or regression). Model Iteration: iterate the modeling process with K-fold cross-validation. Ensemble Model: combine ML models to gain better model accuracy.
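To make the K-fold idea concrete, here is a minimal sketch that splits sample indices into k folds, so that each fold serves once as the test set while the rest is used for training. The `k_fold_indices` helper is hypothetical; it uses contiguous folds without shuffling, whereas real projects typically shuffle first or rely on a library implementation.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) for each of k contiguous folds."""
    # Spread the remainder so that fold sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, test
        start += size


folds = list(k_fold_indices(n_samples=10, k=3))
for train, test in folds:
    print("train:", train, "test:", test)
```

A model would be trained and scored once per fold, and the k scores averaged to estimate how well it generalizes.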
  18. 8. Evaluation: Has the model already answered the problem, or does it need to be improved?

    As a farmer: inspect the plants. Are they free from pests and disease? As a data scientist: does the model fit with good accuracy, or should it be enhanced?
  19. 8. Evaluation: Has the model already answered the problem, or does it need to be improved?

    Model Interpretation: interpret the model results so that other people can understand them. Model Evaluation: validate the model performance with metrics suited to the problem type (accuracy, precision, recall, RMSE, etc.).
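The metrics named on this slide can be sketched in a few lines of plain Python. The labels, predictions, and helper names below are hypothetical; accuracy, precision, and recall apply to classification problems (here binary, with 1 as the positive class), while RMSE applies to regression.

```python
import math


def classification_metrics(y_true, y_pred):
    """Compute accuracy, precision, and recall for binary labels (1 = positive)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    correct = sum(1 for t, p in pairs if t == p)
    return {
        "accuracy": correct / len(pairs),
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }


def rmse(y_true, y_pred):
    """Root mean squared error for numeric predictions."""
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Hypothetical classification results.
labels = [1, 0, 1, 1, 0, 0, 1, 0]
preds = [1, 0, 0, 1, 0, 1, 1, 0]
scores = classification_metrics(labels, preds)
print(scores)

# Hypothetical regression results.
error = rmse([3.0, 5.0, 2.0], [2.0, 5.0, 4.0])
print(error)
```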
  20. 9. Deployment: Can you apply the model in real life?

    As a farmer: sell the vegetables to the market or store. As a data scientist: integrate the ML model into the production ecosystem.
  21. 10. Feedback: Is there any input on your business solution?

    As a farmer: gather suggestions and comments from our customers. As a data scientist: collect feedback from various parties such as end users, stakeholders, etc.
  22. 10. Feedback: Is there any input on your business solution?

    Metrics, before vs. after ML model deployment: Daily Active Users: 1000 before, 1600 after (+60%). Cost Spent: 1 mio/month before, 500k/month after (-50%). Revenue Gain: 10 mio before, 30 mio after (+200%). SLA: 3 days before, 2 days after (-33%).
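The before/after comparison on this slide is a relative-change calculation, which can be sketched as follows. The `percent_change` helper and the dictionary layout are hypothetical, and "1 mio" is assumed to mean one million. Under this formula a move from 10 to 30 is a +200% change (30 is 300% of 10).

```python
def percent_change(before, after):
    """Relative change from `before` to `after`, in percent."""
    return (after - before) * 100 / before


# (before, after) pairs taken from the slide; units are noted in the keys.
metrics = {
    "Daily Active Users": (1000, 1600),
    "Cost Spent (per month)": (1_000_000, 500_000),
    "Revenue Gain (mio)": (10, 30),
    "SLA (days)": (3, 2),
}

changes = {name: percent_change(b, a) for name, (b, a) in metrics.items()}
for name, change in changes.items():
    print(f"{name}: {change:+.0f}%")
```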
  23. In short, Data Science Methodology is…

    Define the Goal, and Choose the Road • What problem do you want to act on? • Which lane do you prefer to take? Gather the Resource, and Identify Its Characteristics • What kind of resource do you need? • How do you collect the resource? • What does the resource tell you? • What has to be done before taking action? Do the Plan, and Evaluate It • How do you make a model from your data to solve the problem? • Has the model already answered the problem, or does it need to be improved? Market to Public, and Regather the Opinion • Can you apply the model in real life? • Is there any input on your business solution?
  24. Let's take a simple exercise: YouTube Case

    Define the Goal, and Choose the Road • What problem do you want to act on? • Which lane do you prefer to take? Gather the Resource, and Identify Its Characteristics • What kind of resource do you need? • How do you collect the resource? • What does the resource tell you? • What has to be done before taking action? Do the Plan, and Evaluate It • How do you make a model from your data to solve the problem? • Has the model already answered the problem, or does it need to be improved? Market to Public, and Regather the Opinion • Can you apply the model in real life? • Is there any input on your business solution?
  25. Let's take a simple exercise: Jenius Case

    Define the Goal, and Choose the Road • What problem do you want to act on? • Which lane do you prefer to take? Gather the Resource, and Identify Its Characteristics • What kind of resource do you need? • How do you collect the resource? • What does the resource tell you? • What has to be done before taking action? Do the Plan, and Evaluate It • How do you make a model from your data to solve the problem? • Has the model already answered the problem, or does it need to be improved? Market to Public, and Regather the Opinion • Can you apply the model in real life? • Is there any input on your business solution?
  26. Let's take a simple exercise: Gojek Case

    Define the Goal, and Choose the Road • What problem do you want to act on? • Which lane do you prefer to take? Gather the Resource, and Identify Its Characteristics • What kind of resource do you need? • How do you collect the resource? • What does the resource tell you? • What has to be done before taking action? Do the Plan, and Evaluate It • How do you make a model from your data to solve the problem? • Has the model already answered the problem, or does it need to be improved? Market to Public, and Regather the Opinion • Can you apply the model in real life? • Is there any input on your business solution?
  27. Product Management: AAARRR Model

    Journey Prototype: describe the end-to-end process of how customers go from a first impression to producing a revenue stream. Develop Persistently: focus on the channels where users drop off, and constantly evaluate them with the whole team (engineering, UX, data, etc.). Data Driven! Channels are evaluated based on data, so analytics is needed to enhance the decision-making process here.
  28. Product Design: Double Diamond Design Thinking

    Careful on Iteration: a belief in double-checking progress, starting from a helicopter view and ending at the ant view. Exploration to Action: focus on exploring the situation first, define hypotheses based on pain points, develop a product to solve them, and deliver it to evaluate. Data Driven! From beginning to end, data is used to tell the story about our customers.
  29. Business Development: Business Model Canvas

    Plan on Demand: list all the funnels of the business process, and set the subject of every key point to ensure reliability. Customer Satisfaction: aside from the streams, this model also focuses on customer growth, such as how to maintain the relationship and how to segment customers. Data Driven! Data is always needed to recap every key point of this model.
  30. Engineering: Agile Project Management

    Iterative Approach: managing software development projects with a focus on continuous releases and on incorporating customer feedback in every iteration. Scrum and Kanban: Scrum focuses on fixed-length project iterations; Kanban focuses on continuous releases. Data Driven! Data is needed to track and evaluate the process.
  31. Data Driven Framework: Data on Stakeholder Expectations

    “Okay, so what?” “OK, Thanks” “Seems interesting” “Good, I think we need to do an AB Test” “Impressive! Let's do A, B, C tomorrow!”
  32. Data utilization through stakeholder expectations: “Okay, so what?”

    “The Actionless Data”: data given in a rough format, without any particular advantage. Examples: User Table, Transaction Table.
  33. Data utilization through stakeholder expectations: “OK, Thanks”

    “The Labeled Data”: data with a particular group or label; aggregated information. Examples: Daily Transacting Users; the average production cost in June 2020.
  34. Data utilization through stakeholder expectations: “Seems interesting”

    “The Data Combination”: interconnected labeled data brings a better point of view. Examples: transaction count by user location; conversion rate (CVR) of Voucher X.
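The two examples on this slide can be sketched as simple aggregations. The transaction records, the voucher counts, and the `conversion_rate` helper below are hypothetical; CVR is assumed here to mean converted users divided by users exposed to the voucher.

```python
from collections import Counter

# Hypothetical transaction records joined with the user's location.
transactions = [
    {"user_id": 1, "location": "Jakarta"},
    {"user_id": 2, "location": "Bandung"},
    {"user_id": 1, "location": "Jakarta"},
    {"user_id": 3, "location": "Surabaya"},
    {"user_id": 4, "location": "Jakarta"},
]

# Count transactions by user location.
tx_by_location = Counter(t["location"] for t in transactions)
print(tx_by_location.most_common())


def conversion_rate(converted, exposed):
    """Share of exposed users who converted; 0.0 when nobody was exposed."""
    return converted / exposed if exposed else 0.0


# Hypothetical numbers: 900 users saw Voucher X and 45 of them redeemed it.
cvr = conversion_rate(converted=45, exposed=900)
print(f"CVR of Voucher X: {cvr:.1%}")
```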
  35. Data utilization through stakeholder expectations: “Good, I think we need to do an AB Test”

    “The Most Significant Data”: aggregated or interconnected data that acts as the company's main metrics. Examples: retention cohort of total customers from city X; budget allocation of Product X based on customer group.
  36. Data utilization through stakeholder expectations: “Impressive! Let's do A, B, C tomorrow!”

    “The Data Guru”: the most significant findings, interconnected and used as a funnel to answer the missing gaps (Funnel Type). Examples: retention cohort based on user type, country of origin, etc.; combination of product type, location, and revenue.
  37. Product Management

    Stages: Awareness, Acquisition, Activation, Retention, Referral, Revenue. Examples: Trending Topics, Content Uniqueness, Content Quality, Content Stickiness, Gift Away, Ads.
  38. Product Management Awareness Acquisition Activation Retention Referral Revenue Impression Trial

    Register Flexi Cash Monthly User Gift Promotions Interest Percentage