Pitfalls in Data Science Projects (and how to avoid them)

Slides for my talk given at the 9th PyData Southampton meetup (September 2024)
https://www.meetup.com/pydata-southampton/events/303138206

Abstract:
This talk discusses the successful delivery of data science projects, by analysing common mistakes that can hamper your plans and hurt your chances of success. It’s commonly acknowledged that a huge proportion of data science projects fail, for a variety of reasons. By sharing some lessons learned the hard way, we’ll look at frequent anti-patterns and how to elude them.

This presentation is mainly aimed at early-career data scientists, but also at anyone who has seen their data science projects going south and would like to improve the outcome of their work.

Marco Bonzanini

September 18, 2024

Transcript

  1. © Bonzanini Consulting Ltd — BonzaniniConsulting.com Nice to meet you

    • Dr Marco Bonzanini • NLP and Data Science stuff • Consulting, training and coaching on Python + Data Science • Former Chair @ PyData London
  2. Business and Tech Out of sync

    Shiny Object Syndrome • Data Quality • No Route to Deployment
  4. Impact vs Effort

    [Chart: Impact plotted against Effort. High impact, low effort: 👍. Low impact, high effort: 👎.]
  5. Impact vs Effort

    [Chart: Impact plotted against Effort, with points marked ✅, ❌ or 🤷]
  9. Align with Business

    De-risk • Business case: PoC vs Proof-of-Value • Stakeholders: buy-in + involvement • Data availability: first mile data + additional data • Data quality, coverage and volume? • Deployment, integration, scalability?
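
The "data quality, coverage and volume?" question can be answered early with a very small profiling script. The following is only a sketch under stated assumptions: the function name `profile_records`, the field names and the sample rows are all hypothetical and do not come from the talk.

```python
from collections import Counter


def profile_records(records, required_fields):
    """Summarise volume, per-field coverage and duplicate rows for a list of dicts."""
    n = len(records)
    coverage = {}
    for field in required_fields:
        # A value counts as present unless it is None or an empty string
        present = sum(1 for r in records if r.get(field) not in (None, ""))
        coverage[field] = present / n if n else 0.0
    # Duplicates: rows that repeat an earlier row on all required fields
    keys = Counter(tuple(r.get(f) for f in required_fields) for r in records)
    duplicates = sum(count - 1 for count in keys.values())
    return {"volume": n, "coverage": coverage, "duplicates": duplicates}


# Hypothetical sample data, for illustration only
sample = [
    {"customer_id": 1, "country": "UK", "spend": 120.0},
    {"customer_id": 2, "country": None, "spend": 80.0},
    {"customer_id": 2, "country": None, "spend": 80.0},
    {"customer_id": 3, "country": "IT", "spend": None},
]
report = profile_records(sample, ["customer_id", "country", "spend"])
print(report)
```

Running a check like this before any modelling work gives stakeholders concrete numbers to discuss while the project is still being scoped.
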
  12. [Diagram: pipeline with boxes labelled Raw Data, Data Validation, Data Processing, Model Training, Model Validation, Deployment, Model Serving Endpoint]

    “Let’s get the best accuracy” • Feature engineering • Model building • Hyperparameter tuning • […]
  14. End-to-end Quickly

    Iterative Improvement • Promote early feedback • Avoid early complexity: - Difficult to diagnose - Delay feedback - Hide bigger risks • Optimise for business value, not “accuracy”
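
To make "optimise for business value, not accuracy" concrete, here is a minimal sketch that compares two models on both measures. Every number below is hypothetical (the confusion-matrix counts and the value and cost figures are invented for illustration), and the simple linear value formula is an assumption, not something stated in the talk.

```python
def accuracy(tp, fp, fn, tn):
    """Fraction of all cases classified correctly."""
    return (tp + tn) / (tp + fp + fn + tn)


def business_value(tp, fp, fn, tn, value_tp, cost_fp, cost_fn):
    """Net value: gains from true positives minus the cost of both error types."""
    return tp * value_tp - fp * cost_fp - fn * cost_fn


# Hypothetical counts for two models on the same 1,000 validation cases
model_a = dict(tp=20, fp=10, fn=30, tn=940)  # conservative: few alerts
model_b = dict(tp=45, fp=60, fn=5, tn=890)   # catches more, raises more false alarms

# Hypothetical economics: a caught case is worth 100,
# a false alarm costs 5, a missed case costs 100
economics = dict(value_tp=100, cost_fp=5, cost_fn=100)

for name, counts in [("A", model_a), ("B", model_b)]:
    acc = accuracy(**counts)
    value = business_value(**counts, **economics)
    print(f"model {name}: accuracy={acc:.3f} value={value}")
```

Under these assumed costs, model A has the higher accuracy but a negative net value, while model B is less accurate and worth more. The ranking depends entirely on the cost figures, which is why they need to come from the business side.
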
  17. [Diagram: pipeline with boxes labelled Raw Data, Data Validation, Data Processing, Model Training, Model Validation, Deployment, Model Serving Endpoint]

    What got you here won’t get you there
  21. Testing Packaging

    • Don’t ignore good software engineering principles • Testing: unit testing, integration testing • Code re-usability, DRY (Don’t Repeat Yourself), Single Responsibility Principle • Ditch the notebooks as soon as you: - struggle testing some component - would like to “import from another notebook”
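
As an illustration of moving logic out of a notebook and under test, the sketch below shows a single-purpose function with pytest-style unit tests. The function `clean_text` and its behaviour are hypothetical examples, not code from the talk. In a real project the function would live in an importable module and the tests in a separate test file.

```python
import re


def clean_text(text: str) -> str:
    """Lowercase, replace punctuation with spaces and collapse whitespace."""
    text = text.lower()
    text = re.sub(r"[^\w\s]", " ", text)
    return re.sub(r"\s+", " ", text).strip()


# pytest discovers functions whose names start with "test_"
def test_clean_text_removes_punctuation():
    assert clean_text("Hello, World!") == "hello world"


def test_clean_text_collapses_whitespace():
    assert clean_text("  too   many\tspaces ") == "too many spaces"


def test_clean_text_empty_input():
    assert clean_text("") == ""
```

Once the function is in a module, both the notebook and the production pipeline can import the same tested code instead of copying cells around.
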
  23. Code Reviews Integration

    • Code reviews done right: foster collaboration - Spot check errors - Clarify the why’s - Knowledge share • Code reviews done wrong: hostile environment. To avoid it: - Ask open-ended questions - Use professional language - Nitpicks: labelled as such, kept to a minimum
  25. The First Mile

    Rarely a technology problem • Usually a planning + communication problem

    The Last Mile

    Sometimes a technology problem • Still a planning + communication problem
  26. Thank You

    • LinkedIn: https://www.linkedin.com/in/marcobonzanini/ • Blog: marcobonzanini.com • Newsletter: marcobonzanini.com/newsletter