2025-05-31-pycon_italia

Sofie Van Landeghem

May 31, 2025

Transcript

  1. Data doesn't lie, but it can mislead

    How to ensure integrity of your ML applications
  2. Link facts to given news articles

    | Date | Race | Winner |
    | --- | --- | --- |
    | 4 May 2025 | Miami | Oscar Piastri |
    | 18 May 2025 | Emilia Romagna | Max Verstappen |
    | 25 May 2025 | Monaco | Lando Norris |
  3. NLP Filtering

    | Date | Race | Winner |
    | --- | --- | --- |
    | 28 July 2024 | Spa | Lewis Hamilton |

    | Date | Race | Winner |
    | --- | --- | --- |
    | 28 July 2024 | Spa | George Russell |
    | 28 July 2024 | Spa | Lewis Hamilton |
  4. Episode "Looking Out for Number 1"

    A narrative about George Russell "stepping up" into a leadership role at Mercedes …
    … narrating the Spa 2024 Grand Prix …
    … showing Russell on the podium …
    … and never even mentioning his disqualification.
  5. Obtaining gold-standard data

    Sir Lewis Carl Davidson Hamilton (born 7 January 1985) is a British [[racing driver]] who competes in [[Formula One]] for [[Scuderia Ferrari|Ferrari]].
  6. How most clients / users think about performance

    Scale: 0%, 50%, 100%
    The worst model: "Random guessing"
    The best model: "Entirely correct"
  7. In fact, we had to prune the database for efficiency requirements

    ❖ Kept only 14% of all Wikipedia concepts
    ❖ An "oracle" disambiguation obtains 84% F-score

    So, the modeling work targets a 0% - 84% range
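The idea of an "oracle" upper bound can be illustrated with a minimal sketch. It assumes an oracle that links a mention correctly whenever the gold concept survived pruning and abstains otherwise, so its precision is 1 and its recall equals the share of gold concepts still in the knowledge base. The function name and the toy data are hypothetical; the talk does not show how its 84% figure was computed.

```python
def oracle_f_score(gold_concepts, knowledge_base):
    """F-score of an oracle that always picks the gold concept when the KB contains it.

    Assumption: the oracle abstains when the gold concept was pruned away,
    so it never produces a wrong link (precision = 1).
    """
    total = len(gold_concepts)
    found = sum(1 for concept in gold_concepts if concept in knowledge_base)
    if total == 0 or found == 0:
        return 0.0
    precision = 1.0
    recall = found / total
    return 2 * precision * recall / (precision + recall)


# Hypothetical toy data: one of four gold concepts was pruned from the KB.
gold = ["Lewis Hamilton", "George Russell", "Spa-Francorchamps", "Oscar Piastri"]
pruned_kb = {"Lewis Hamilton", "George Russell", "Oscar Piastri"}

score = oracle_f_score(gold, pruned_kb)
print(f"Oracle F-score: {score:.3f}")
```

No model trained on top of this pruned knowledge base can exceed that score, which is why it serves as the realistic ceiling.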
  8. In our case, we won't assign "Hamilton" to a random page out of the 7M available ones
  9. Now, if we pick a random one from this candidate list, we actually obtain 54% F-score (without doing any ML at all!)
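To see why a random pick from a short candidate list already scores well, here is a rough sketch of the expected accuracy of such a baseline. The function name and the samples are made up for illustration, and it reports plain accuracy rather than the F-score used in the talk, whose exact computation is not given.

```python
def expected_random_accuracy(samples):
    """Expected accuracy when picking uniformly at random from each candidate list.

    `samples` is a list of (candidates, gold) pairs. A mention whose gold
    concept is missing from its candidate list can never be linked correctly.
    """
    if not samples:
        return 0.0
    total = 0.0
    for candidates, gold in samples:
        if gold in candidates:
            total += 1 / len(candidates)
    return total / len(samples)


# Hypothetical samples: (candidate concepts for the mention, gold concept).
samples = [
    (["Lewis Hamilton", "Alexander Hamilton"], "Lewis Hamilton"),  # 1 in 2
    (["George Russell"], "George Russell"),                        # always right
    (["Monaco", "Monaco GP", "AS Monaco", "Monaco (film)"], "Circuit de Monaco"),  # gold missing
]

baseline = expected_random_accuracy(samples)
print(f"Expected accuracy of a random pick: {baseline:.2f}")
```

Many mentions have only one or two plausible candidates, so the lower bound of the task sits far above 0%.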
  10. Consider "prior probabilities": a measure of how often a mention is linked to a certain concept

    | Textual mention | F1 race driver | 18th century military officer | Duke of Hamilton |
    | --- | --- | --- | --- |
    | "Hamilton" | 75% | 7% | 10% |
    | "Lewis Hamilton" | 99% | 0% | 0% |
    | "Alexander Hamilton" | 0% | 97% | 1% |
  11. Now extrapolate them to the extreme and use this as "predictions":

    | Textual mention | F1 race driver | 18th century military officer | Duke of Hamilton |
    | --- | --- | --- | --- |
    | "Hamilton" | 100% | 0% | 0% |
    | "Lewis Hamilton" | 100% | 0% | 0% |
    | "Alexander Hamilton" | 0% | 100% | 0% |
  12. The ML model that is supposed to disambiguate based on context (79%) only marginally improves upon a relatively simple baseline (78.2%)
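The prior-based baseline from slides 10 and 11 can be sketched in a few lines: for every mention, always predict the concept with the highest prior probability and ignore the context entirely. The probabilities below are the ones shown on slide 10; the dictionary layout and the function name are assumptions made for this illustration.

```python
# Prior probabilities from slide 10: mention text -> concept -> probability.
PRIORS = {
    "Hamilton": {
        "F1 race driver": 0.75,
        "18th century military officer": 0.07,
        "Duke of Hamilton": 0.10,
    },
    "Lewis Hamilton": {
        "F1 race driver": 0.99,
        "18th century military officer": 0.0,
        "Duke of Hamilton": 0.0,
    },
    "Alexander Hamilton": {
        "F1 race driver": 0.0,
        "18th century military officer": 0.97,
        "Duke of Hamilton": 0.01,
    },
}


def predict_by_prior(mention):
    """Return the concept with the highest prior, or None for unknown mentions."""
    candidates = PRIORS.get(mention)
    if not candidates:
        return None
    return max(candidates, key=candidates.get)


for mention in PRIORS:
    print(mention, "->", predict_by_prior(mention))
```

Because this baseline needs no training at all, any context-aware model should be judged by how much it adds on top of it.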
  13. Let's revisit our upper bound (again). Is it really 100% if we have no memory / efficiency requirements?

    Scale: 0 to 100%
    The best model: "Entirely correct"
  14. When multiple annotators label the same data sample, how often do they agree?

    ↪ Inter-annotator agreement = IAA
  15. High IAA: annotators mostly agree → the gold data is reliable and robust

    Low IAA: annotators disagree because…
    ★ The data is confusing
    ★ The label scheme is ambiguous
    ★ The NLP task is too complex
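The talk does not say which agreement metric it uses. As one common option for two annotators, the sketch below computes Cohen's kappa, which corrects the raw agreement rate for the agreement expected by chance. The annotations are invented for illustration.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labeled the same samples."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("Both annotators must label the same non-empty set of samples")
    n = len(labels_a)
    # Observed agreement: share of samples where both annotators chose the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: probability that both pick the same label independently.
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical annotations of four articles by two annotators.
annotator_a = ["F1", "F1", "Football", "Politics"]
annotator_b = ["F1", "Sports", "Football", "Politics"]

kappa = cohen_kappa(annotator_a, annotator_b)
print(f"Cohen's kappa: {kappa:.2f}")
```

A low value is a signal to inspect the data, the label scheme, or the task definition before training anything.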
  16. We need to label incoming articles for our portal

    1. Sports
       a. Football
       b. F1
    2. Politics
       a. Global politics
       b. US politics
    3. Leisure
       a. Traveling
       b. Sports
       c. Literature
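  17. Let's plot IAA of the labels…

    | | Football | F1 | Global politics | US Politics | Traveling | Sports | Literature |
    | --- | --- | --- | --- | --- | --- | --- | --- |
    | Football | 0.79 | | | | | | |
    | F1 | 0 | 0.73 | | | | | |
    | Global politics | 0 | 0 | 0.95 | | | | |
    | US Politics | 0 | 0.12 | 0.05 | 0.83 | | | |
    | Traveling | 0 | 0 | 0 | 0 | 1.0 | | |
    | Sports | 0.21 | 0.15 | 0 | 0 | 0 | 0.64 | |
    | Literature | 0 | 0 | 0 | 0 | 0 | 0 | 1.0 |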
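  18. Confusion between global and US politics?

    | | Football | F1 | Global politics | US Politics | Traveling | Sports | Literature |
    | --- | --- | --- | --- | --- | --- | --- | --- |
    | Football | 0.79 | | | | | | |
    | F1 | 0 | 0.73 | | | | | |
    | Global politics | 0 | 0 | 0.95 | | | | |
    | US Politics | 0 | 0.12 | 0.05 | 0.83 | | | |
    | Traveling | 0 | 0 | 0 | 0 | 1.0 | | |
    | Sports | 0.21 | 0.15 | 0 | 0 | 0 | 0.64 | |
    | Literature | 0 | 0 | 0 | 0 | 0 | 0 | 1.0 |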
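  19. Confusion between politics and F1?

    | | Football | F1 | Global politics | US Politics | Traveling | Sports | Literature |
    | --- | --- | --- | --- | --- | --- | --- | --- |
    | Football | 0.79 | | | | | | |
    | F1 | 0 | 0.73 | | | | | |
    | Global politics | 0 | 0 | 0.95 | | | | |
    | US Politics | 0 | 0.12 | 0.05 | 0.83 | | | |
    | Traveling | 0 | 0 | 0 | 0 | 1.0 | | |
    | Sports | 0.21 | 0.15 | 0 | 0 | 0 | 0.64 | |
    | Literature | 0 | 0 | 0 | 0 | 0 | 0 | 1.0 |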
  20. Solution: allow multiple correct labels

    1. Sports
       a. Football
       b. F1
    2. Politics
       a. Global politics
       b. US politics
    3. Leisure
       a. Traveling
       b. Sports
       c. Literature
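  21. Confusion within the sports category?

    | | Football | F1 | Global politics | US Politics | Traveling | Sports | Literature |
    | --- | --- | --- | --- | --- | --- | --- | --- |
    | Football | 0.79 | | | | | | |
    | F1 | 0 | 0.73 | | | | | |
    | Global politics | 0 | 0 | 0.95 | | | | |
    | US Politics | 0 | 0.12 | 0.05 | 0.83 | | | |
    | Traveling | 0 | 0 | 0 | 0 | 1.0 | | |
    | Sports | 0.21 | 0.15 | 0 | 0 | 0 | 0.64 | |
    | Literature | 0 | 0 | 0 | 0 | 0 | 0 | 1.0 |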
  22. Solution: critically revise the label scheme given to you!

    1. Sports
       a. Football
       b. F1
    2. Politics
       a. Global politics
       b. US politics
    3. Leisure
       a. Traveling
       b. Sports
       c. Literature

    Inherent ambiguity in the label scheme needs to be addressed
  23. Prediction ≠ Gold

    Is this really an incorrect prediction (a "false positive")?
    Or is the label wrongly annotated?
    Perhaps both answers can be seen as "correct"?
  24. Manually annotated training dataset:

    "... to kick off 2025 [Dutch Grand Prix] weekend in Zandvoort"
    "Gasly laments ‘quite sad’ [Monaco] GP crash"

    Sample prediction:
    "Gasly laments ‘quite sad’ [Monaco GP] crash"
  25. Always include the words "Grand Prix" or "GP" in the entity annotation:

    ❌ Gasly laments ‘quite sad’ [Monaco] GP crash
    ✅ Gasly laments ‘quite sad’ [Monaco GP] crash

    Write up annotation guidelines to help with consistency of annotations
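A guideline like this one can also be checked automatically. The sketch below flags an annotated span when "Grand Prix" or "GP" directly follows it but was left outside the annotation. The character-offset representation and the function name are assumptions made for this illustration, not the annotation tooling used in the talk.

```python
import re

# Matches "Grand Prix" or "GP" directly after the end of an annotated span.
_TRAILING_GP = re.compile(r"\s*(Grand Prix|GP)\b")


def violates_gp_guideline(text, span_end):
    """True if the text right after the span starts with "Grand Prix" or "GP"."""
    return _TRAILING_GP.match(text[span_end:]) is not None


text = "Gasly laments ‘quite sad’ Monaco GP crash"
start = text.index("Monaco")

# Annotation [Monaco]: "GP" was left outside the span, so it is flagged.
short_span_flagged = violates_gp_guideline(text, start + len("Monaco"))
# Annotation [Monaco GP]: follows the guideline, so it is not flagged.
full_span_flagged = violates_gp_guideline(text, start + len("Monaco GP"))

print(short_span_flagged, full_span_flagged)
```

Running such a check over the whole training set gives a quick list of annotations to revisit once the guideline is agreed on.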
  26. Is your data correct, consistent and robust?

    Is your test data representative of real-world data?
    Does your evaluation capture the value of the algorithm in production?
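One way to probe whether the data is correct and consistent, which the summary of this talk also recommends, is to "predict" the training data and look at where the model disagrees with its own labels. The sketch below counts such disagreements per (gold, predicted) pair. The `toy_predict` function is a hypothetical stand-in for a trained model, and the examples are invented.

```python
from collections import Counter


def training_disagreements(examples, predict):
    """Count (gold, predicted) pairs where predictions differ from training labels.

    Frequent pairs hint at structural annotation problems rather than
    one-off model mistakes.
    """
    pairs = Counter()
    for text, gold in examples:
        predicted = predict(text)
        if predicted != gold:
            pairs[(gold, predicted)] += 1
    return pairs.most_common()


def toy_predict(text):
    """Hypothetical stand-in for a trained classifier."""
    return "F1" if "wins" in text else "Literature"


# Hypothetical training examples: (text, gold label).
training_data = [
    ("Norris wins in Monaco", "F1"),
    ("Verstappen wins at Imola", "Sports"),
    ("Piastri wins in Miami", "Sports"),
    ("New novel released", "Literature"),
]

disagreements = training_disagreements(training_data, toy_predict)
print(disagreements)
```

In this toy case every disagreement is an F1 article labeled "Sports", which points at the label scheme rather than at the model.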
  27. Only after defining the right hill to climb can you get to the fun stuff:

    ★ Algorithm development
    ★ Model training
    ★ LLM fine-tuning
    ★ …
  28. ML project: data annotation

    ✓ Ensure your label scheme is consistent and unambiguous
    ✓ Draft clear annotation guidelines to ensure data consistency
    ✓ Measure inter-annotator agreement (IAA)
    ✓ Consider reframing your task/guidelines if the IAA is low
    ✓ Model uncertainty in your annotation workflow

    📝 1/3
  29. ML project: performance evaluation

    ✓ Develop simple baselines to put performance into perspective
    ✓ Quantify realistic upper/lower performance bounds
    ✓ Measure performance as part of the larger business process

    📝 2/3
  30. ML project: performance evaluation

    ✓ Identify structural data errors by "predicting" the training data
    ✓ Apply to truly unseen data to measure realistic performance
    ✓ Make sure you’re climbing the right hill

    📝 3/3