2025-05-31-pycon_italia

Data doesn't lie, but it can mislead
How to ensure integrity of your ML applications

Me: Sofie Van Landeghem
Open-source contributor
Freelance NLP consultant
https://oxykodit.com/

Create a portal with all Formula One news

Date         Race            Winner
4 May 2025   Miami           Oscar Piastri
18 May 2025  Emilia Romagna  Max Verstappen
25 May 2025  Monaco          Lando Norris
Link facts to given news articles

Date          Race  Winner
28 July 2024  Spa   George Russell
Let's inspect some sample data

But in reality…

Date          Race  Winner
28 July 2024  Spa   Lewis Hamilton

Date          Race  Winner
28 July 2024  Spa   George Russell
28 July 2024  Spa   Lewis Hamilton
NLP, Filtering

Episode "Looking Out for Number 1"
A narrative about George Russell "stepping up" into a leadership role at Mercedes …
… narrating the Spa 2024 Grand Prix …
… showing Russell on the podium …
… and never even mentioning his disqualification.

Always have a domain expert on board who can double
check the data and the results

Portal requirement: link-out to Wikipedia

"Hamilton won the Belgian Grand Prix in Spa in 2024."

Obtaining gold-standard data
Sir Lewis Carl Davidson Hamilton (born 7 January 1985) is a British [[racing driver]] who competes in [[Formula One]] for [[Scuderia Ferrari|Ferrari]].

Building an Entity Linker model

✦ 79% F-score ✦

79% is pretty good, right? Right?

How most clients / users think about performance
0%: the worst model
50%: "random guessing"
100%: the best model, "entirely correct"

… but a random baseline is almost never 50%

For our entity linking task, the random baseline is actually
1/6,997,326 ≈ 0.000014% (a.k.a. 0%)
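The arithmetic behind that number, as a quick check. The page count is the one quoted on the slide; the variable names are only illustrative.

```python
# Number of candidate Wikipedia pages, as quoted on the slide.
N_PAGES = 6_997_326

# Chance of picking the right page uniformly at random, as a percentage.
random_baseline_pct = 100 / N_PAGES

print(f"{random_baseline_pct:.6f}%")  # 0.000014%
```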

… and our upper bound wasn't 100% either …

In fact, we had to prune the database to meet efficiency requirements
❖ Kept only 14% of all Wikipedia concepts
❖ An "oracle" disambiguation obtains 84% F-score
So, the modeling work targets a 0% - 84% range
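A minimal sketch of how such an oracle upper bound could be computed. The slides don't describe the exact procedure, so this assumes the oracle links correctly whenever the gold concept survived pruning and abstains otherwise. The identifiers and toy data are made up.

```python
def f_score(tp: int, fp: int, fn: int) -> float:
    """Harmonic mean of precision and recall (F1)."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def oracle_f_score(gold_ids: list[str], pruned_kb: set[str]) -> float:
    """Assumed oracle: always right when the gold concept is still in the KB."""
    tp = sum(1 for gold in gold_ids if gold in pruned_kb)
    fn = len(gold_ids) - tp  # gold concept was pruned, so it can never be found
    return f_score(tp, fp=0, fn=fn)


# Toy data with made-up concept identifiers.
gold = ["Q1", "Q2", "Q3", "Q4"]
kb = {"Q1", "Q2", "Q3"}
print(round(oracle_f_score(gold, kb), 3))  # 0.857
```

Any model evaluated against the same pruned database cannot score above this value, which is why it is a more honest ceiling than 100%.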

Ok, so within a range of 0-84%, 79% is really
good, right? Right?

Develop some simple baselines to better understand the complexity and
data challenges of your project

In our case, we won't assign "Hamilton" to a random
page out of the 7M available ones

Instead, we obtain a list of "Hamilton" candidates through lexical
search

Now, if we pick a random one from this candidate
list, we actually obtain 54% F-score (without doing any ML at all!)
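A sketch of this random-candidate baseline, under the assumption that lexical search boils down to a lookup from surface form to candidate pages. The alias table and gold pairs are toy data, not the real knowledge base.

```python
import random

# Hypothetical alias table: surface form -> candidate pages from lexical search.
CANDIDATES = {
    "Hamilton": ["Lewis Hamilton", "Alexander Hamilton", "Duke of Hamilton"],
    "Lewis Hamilton": ["Lewis Hamilton"],
}


def random_candidate_baseline(mention, rng):
    """Pick one candidate uniformly at random; None if there are no candidates."""
    candidates = CANDIDATES.get(mention, [])
    return rng.choice(candidates) if candidates else None


rng = random.Random(0)  # fixed seed so the run is repeatable
gold = [("Hamilton", "Lewis Hamilton"), ("Lewis Hamilton", "Lewis Hamilton")]
predictions = [random_candidate_baseline(mention, rng) for mention, _ in gold]
accuracy = sum(pred == page for pred, (_, page) in zip(predictions, gold)) / len(gold)
print(predictions, accuracy)
```

Unambiguous mentions are always resolved correctly, which is one reason this baseline scores far above 0%.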

Let's create an even stronger baseline.

Consider "prior probabilities": a measure of how often a mention is linked to a certain concept
Textual mention       F1 race driver  18th century military officer  Duke of Hamilton
"Hamilton"            75%             7%                             10%
"Lewis Hamilton"      99%             0%                             0%
"Alexander Hamilton"  0%              97%                            1%

Now extrapolate them to the extreme and use this as "predictions":
Textual mention       F1 race driver  18th century military officer  Duke of Hamilton
"Hamilton"            100%            0%                             0%
"Lewis Hamilton"      100%            0%                             0%
"Alexander Hamilton"  0%              100%                           0%

This "prior probabilities" baseline obtains 78.2% F-score (still without any
ML at all!)
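The prior-probability baseline amounts to an argmax over a lookup table. The sketch below uses the percentages from the table above; the dictionary layout and function name are illustrative assumptions.

```python
# Prior probabilities from the table above: mention -> {concept: P(concept | mention)}.
PRIORS = {
    "Hamilton": {
        "F1 race driver": 0.75,
        "18th century military officer": 0.07,
        "Duke of Hamilton": 0.10,
    },
    "Lewis Hamilton": {
        "F1 race driver": 0.99,
        "18th century military officer": 0.0,
        "Duke of Hamilton": 0.0,
    },
    "Alexander Hamilton": {
        "F1 race driver": 0.0,
        "18th century military officer": 0.97,
        "Duke of Hamilton": 0.01,
    },
}


def prior_baseline(mention):
    """Always predict the most frequently linked concept; the context is ignored."""
    priors = PRIORS.get(mention)
    if not priors:
        return None
    return max(priors, key=priors.get)


print(prior_baseline("Hamilton"))            # F1 race driver
print(prior_baseline("Alexander Hamilton"))  # 18th century military officer
```

Because the context never enters the decision, every "Hamilton" becomes the F1 driver, including the ones in articles about dukes.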

The ML model that is supposed to disambiguate based on
context (79%) only marginally improves upon a relatively simple baseline (78.2%)

Developing a few simple baselines puts your ML's performance measures
in the right perspective

Let's revisit our upper bound (again). Is it really 100% if we have no memory / efficiency requirements?
0% to 100%, where 100% is the best model: "entirely correct"

100% precision means every disambiguation is "correct". But what does
"correctness" even mean?


"Societies in the ancient civilizations of Greece and Rome preferred small families"

"Full metro systems are in operation in Paris, Lyon and Marseille"

When multiple annotators label the same data sample, how often do they agree?
↪ Inter-annotator agreement = IAA

High IAA: annotators mostly agree → the gold data is reliable and robust
Low IAA: annotators disagree because…
★ The data is confusing
★ The label scheme is ambiguous
★ The NLP task is too complex
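The slides don't say which agreement measure was used. One common choice for two annotators is Cohen's kappa, which corrects raw agreement for the agreement expected by chance. A minimal sketch with made-up annotations:

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labelled the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("Both annotators must label the same, non-empty item list")
    n = len(labels_a)
    # Share of items on which both annotators picked the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Agreement expected if both annotated at random with their own label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (observed - expected) / (1 - expected)


# Toy annotations, invented for illustration.
annotator_1 = ["F1", "F1", "Football", "Sports", "F1", "Football"]
annotator_2 = ["F1", "Sports", "Football", "Sports", "F1", "Sports"]
print(round(cohen_kappa(annotator_1, annotator_2), 2))  # 0.52
```

Raw agreement here is 4 out of 6 (67%), but kappa is only 0.52 once chance agreement is discounted.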

We need to label incoming articles for our portal
1. Sports
   a. Football
   b. F1
2. Politics
   a. Global politics
   b. US politics
3. Leisure
   a. Traveling
   b. Sports
   c. Literature

Let's plot IAA of the labels…
                 Football  F1    Global politics  US Politics  Traveling  Sports  Literature
Football         0.79
F1               0         0.73
Global politics  0         0     0.95
US Politics      0         0.12  0.05             0.83
Traveling        0         0     0                0            1.0
Sports           0.21      0.15  0                0            0          0.64
Literature       0         0     0                0            0          0       1.0
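How the matrix above was computed isn't stated on the slides. One possible way to get a per-label view like this from two annotators is sketched below: for each label, it reports how the other annotator labelled the same items. The function name and toy data are assumptions.

```python
from collections import Counter, defaultdict


def agreement_matrix(labels_a, labels_b):
    """For each label x: the share of x's annotations where the other annotator chose y."""
    pair_counts = defaultdict(Counter)
    for a, b in zip(labels_a, labels_b):
        pair_counts[a][b] += 1
        if a != b:
            pair_counts[b][a] += 1  # count the disagreement for both labels
    matrix = {}
    for label, row in pair_counts.items():
        total = sum(row.values())
        matrix[label] = {other: count / total for other, count in row.items()}
    return matrix


# Toy annotations, invented for illustration.
annotator_1 = ["F1", "F1", "Football", "Sports", "F1", "Football"]
annotator_2 = ["F1", "Sports", "Football", "Sports", "F1", "Sports"]
matrix = agreement_matrix(annotator_1, annotator_2)
print(matrix["Football"])  # {'Football': 0.5, 'Sports': 0.5}
```

Large off-diagonal values point at label pairs that humans mix up, which is where the label scheme deserves a second look.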

Confusion between global and US politics? (Global politics vs. US Politics: 0.05)

Consider merging labels if humans can't reliably distinguish them

Confusion between politics and F1? (US Politics vs. F1: 0.12)

Solution: allow multiple correct labels
1. Sports
   a. Football
   b. F1
2. Politics
   a. Global politics
   b. US politics
3. Leisure
   a. Traveling
   b. Sports
   c. Literature

Reframe the task to match your data
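One way to score predictions once several labels may be correct for the same article: treat the gold standard as a set of accepted labels per item. This is a sketch of that idea, not necessarily the evaluation used in the project. The data is made up.

```python
def multilabel_accuracy(predictions, gold_sets):
    """A prediction counts as correct if it is among the accepted gold labels."""
    if len(predictions) != len(gold_sets) or not predictions:
        raise ValueError("Need one set of gold labels per prediction")
    hits = sum(pred in gold for pred, gold in zip(predictions, gold_sets))
    return hits / len(predictions)


# Toy example: the first article is about both F1 and US politics.
gold_sets = [{"F1", "US politics"}, {"Football"}, {"Global politics"}]
predictions = ["US politics", "Football", "US politics"]
print(round(multilabel_accuracy(predictions, gold_sets), 2))  # 0.67
```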

Confusion within the sports category? (Sports vs. Football: 0.21, Sports vs. F1: 0.15)

1. Sports
   a. Football
   b. F1
2. Politics
   a. Global politics
   b. US politics
3. Leisure
   a. Traveling
   b. Sports
   c. Literature
Solution: critically revise the label scheme given to you!
Inherent ambiguity in the label scheme needs to be addressed

Together with your domain experts, critically revise the label scheme

What if we don't have multiple annotators?

Analyse discrepancies between a model's prediction and the gold standard
label

"Agnes Maria of Andechs-Merania (died 1201) was a Queen of France."
Gold annotation vs. prediction

Prediction ≠ Gold
Is this really an incorrect prediction (a "false positive")?
Or is the label wrongly annotated?
Perhaps both answers can be seen as "correct"?

Depending on the downstream use-case, the precision of this method
was 87-96%

Analyse "wrong" predictions on the training dataset, to find structural
data errors
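A sketch of what "predicting" the training data can look like in practice: run the model over its own training examples and count the (gold, predicted) pairs that disagree. Patterns that keep recurring are candidates for structural annotation errors. The `predict` callable and the toy data are hypothetical stand-ins for a real model and dataset.

```python
from collections import Counter


def mismatch_patterns(examples, predict):
    """Count (gold, predicted) pairs that disagree, most frequent first."""
    patterns = Counter()
    for text, gold in examples:
        predicted = predict(text)
        if predicted != gold:
            patterns[(gold, predicted)] += 1
    return patterns.most_common()


# Toy training data: (document id, gold label). Entirely made up.
train = [("doc-1", "F1"), ("doc-2", "Sports"), ("doc-3", "Sports"), ("doc-4", "Literature")]

# Hypothetical stand-in for a trained model's output.
FAKE_MODEL_OUTPUT = {"doc-1": "F1", "doc-2": "F1", "doc-3": "F1", "doc-4": "Literature"}


def predict(text):
    return FAKE_MODEL_OUTPUT[text]


print(mismatch_patterns(train, predict))  # [(('Sports', 'F1'), 2)]
```

A model normally fits its training data well, so a mismatch that shows up repeatedly is worth a manual look before blaming the model.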

Manually annotated training dataset:
"... to kick off 2025 [Dutch Grand Prix] weekend in Zandvoort"
"Gasly laments ‘quite sad’ [Monaco] GP crash"
Sample prediction:
"Gasly laments ‘quite sad’ [Monaco GP] crash"

Always include the words "Grand Prix" or "GP" in the entity annotation:
❌ Gasly laments ‘quite sad’ [Monaco] GP crash
✅ Gasly laments ‘quite sad’ [Monaco GP] crash
Write up annotation guidelines to help with consistency of annotations
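A guideline like this can also be checked automatically. The sketch below flags annotated spans that stop right before "GP" or "Grand Prix". It assumes annotations are stored as character offsets into the text; the function name is made up.

```python
import re

# Matches "GP" or "Grand Prix" directly after the end of an annotated span.
_TRAILING_GP = re.compile(r"\s+(GP|Grand Prix)\b")


def violates_gp_guideline(text: str, start: int, end: int) -> bool:
    """True if the span text[start:end] stops right before "GP" / "Grand Prix"."""
    return _TRAILING_GP.match(text, end) is not None


headline = "Gasly laments 'quite sad' Monaco GP crash"
start = headline.index("Monaco")

print(violates_gp_guideline(headline, start, start + len("Monaco")))     # True
print(violates_gp_guideline(headline, start, start + len("Monaco GP")))  # False
```

Running such a check over the whole training set gives a quick list of annotations to fix.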

"Correctness" may also depend on the downstream usage of your
results


Make sure you climb the right hill

Is your data correct, consistent and robust?
Is your test data representative of real-world data?
Does your evaluation capture the value of the algorithm in production?

Only after defining the right hill to climb can you get to the fun stuff:
★ Algorithm development
★ Model training
★ LLM fine-tuning
★ …

Let's wrap up with a practical checklist 📝

ML project: data annotation 📝 1/3
✓ Ensure your label scheme is consistent and unambiguous
✓ Draft clear annotation guidelines to ensure data consistency
✓ Measure inter-annotator agreement (IAA)
✓ Consider reframing your task/guidelines if the IAA is low
✓ Model uncertainty in your annotation workflow

ML project: performance evaluation 📝 2/3
✓ Develop simple baselines to put performance into perspective
✓ Quantify realistic upper/lower performance bounds
✓ Measure performance as part of the larger business process

ML project: performance evaluation 📝 3/3
✓ Identify structural data errors by "predicting" the training data
✓ Apply to truly unseen data to measure realistic performance
✓ Make sure you’re climbing the right hill

Main Take Away
Hamilton won the Belgian 2024 Grand Prix, not Russell ;-)

Thank you!
Sofie Van Landeghem
NLP freelancer @ OxyKodit
PyCon Italia 2025
