When machine learning models make decisions that affect people's lives, how can you be sure those decisions are fair? When you build a machine learning product, how can you be sure it isn't biased? What does it even mean for an algorithm to be 'fair'? As machine learning becomes more prevalent in socially impactful domains like policing, lending, and education, these questions take on a new urgency.
In this talk I'll introduce several common metrics that measure the fairness of model predictions. I'll then relate these metrics to different notions of fairness and show how the context in which a model or product is used determines which metrics (if any) are applicable. To illustrate this context-dependence, I'll describe a case study based on anonymized real-world data. I'll also highlight some open source tools in the Python ecosystem that address model fairness. Finally, I'll argue that if your job involves building these kinds of models or products, then it is your responsibility to think about the answers to these questions.
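As a taste of the kind of metric the talk covers, here is a minimal sketch of one common fairness measure, the demographic parity difference (the gap in positive-prediction rates between groups). The data, variable names, and helper function are made up for illustration and are not the specific metrics or case study from the talk.

```python
import numpy as np

# Illustrative (made-up) data: binary model decisions and a sensitive attribute.
y_pred = np.array([1, 0, 1, 1, 0, 1, 0, 0, 1, 0])
group = np.array(["A", "A", "A", "A", "A", "B", "B", "B", "B", "B"])

def demographic_parity_difference(y_pred, group):
    """Gap between the highest and lowest positive-prediction (selection)
    rates across groups: 0 means every group is selected at the same rate,
    larger values indicate greater disparity."""
    rates = [y_pred[group == g].mean() for g in np.unique(group)]
    return max(rates) - min(rates)

print(demographic_parity_difference(y_pred, group))  # ~0.2 for this toy data
```

This is only one of several competing definitions of fairness; which one (if any) is appropriate depends on the context in which the model is deployed, which is exactly the point the talk explores.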