Hongseok Namkoong (Columbia University, New York, USA) On the Need for a Language Describing Distribution Shifts: Illustrations on Tabular Datasets
WORKSHOP ON OPTIMAL TRANSPORT
FROM THEORY TO APPLICATIONS
INTERFACING DYNAMICAL SYSTEMS, OPTIMIZATION, AND MACHINE LEARNING
Venue: Humboldt University of Berlin, Dorotheenstraße 24
Hongseok Namkoong ([email protected]), Decision, Risk, and Operations Division, Columbia Business School. Based on joint works with Tiffany Cai, Peng Cui, Jiashuo Liu, and Tianyu Wang.
Distributionally robust optimization (DRO): solve a worst-case problem • Idea: do well almost all the time, instead of only on average! • An application of optimal transport, e.g., Kuhn, Esfahani, Nguyen, Shafieezadeh-Abadeh (2019)
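As a reminder, the worst-case problem referenced above can be written in the standard Wasserstein-DRO form (radius $\rho$ and transport cost $c$ are left to the modeler):

$$\min_{\theta}\;\sup_{Q \,:\, W_c(Q,\hat{P}_n)\le \rho}\;\mathbb{E}_{Q}\big[\ell(\theta; Z)\big],
\qquad
W_c(Q,P)=\inf_{\pi\in\Pi(Q,P)}\mathbb{E}_{(z,z')\sim\pi}\big[c(z,z')\big].$$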
But these worst-case methods have not contributed to any major success in ML/AI • My experience: good for last-layer interventions (e.g., fairness adjustments), but these ideas do not scale! ◦ Key issue: data, data, data… • Today: what impact can theory-driven principles have in ML/AI?
What lifts models above the red line? Algorithmic interventions do not provide this robustness • Only larger training data does; as a result, recent works in AI largely focus on scaling data from the internet • No principled understanding of datasets • Caveat: this is a one-slide summary of an entire field; naturally, I omit nuances.
Data is the binding constraint outside of the internet • We cannot just "scale" data; we need to understand which data to collect • To start, let's examine the implicit assumptions so far ◦ AI researchers focus on building a universally robust model, just like humans! ◦ Implicitly, this view focuses on covariate shift (X-shift), e.g., image recognition ◦ One-size-fits-all mindset
Y|X-shifts arise when there are unobserved factors whose distribution changes across time & space
• X-shifts: changes in sampling, underrepresented groups
• Y|X-shifts: changes in labeling, poorly chosen X, confounders
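In symbols, writing the training distribution as P and the target as Q (as in the rest of the talk):

$$\text{X-shift:}\quad P_X \neq Q_X,\ \ P_{Y\mid X}=Q_{Y\mid X};
\qquad
\text{Y|X-shift:}\quad P_{Y\mid X}\neq Q_{Y\mid X}.$$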
Y|X-shifts arise when there are unobserved factors whose distribution changes across time & space • Conjecture: Y|X-shifts are more prominent in practice • For Y|X-shifts, we don't expect a single model to perform well across distributions • Requires application-specific understanding of distributional differences
On the Need for a Language Describing Distribution Shifts: Illustrations on Tabular Datasets • Look at the loss ratio of the deployed model vs. the best model for the target • Even tabular benchmarks mainly focus on X-shifts
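One way to formalize this quantity (a sketch of the metric described above; the exact definition used in the WhyShift benchmark may differ in details):

$$\text{loss ratio}\;=\;\frac{\mathbb{E}_{Q}\big[\ell\big(f_{P}(X),Y\big)\big]}{\mathbb{E}_{Q}\big[\ell\big(f_{Q}(X),Y\big)\big]},$$

where $f_P$ is the model trained on the source P and deployed on the target Q, and $f_Q$ is the best model one could train for Q.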
• Out of 169 train-target pairs with significant performance degradation, 80% are primarily attributed to Y|X-shifts • The standard CS benchmarking view breaks down: we can't just compare models based on their out-of-distribution performance! • Infeasible to simultaneously perform well across train and target • We need to build an understanding of why the distribution changed! WhyShift: https://github.com/namkoong-lab/whyshift
Accuracy on the line: on the strong correlation between out-of-distribution and in-distribution generalization (Miller et al., 2021), observed on ImageNet-style benchmarks
• Accuracy-on-the-line doesn't hold under strong Y|X-shifts
• Train & target performance are correlated only when X-shifts dominate
One size fits all ◦ Existing robustness methods make assumptions about data distributions but do not check them ◦ Need application-specific understanding of real shift patterns • We need a modeling language for distribution shifts!
DRO revisited • The ambiguity set is arbitrary; primarily driven by mathematical convenience, with details "left to the modeler" • Little thought given to the model class
Study the effect of algorithmic design knobs on model performance:
◦ Model class (tree, linear, MLP)
◦ Ambiguity set (distance type, radius)
◦ Shift pattern (Y|X-ratio)
◦ Validation type (average, worst)
◦ Task/state fixed effect
Upper: Predict whether a low-income individual, not eligible for Medicare, has coverage from public health insurance. Lower: Predict whether annual income > $50K. Target performance: single state.
Upper: Predict whether a low-income individual, not eligible for Medicare, has coverage from public health insurance. Lower: Predict whether annual income > $50K. Target performance: worst state.
Age subgroups: [20,25), [25,30), …, [75,100) • Consider DRO methods that allow shifts on only a subset of covariates • Variable selection for the ambiguity set: top-k covariates with the largest subgroup differences (a sketch of this selection step follows below) • Performance varies a lot over the variables selected
[Figure: target performance of Marginal DRO and Wasserstein DRO as the number of selected variables k varies, and with all variables]
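A minimal sketch of the top-k selection step described above, assuming pandas DataFrames for the source and target samples; the ranking criterion (difference in subgroup means) and the helper name are illustrative, not the WhyShift implementation:

```python
import pandas as pd

def topk_shifted_covariates(df_source: pd.DataFrame,
                            df_target: pd.DataFrame,
                            group_col: str,
                            covariates: list[str],
                            k: int) -> list[str]:
    """Rank covariates by how much their subgroup means differ
    between source and target, and return the top-k."""
    scores = {}
    for col in covariates:
        # Mean of the covariate within each subgroup (e.g., age bucket).
        src = df_source.groupby(group_col)[col].mean()
        tgt = df_target.groupby(group_col)[col].mean()
        # Largest absolute difference over subgroups present in both samples.
        common = src.index.intersection(tgt.index)
        scores[col] = (src.loc[common] - tgt.loc[common]).abs().max()
    # Covariates with the largest subgroup differences define the ambiguity set.
    return sorted(scores, key=scores.get, reverse=True)[:k]

# Example (hypothetical variable names): restrict the ambiguity set to the
# 3 most shifted covariates.
# selected = topk_shifted_covariates(train_df, target_df, "age_bucket",
#                                    feature_cols, k=3)
```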
The standard benchmarking view ◦ Distribution shift: out-of-distribution performance is worse than in-distribution performance! ◦ But this just means the average loss under the target exceeds that under training: $\mathbb{E}_Q[L] > \mathbb{E}_P[L]$ (L: loss, P: train, Q: target) • Attribute performance degradation: not all shifts matter • Different shifts warrant different interventions
Diagnosing Model Performance Under Distribution Shift: https://github.com/namkoong-lab/disde, https://arxiv.org/abs/2303.02011
Compare model performance given X: $\mathbb{E}_Q[L \mid X]$ vs. $\mathbb{E}_P[L \mid X]$ (L: loss, P: train, Q: target)
You can only compare Y|X on shared X: outside the overlap of P and Q, either $\mathbb{E}_P[L \mid X]$ or $\mathbb{E}_Q[L \mid X]$ is not well-defined (L: loss, P: train, Q: target)
[Figure: densities of X (age) under P, Q, and the shared distribution S] (L: loss, P: train, Q: target, S: shared)
Decompose the change in performance from $\mathbb{E}_P[\mathbb{E}_P[L \mid X]]$ (performance on the training distribution) to $\mathbb{E}_Q[\mathbb{E}_Q[L \mid X]]$ (performance on the target distribution) into X-shift vs. Y|X-shift terms, via the shared distribution S (L: loss, P: train, Q: target, S: shared)
X-shift within the shared region: compare $\mathbb{E}_S[\mathbb{E}_P[L \mid X]]$ with $\mathbb{E}_P[\mathbb{E}_P[L \mid X]]$ • Diagnosis: S has more X's that are harder to predict than P • Potential interventions: use domain adaptation, e.g., importance weighting
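When the diagnosis points to an X-shift within the shared region, one standard intervention is to reweight training examples by estimated density ratios from a domain classifier. The sketch below is a hedged illustration of that importance-weighting idea (not the DISDE code); all names are illustrative:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import GradientBoostingClassifier

def importance_weights(X_train, X_target, clip=10.0):
    """Estimate w(x) ~ q(x)/p(x) with a probabilistic classifier that
    separates target covariates (label 1) from training covariates (label 0)."""
    X = np.vstack([X_train, X_target])
    d = np.concatenate([np.zeros(len(X_train)), np.ones(len(X_target))])
    clf = LogisticRegression(max_iter=1000).fit(X, d)
    prob_target = clf.predict_proba(X_train)[:, 1]
    # Density-ratio estimate, corrected for relative sample sizes.
    w = (prob_target / (1 - prob_target)) * (len(X_train) / len(X_target))
    return np.clip(w, 0.0, clip)  # clip large weights to control variance

# Usage sketch: refit the predictive model with the estimated weights.
# w = importance_weights(X_train, X_target)
# model = GradientBoostingClassifier().fit(X_train, y_train, sample_weight=w)
```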
Y|X-shift on the shared region: compare $\mathbb{E}_S[\mathbb{E}_Q[L \mid X]]$ with $\mathbb{E}_S[\mathbb{E}_P[L \mid X]]$ • Diagnosis: Y|X moves farther from the fitted model • Potential interventions: re-collect data or modify covariates
X-shift onto new examples: compare $\mathbb{E}_Q[\mathbb{E}_Q[L \mid X]]$ with $\mathbb{E}_S[\mathbb{E}_Q[L \mid X]]$ • Diagnosis: Q has "new" X's that are harder to predict than S • Potential interventions: collect + label more data on "new" examples
Decompose the change in performance (L: loss, P: train, Q: target, S: shared):

$$\mathbb{E}_Q[\mathbb{E}_Q[L \mid X]] - \mathbb{E}_P[\mathbb{E}_P[L \mid X]]
= \underbrace{\big(\mathbb{E}_S[\mathbb{E}_P[L \mid X]] - \mathbb{E}_P[\mathbb{E}_P[L \mid X]]\big)}_{\text{X-shift, } P \to S}
+ \underbrace{\big(\mathbb{E}_S[\mathbb{E}_Q[L \mid X]] - \mathbb{E}_S[\mathbb{E}_P[L \mid X]]\big)}_{\text{Y|X-shift on shared } X}
+ \underbrace{\big(\mathbb{E}_Q[\mathbb{E}_Q[L \mid X]] - \mathbb{E}_S[\mathbb{E}_Q[L \mid X]]\big)}_{\text{X-shift, } S \to Q}$$
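A minimal sketch of how these three terms could be estimated, assuming we fit regressions $\hat{\mu}_P(x) \approx \mathbb{E}_P[L \mid X=x]$ and $\hat{\mu}_Q(x) \approx \mathbb{E}_Q[L \mid X=x]$ and already have a sample drawn from a shared distribution S supported on the overlap; the function names are hypothetical, and the actual DISDE implementation (linked above) differs in how S is constructed and how uncertainty is quantified:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

def decompose_shift(X_p, loss_p, X_q, loss_q, X_s):
    """Plug-in estimate of the three-term decomposition:
    (X-shift P->S) + (Y|X-shift on shared X) + (X-shift S->Q)."""
    # Regress observed losses on covariates under each distribution.
    mu_p = GradientBoostingRegressor().fit(X_p, loss_p)  # approximates E_P[L | X]
    mu_q = GradientBoostingRegressor().fit(X_q, loss_q)  # approximates E_Q[L | X]

    perf_p = np.mean(loss_p)        # E_P[E_P[L|X]]: performance on train
    perf_q = np.mean(loss_q)        # E_Q[E_Q[L|X]]: performance on target
    s_p = mu_p.predict(X_s).mean()  # E_S[E_P[L|X]]
    s_q = mu_q.predict(X_s).mean()  # E_S[E_Q[L|X]]

    return {
        "x_shift_P_to_S": s_p - perf_p,
        "y_given_x_shift": s_q - s_p,
        "x_shift_S_to_Q": perf_q - s_q,
        "total": perf_q - perf_p,   # algebraically the sum of the three terms
    }
```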
[X shift] Q: general population • Performance degradation attributed to the X-shift (S → Q), i.e., "new examples" such as older people
Employment prediction case study [X shift] P: age ≤ 25 overrepresented, Q: evenly sampled population • Domain adaptation may be effective
Employment prediction case study [Y|X shift] P: West Virginia, Q: Maryland • Y|X shift because of a missing covariate: education affects employment
[Y|X shift] P: California (CA), Q: Puerto Rico (PR) • The CA model does not use language; Y|X shift because of a missing covariate: language affects the outcome • Adding language features → better performance in PR [Figure: decomposition without vs. with language features]
Distribution Shift Decomposition (DISDE): attributes performance changes to X vs. Y|X shift • Can help articulate modeling assumptions + guide data collection • We need a modeling language for a data-centric view of AI • Limitations: the shared space is not easy to understand in high dimensions • Optimal transport can provide a flexible modeling language • What is the right geometry to model distribution shifts?

References:
Cai, Namkoong, and Yadlowsky. Diagnosing Model Performance Under Distribution Shift. Major revision in Operations Research. https://github.com/namkoong-lab/disde
Liu, Wang, Cui, and Namkoong. On the Need for a Language Describing Distribution Shifts: Illustrations on Tabular Datasets. NeurIPS 2023. https://github.com/namkoong-lab/whyshift