
Hongseok Namkoong (Columbia University, New York, USA) On the Need for a Language Describing Distribution Shifts: Illustrations on Tabular Datasets

WORKSHOP ON OPTIMAL TRANSPORT
FROM THEORY TO APPLICATIONS
INTERFACING DYNAMICAL SYSTEMS, OPTIMIZATION, AND MACHINE LEARNING
Venue: Humboldt University of Berlin, Dorotheenstraße 24

Berlin, Germany. March 11th - 15th, 2024

Jia-Jie Zhu

March 18, 2024

Transcript

  1. We need a modeling language for a data-centric view of AI. Hongseok Namkoong, [email protected], Decision, Risk, and Operations Division, Columbia Business School. Based on joint work with Tiffany Cai, Peng Cui, Jiashuo Liu, and Tianyu Wang.
  2. • Standard approach: solve average-case risk minimization.
     • Distributionally robust optimization (DRO): solve a worst-case problem.
     • Idea: do well almost all the time, instead of on average!
     • An application of optimal transport; see, e.g., Kuhn, Esfahani, Nguyen, Shafieezadeh-Abadeh (2019).
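For concreteness, a standard way to write the two objectives (the notation here, loss ℓ and ambiguity set 𝒰(P), is my own shorthand, not from the slide):

```latex
% Average-case risk minimization (ERM) over model parameters \theta:
\min_{\theta}\; \mathbb{E}_{P}\bigl[\ell(\theta; X, Y)\bigr]

% Distributionally robust optimization (DRO): hedge against every
% distribution in an ambiguity set around the training distribution P,
% e.g. a Wasserstein ball of radius \varepsilon:
\min_{\theta}\; \sup_{Q \in \mathcal{U}(P)} \mathbb{E}_{Q}\bigl[\ell(\theta; X, Y)\bigr],
\qquad \mathcal{U}(P) = \{\, Q : W(Q, P) \le \varepsilon \,\}
```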
  3. “Robust” AI
     • Many algorithmic solutions toward robustness, generalization, and fairness.
     • These are all from my own body of work on the topic, so that I can dish on them later!
  4. Self-reflections on my research
     • While intellectually satisfying, these algorithms have not contributed to any major success in ML/AI.
     • My experience: they are good for last-layer interventions (e.g., fairness adjustments), but these ideas do not scale!
       ◦ Key issue: data, data, data…
     • Today: what impact can theory-driven principles have in ML/AI?
  5. Improving effective robustness
     • How do we go up the red line? Algorithmic interventions do not provide this robustness.
     • Only larger training data does; as a result, recent work in AI largely focuses on scaling data from the internet.
     • There is no principled understanding of datasets.
     • Caveat: this is a one-slide summary of an entire field; naturally, I omit nuances.
  6. Modeling language for datasets
     • The cost of data collection is a binding constraint outside of the internet.
     • We cannot just “scale” data; we need to understand which data to collect.
     • To start, let’s examine the implicit assumptions so far:
       ◦ AI researchers focus on building a universally robust model, just like humans!
       ◦ Implicitly, this view focuses on covariate shift (X-shift), e.g., image recognition.
       ◦ A one-size-fits-all mindset.
  7. X-shifts vs. Y|X-shifts
     • We expect Y|X-shifts when there are unobserved factors whose distribution changes across time & space.
     • X-shifts: changes in sampling, underrepresented groups.
     • Y|X-shifts: changes in labeling, poorly chosen X, confounders.
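In symbols, with p and q the source and target densities over (X, Y):

```latex
% Covariate shift (X-shift): the marginal of X changes while the
% conditional outcome distribution stays fixed:
p(x) \neq q(x), \qquad p(y \mid x) = q(y \mid x)

% Y|X-shift: the conditional itself changes, e.g. because unobserved
% factors vary across time and space:
p(y \mid x) \neq q(y \mid x)
```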
  8. X-shifts vs. Y|X-shifts (continued)
     • Conjecture: Y|X-shifts are more prominent in practice.
     • Under Y|X-shifts, we don’t expect a single model to perform well across distributions.
     • This requires application-specific understanding of distributional differences.
  9–10. Liu, Wang, Cui, Namkoong, On the Need for a Language Describing Distribution Shifts: Illustrations on Tabular Datasets.
     • Look at the loss ratio of the deployed model vs. the best model for the target (sketched below).
     • Even tabular benchmarks mainly focus on X-shifts.
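A minimal sketch of this loss-ratio diagnostic; the function and argument names are illustrative, not taken from the whyshift codebase:

```python
def loss_ratio(deployed_model, best_target_model, X_target, y_target, loss_fn):
    """Ratio of the deployed model's loss on the target distribution to the
    loss of the best model trained for the target; values well above 1
    indicate significant performance degradation under the shift."""
    deployed_loss = loss_fn(y_target, deployed_model.predict(X_target))
    oracle_loss = loss_fn(y_target, best_target_model.predict(X_target))
    return deployed_loss / oracle_loss
```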
  11. WhyShift (https://github.com/namkoong-lab/whyshift; paper on arXiv)
     • 7 spatiotemporal and demographic shifts from 5 tabular datasets.
     • Out of 169 train–target pairs with significant performance degradation, 80% are primarily attributed to Y|X-shifts.
     • The CS benchmarking view breaks down: we can’t just compare models based on their out-of-distribution performance!
     • It is infeasible to simultaneously perform well across train and target.
     • We need to build an understanding of why the distribution changed!
  12–13. Accuracy-on-the-line doesn’t hold under strong Y|X-shifts
     • “Accuracy on the line”: the strong correlation between out-of-distribution and in-distribution generalization, demonstrated on ImageNet.
     • On the tabular benchmarks of On the Need for a Language Describing Distribution Shifts, train & target performance are correlated only when X-shifts dominate.
  14. One size fits all
     • Existing algorithms (e.g., DRO) do not provide reliable gains:
       ◦ They make assumptions about data distributions but do not check them.
       ◦ We need application-specific understanding of real shift patterns.
     • We need a modeling language for distribution shifts!
  15. DRO revisited
     • Distributionally robust optimization: solve a worst-case problem.
     • The choice of ambiguity set is arbitrary, driven primarily by mathematical convenience, with details “left to the modeler”.
     • Little thought is given to the model class.
  16. Empirical analysis of 10,000+ DRO models
     • Examine the impact of algorithmic design knobs on model performance (a sketch of such an attribution regression follows below):
       ◦ Model class (tree, linear, MLP)
       ◦ Ambiguity set (distance type, radius)
       ◦ Shift pattern (Y|X-ratio)
       ◦ Validation type (average, worst)
       ◦ Task/state fixed effects
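One way to run this kind of attribution is an OLS regression of target performance on the design knobs, with task and state fixed effects. A minimal sketch under assumed column names and file name; this is not the talk's actual analysis code:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical results table: one row per trained DRO model, with columns
# target_acc, model_class, distance_type, radius, yx_shift_ratio,
# validation_type, task, state.
df_runs = pd.read_csv("dro_runs.csv")

fit = smf.ols(
    "target_acc ~ C(model_class) + C(distance_type) + np.log(radius)"
    " + yx_shift_ratio + C(validation_type) + C(task) + C(state)",
    data=df_runs,
).fit()
print(fit.summary())  # which design knobs explain performance, net of fixed effects
```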
  17. Target performance: single state
     • The effect of the ambiguity set is inconsistent across different outcomes.
     • Upper panel: predict whether a low-income individual, not eligible for Medicare, has coverage from public health insurance. Lower panel: predict whether annual income > $50K.
  18. Target performance: worst state
     • Even for worst-state performance, DRO is unreliable.
     • Upper panel: predict whether a low-income individual, not eligible for Medicare, has coverage from public health insurance. Lower panel: predict whether annual income > $50K.
  19. Toward better ambiguity sets
     • Consider covariate shifts induced by age subgroups: [20,25), [25,30), …, [75,100).
     • Consider DRO methods that allow shifts on only a subset of covariates (Marginal DRO, Wasserstein DRO).
     • Variable selection for the ambiguity set: keep the top-k covariates with the largest subgroup differences (a sketch follows below).
     • Performance varies a lot over the variables selected. [Figure: target performance as k grows to “all”, for Marginal DRO and Wasserstein DRO.]
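A simplified stand-in for that selection step: rank each covariate by how much its mean varies across the age subgroups, standardized by its overall spread, and keep the top k (the exact criterion in the talk may differ):

```python
import pandas as pd

def top_k_by_subgroup_difference(df, covariates, subgroup_col, k):
    """Rank covariates by the standardized range of their subgroup means
    (e.g., across age bins) and keep the top k for the ambiguity set."""
    scores = {}
    for col in covariates:
        group_means = df.groupby(subgroup_col)[col].mean()
        spread = df[col].std() + 1e-12  # avoid division by zero
        scores[col] = (group_means.max() - group_means.min()) / spread
    return sorted(scores, key=scores.get, reverse=True)[:k]

# Hypothetical usage with the age bins from the slide:
# df["age_bin"] = pd.cut(df["age"], bins=[20, 25, 30, 35, 40, 45, 50, 55, 60, 65, 70, 75, 100])
# selected = top_k_by_subgroup_difference(df, feature_cols, "age_bin", k=5)
```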
  20. Today: a step toward a modeling language
     • Current ML view:
       ◦ Distribution shift: out-of-distribution performance is worse than in-distribution performance!
       ◦ But this just means P ≠ Q (P: train, Q: target).
     • Attribute performance degradation: not all shifts matter.
     • Different shifts warrant different interventions.
     Diagnosing Model Performance Under Distribution Shift: https://github.com/namkoong-lab/disde, https://arxiv.org/abs/2303.02011
  21. [Figure: densities p(x) and q(x) over X = age, with the conditional expected losses E_P[L|X] and E_Q[L|X].] Notation: L: loss, P: train, Q: target.
  22. [Same figure.] You can only compare Y|X on shared X: E_P[L|X] is not well-defined outside the support of P, and E_Q[L|X] is not well-defined outside the support of Q.
  23. Define a shared distribution S. [Figure: densities p(x) and q(x), and the shared density s(x), over X = age.] Notation: S: shared.
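One natural construction of S, supported exactly where P and Q overlap, takes the pointwise minimum of the two covariate densities (I believe this matches the paper's default choice; shown here with its normalization):

```latex
s(x) \;=\; \frac{\min\{p(x),\, q(x)\}}{\int \min\{p(x'),\, q(x')\}\, \mathrm{d}x'}
```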
  24. Decompose the change in performance: from E_P[E_P[L|X]] (performance on the training distribution) to E_Q[E_Q[L|X]] (performance on the target distribution), decomposed into X-shift vs. Y|X-shift terms via S.
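Written out, the decomposition that the next few slides step through; the three bracketed terms correspond to slides 25, 26, and 27, respectively:

```latex
\mathbb{E}_{Q}\bigl[\mathbb{E}_{Q}[L \mid X]\bigr] - \mathbb{E}_{P}\bigl[\mathbb{E}_{P}[L \mid X]\bigr]
  = \underbrace{\mathbb{E}_{S}\bigl[\mathbb{E}_{P}[L \mid X]\bigr] - \mathbb{E}_{P}\bigl[\mathbb{E}_{P}[L \mid X]\bigr]}_{\text{X-shift, } P \to S}
  + \underbrace{\mathbb{E}_{S}\bigl[\mathbb{E}_{Q}[L \mid X]\bigr] - \mathbb{E}_{S}\bigl[\mathbb{E}_{P}[L \mid X]\bigr]}_{\text{Y|X-shift on shared } X}
  + \underbrace{\mathbb{E}_{Q}\bigl[\mathbb{E}_{Q}[L \mid X]\bigr] - \mathbb{E}_{S}\bigl[\mathbb{E}_{Q}[L \mid X]\bigr]}_{\text{X-shift, } S \to Q}
```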
  25. Decompose the change in performance: the X-shift term E_S[E_P[L|X]] − E_P[E_P[L|X]]. Diagnosis: S has more X’s that are harder to predict than P. Potential intervention: use domain adaptation, e.g., importance weighting.
  26. Decompose the change in performance: the Y|X-shift term E_S[E_Q[L|X]] − E_S[E_P[L|X]]. Diagnosis: Y|X moves farther from the fitted model. Potential interventions: re-collect data or modify the covariates.
  27. Decompose the change in performance: the X-shift term E_Q[E_Q[L|X]] − E_S[E_Q[L|X]]. Diagnosis: Q has “new” X’s that are harder to predict than S. Potential interventions: collect and label more data on the “new” examples.
  28. Decompose the change in performance: all three terms together (estimation sketch below).
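A rough plug-in sketch of how these terms can be estimated; this is my illustration, not the disde package's API. Fit regressions for E_P[L|X] and E_Q[L|X] on observed losses, then average them under P, Q, and S, with S represented by importance weights on the two samples:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

def disde_terms(X_p, loss_p, X_q, loss_q, w_s_over_p, w_s_over_q):
    """Plug-in estimates of the three DISDE terms.

    w_s_over_p / w_s_over_q: importance weights proportional to s(x)/p(x)
    on the P sample and s(x)/q(x) on the Q sample (assumed given, e.g.
    from a density-ratio estimate)."""
    mu_p = GradientBoostingRegressor().fit(X_p, loss_p)  # estimates E_P[L|X]
    mu_q = GradientBoostingRegressor().fit(X_q, loss_q)  # estimates E_Q[L|X]

    E_P_P = np.mean(loss_p)                                    # E_P[E_P[L|X]]
    E_Q_Q = np.mean(loss_q)                                    # E_Q[E_Q[L|X]]
    E_S_P = np.average(mu_p.predict(X_p), weights=w_s_over_p)  # E_S[E_P[L|X]]
    E_S_Q = np.average(mu_q.predict(X_q), weights=w_s_over_q)  # E_S[E_Q[L|X]]

    return {
        "x_shift_P_to_S": E_S_P - E_P_P,   # slide 25
        "y_given_x_shift": E_S_Q - E_S_P,  # slide 26
        "x_shift_S_to_Q": E_Q_Q - E_S_Q,   # slide 27
    }
```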
  29. Employment prediction case study [X-shift]. P: only age ≤ 25; Q: general population. Performance degradation is attributed to the X-shift term (S → Q), meaning “new examples” such as older people.
  30. Employment prediction case study [X-shift]. P: age ≤ 25 overrepresented; Q: evenly sampled population. A substantial portion is attributed to the X-shift term (P → S), suggesting domain adaptation may be effective.
  31. Employment prediction case study [Y|X-shift]. P: West Virginia (WV); Q: Maryland. The WV model does not use education; this is a Y|X-shift due to a missing covariate, since education affects employment.
  32. Better data can be more effective than better algorithms! [Y|X-shift] P: California (CA); Q: Puerto Rico (PR). The CA model does not use language; this is a Y|X-shift due to a missing covariate, since language affects the outcome. Adding language features → better performance in PR.
  33. Distribution Shift Decomposition (DISDE)
     • A diagnostic for understanding why performance dropped, in terms of X- vs. Y|X-shift.
     • Can help articulate modeling assumptions + data collection.
     • We need a modeling language for a data-centric view of AI.
     • Limitation: the shared space is not easy to understand in high dimensions.
     • Optimal transport can provide a flexible modeling language. What is the right geometry to model distribution shifts?
     References:
     • Cai, Namkoong, and Yadlowsky, Diagnosing Model Performance Under Distribution Shift, major revision in Operations Research, https://github.com/namkoong-lab/disde
     • Liu, Wang, Cui, and Namkoong, On the Need for a Language Describing Distribution Shifts: Illustrations on Tabular Datasets, NeurIPS 2023, https://github.com/namkoong-lab/whyshift