
Skrub: machine-learning with dataframes

Gael Varoquaux
September 03, 2025


While data science talks a lot about machine learning, much of the actual work is preparing and assembling dataframes, and that work is largely manual. Here I introduce skrub, a young package that eases machine learning with dataframes. It provides a variety of tools to plug scikit-learn-style machine learning into complex and messy dataframes with no manual work.

I also cover the exciting "DataOps" plans introduced in the new release, which wrap and record any data-assembly or wrangling pipeline and apply machine-learning workflows to it: applying the plan to new data, cross-validating it, or tuning it to maximize prediction accuracy on a task.



Transcript

  1. Data preparation ≫ machine learning. Download numbers⋆ don't lie: PyTorch 34M, scikit-learn 80M, pandas 282M. ⋆ from pypistats.org
  2. Tabular data: statistical properties. Columns of different nature, different distributions: categorical data, strings, dates...
       Sex  Experience  Age  Employee Position Title
       M    10 yrs      42   Master Police Officer
       F    23 yrs      NA   Social Worker IV
       M    3 yrs       28   Police Officer III
       F    16 yrs      45   Police Aide
       M    13 yrs      48   Electrician I
       M    6 yrs       36   Bus Operator
  3. Real tables are too messy for sklearn (and most ML tools): columns of different types (strings, dates...)
       Sex  Age  Position Title
       M    42   Police Officer
       F    NA   Social Worker IV
       M    28   Police Officer III
       F    45   Police Aide
       M    48   Electrician I
       M    36   Bus Operator
       M    62   Bus Operator
  4. Real tables are too messy for sklearn – skrub preprocesses them. Column-specific preprocessing, casting to numbers:
       tab_vec = skrub.TableVectorizer()
       X = tab_vec.fit_transform(df)
       learner = sklearn.pipeline.make_pipeline(
           skrub.TableVectorizer(),
           HistGradientBoostingClassifier())
     A strong baseline.
  5. It's just a matter of data preparation: Gaussianization, encoding strings, removing outliers... converting datetimes, escaping unseen categories. [table residue: experience / age / size columns with missing values]
  6. Data transformation: separating out specification/fitting.
       df = pd.read_csv('employee_salary.csv')
       Gender  Date Hired  Employee Position Title
       M       09/12/1988  Master Police Officer
       F       06/26/2006  Social Worker III
       M       07/16/2007  Police Officer III
       F       01/26/2000  Library Assistant I
     Vectorize: transform to a numerical matrix.
  7. Data transformation: separating out specification/fitting.
       df = pd.read_csv('employee_salary.csv')
     Gender: categorical encoding
       df = pd.get_dummies(df)
     Apply to new data? pd.get_dummies again?
     - Columns in the same order?
     - What if new / different categories?
       Gender (M)  Gender (F)  ...
       1           0
       0           1
       1           0
       0           1
  8. Data transformation: separating out specification/fitting.
     Gender: categorical encoding
       ohe = OneHotEncoder()
       ohe.fit(df)
       X_test = ohe.transform(df_test)
     sklearn's separation of fit & transform enables putting in production, and model evaluation without data leakage.
  9. Transformations adapted to column types. Date Hired:
     - time in seconds, day of week, month
     - periodic splines? ...
       Date Hired  Position Title
       09/12/1988  Librarian
       06/26/2006  Social Worker IV
       07/16/2007  Police Officer III
       01/26/2000  Police Aide
       dte = skrub.DatetimeEncoder()
       dte.fit(df['Date Hired'])
       X = dte.transform(df['Date Hired'])
  10. Transformations adapted to column types.
      Date Hired: time in seconds, day of week, month; periodic splines? ...
        dte = skrub.DatetimeEncoder()
        dte.fit(df['Date Hired'])
        X = dte.transform(df['Date Hired'])
      Position Title: string categorical encoders
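The kind of features a datetime encoder derives can be sketched in plain pandas (a simplified stand-in to show the idea, not skrub's actual implementation):

```python
import pandas as pd

dates = pd.to_datetime(pd.Series(["09/12/1988", "06/26/2006"],
                                 name="Date Hired"), format="%m/%d/%Y")

# Expand one datetime column into several numeric feature columns
features = pd.DataFrame({
    "year": dates.dt.year,
    "month": dates.dt.month,
    "day_of_week": dates.dt.dayofweek,          # Monday = 0
    "epoch_seconds": dates.astype("int64") // 10**9,
})
print(features["year"].tolist())  # [1988, 2006]
```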
  11. Modeling strings, rather than categories. The notion of category ⇔ entity normalization.
        Drug Name: alcohol, ethyl alcohol, isopropyl alcohol, polyvinyl alcohol, isopropyl alcohol swab, 62% ethyl alcohol, alcohol 68%, alcohol denat, benzyl alcohol, dehydrated alcohol
        Employee Position Title: Police Aide, Master Police Officer, Mechanic Technician II, Police Officer III, Senior Architect, Senior Engineer Technician, Social Worker III
  12. String encoding: via substrings. Sub-string count matrices: each entry ('police officer', 'polis', 'policeman', 'policier'...) becomes a row of counts over character 3-grams ('pol', 'lic', 'ice', 'off', 'cer', ...). Then reduce it: lighter representations. [Cerda and Varoquaux 2020]
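A bare-bones version of such a 3-gram count matrix, in pure Python (illustration of the idea only, not skrub's implementation):

```python
from collections import Counter

def char_ngrams(s, n=3):
    """All overlapping character n-grams of a string."""
    return [s[i:i + n] for i in range(len(s) - n + 1)]

entries = ["police officer", "policeman", "policier"]
vocab = sorted({g for e in entries for g in char_ngrams(e)})

# One row per entry, one column per 3-gram in the vocabulary
matrix = [[Counter(char_ngrams(e))[g] for g in vocab] for e in entries]

# Similar strings share columns: 'pol' appears in every row
print(len(matrix), len(vocab))
```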
  13. String encoding: via substrings. MinHashEncoder: fast, stateless (for distributed computing)
      - collisions approximate the Jaccard index
      - great for tree-based models
      - not interpretable features
      [plot residue: min-hash encodings of 'Senior Technician Supply' vs 'Senior Supply Technician' on the employee salaries data] [Cerda and Varoquaux 2020]
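Why collisions approximate the Jaccard index can be shown with a toy MinHash over 3-gram sets (a simplified sketch, not skrub's MinHashEncoder):

```python
import hashlib

def ngrams(s, n=3):
    return {s[i:i + n] for i in range(len(s) - n + 1)}

def minhash(items, n_hashes=200):
    """Min over each of n_hashes salted hashes: stateless, no fit needed."""
    return [min(int(hashlib.md5(f"{seed}:{x}".encode()).hexdigest(), 16)
                for x in items)
            for seed in range(n_hashes)]

a, b = ngrams("police officer"), ngrams("policeman")
sig_a, sig_b = minhash(a), minhash(b)

# Fraction of matching signature slots ≈ Jaccard similarity of the sets
approx = sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)
exact = len(a & b) / len(a | b)
print(round(approx, 2), round(exact, 2))
```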
  14. String encoding: via substrings. MinHashEncoder: fast, stateless (for distributed computing). Matrix factorizations: the substring count matrix factors into (entries × topics) · (topics × substrings) — what substrings are in a latent category, what latent categories are in an entry. [Cerda and Varoquaux 2020]
  15. String encoding: via substrings. MinHashEncoder: fast, stateless (for distributed computing). GapEncoder: Gamma-Poisson model, for latent categories — e.g. job titles ('Legislative Analyst II', 'Bus Operator', 'Master Police Officer', 'Police Sergeant'...) loading onto interpretable latent categories ('...operator', '...manager', '...officer'). [Cerda and Varoquaux 2020]
  16. String encoding: via substrings. MinHashEncoder: fast, stateless (for distributed computing). GapEncoder: Gamma-Poisson model, for latent categories. StringEncoder: randomized linear algebra on the substring count matrix.
  17. String encoding. MinHashEncoder: fast, stateless (for distributed computing). GapEncoder: Gamma-Poisson model, for latent categories. StringEncoder: randomized linear algebra. TextEncoder: pre-trained language models, downloaded from the Hugging Face hub.
  18. String encoding, in summary: MinHashEncoder (fast, stateless, for distributed computing), GapEncoder (Gamma-Poisson model, for latent categories), StringEncoder (randomized linear algebra), TextEncoder (pre-trained language models).
  19. Tables, not columns: the TableVectorizer.
        tab_vec = skrub.TableVectorizer()
        X = tab_vec.fit_transform(df)
      Heuristics for different columns:
        strings with ≥ 30 categories ⇒ StringEncoder
        date/time ⇒ DatetimeEncoder
        non-string discrete ⇒ TargetEncoder
        ...
        learner = sklearn.pipeline.make_pipeline(
            skrub.TableVectorizer(),
            sklearn.ensemble.HistGradientBoostingClassifier())
      A strong baseline.
  20. Tables, not columns: the TableVectorizer, with the same per-column heuristics. Play with the demo code: GaelVaroquaux/skrub_presentation_2025 — 02_preparation_simple.py
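The per-column dispatch can be pictured as a simple rule per column. This is a toy illustration of the idea; the function name and thresholds here are invented and are not skrub's actual internals:

```python
import pandas as pd

def pick_encoder(col: pd.Series, high_cardinality=30):
    """Toy heuristic: choose an encoder name from a column's type/content."""
    if pd.api.types.is_datetime64_any_dtype(col):
        return "DatetimeEncoder"
    if col.dtype == object:
        return ("StringEncoder" if col.nunique() >= high_cardinality
                else "OneHotEncoder")
    return "passthrough"  # numeric columns go straight to the model

df = pd.DataFrame({"hired": pd.to_datetime(["1988-09-12", "2006-06-26"]),
                   "title": ["Librarian", "Police Aide"],
                   "age": [42, 45]})
print({c: pick_encoder(df[c]) for c in df.columns})
```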
  21. Tables are messy: the Cleaner.
        cleaner = skrub.Cleaner()
        X = cleaner.fit_transform(df)
      Sanitizing types: detecting & converting datetimes, detecting & converting NA... Included in TableVectorizer; also useful as a pipeline component.
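What "detecting & converting" means can be sketched in plain pandas (a hand-rolled simplification of the idea, not skrub's Cleaner):

```python
import pandas as pd

df = pd.DataFrame({"hired": ["09/12/1988", "06/26/2006", "N/A"],
                   "age": ["42", "45", ""]})

# Detect & convert datetimes: unparsable values become proper NaT
hired = pd.to_datetime(df["hired"], format="%m/%d/%Y", errors="coerce")

# Detect & convert NA: non-numeric strings become real missing values
age = pd.to_numeric(df["age"], errors="coerce")

print(hired.isna().sum(), age.isna().sum())  # 1 1
```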
  22. Users want simple "actions" on data, seeing immediate modifications — not an sklearn pipeline. Pandas is way more expressive (download numbers again: PyTorch 34M, scikit-learn 80M, pandas 282M). And real assembly is across tables.
  23. But pandas code cannot be put in production, cannot be applied to new data, cannot be tuned.
  24. A real analysis isn't like a sklearn pipeline:
        aggregated_products = (products
            .groupby("basket_ID").agg("mean").reset_index())
        features = basket_IDs.merge(
            aggregated_products, on="basket_ID")
        from sklearn.ensemble import ExtraTreesClassifier
        ExtraTreesClassifier().fit(features, fraud_flags)
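A self-contained toy version of this wrangling (basket/fraud data invented for illustration):

```python
import pandas as pd
from sklearn.ensemble import ExtraTreesClassifier

products = pd.DataFrame({"basket_ID": [1, 1, 2, 2, 3],
                         "price": [10.0, 20.0, 5.0, 7.0, 100.0]})
baskets = pd.DataFrame({"basket_ID": [1, 2, 3],
                        "fraud_flag": [0, 0, 1]})

# Aggregate the per-product table down to one row per basket
aggregated = products.groupby("basket_ID").agg("mean").reset_index()

# Join the aggregate onto the table carrying the target
features = baskets[["basket_ID"]].merge(aggregated, on="basket_ID")

clf = ExtraTreesClassifier(random_state=0).fit(features,
                                               baskets["fraud_flag"])
print(features.shape)  # (3, 2)
```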
  25. skrub DataOps: wrap this wrangling code.
        # Wrap the inputs of our skrub pipeline
        products = skrub.var("products", products_df)
        baskets = skrub.var("baskets", baskets_df)
        # Specify "X" and "y" variables for machine learning
        basket_IDs = baskets[["basket_ID"]].skb.mark_as_X()
        fraud_flags = baskets["fraud_flag"].skb.mark_as_y()
        aggregated_products = (products.groupby("basket_ID")
            .agg("mean").reset_index())
        features = basket_IDs.merge(aggregated_products, on="basket_ID")
        predictions = features.skb.apply(
            ExtraTreesClassifier(), y=fraud_flags)
      [computation-graph residue: VAR 'baskets' → X: GETITEM ['basket_ID'], y: GETITEM 'fraud_flag' → merge with (VAR 'products' → groupby → agg → reset_index) → APPLY ExtraTreesClassifier]
  26. skrub DataOps: put this wrangling in production.
        products = skrub.var("products", products_df)
        baskets = skrub.var("baskets", baskets_df)
        # ...
        aggregated_products = (products.groupby("basket_ID")
            .agg("mean").reset_index())
        features = basket_IDs.merge(aggregated_products, on="basket_ID")
        predictions = features.skb.apply(
            ExtraTreesClassifier(), y=fraud_flags)
        # Apply a predictor to new data
        predictor = predictions.skb.make_learner(fitted=True)
        y_pred = predictor.predict(
            {"baskets": baskets_df_test,
             "products": products_df_test})
  27. skrub DataOps: cross-validate, tune this wrangling.
        aggregated_products = products.groupby("basket_ID").agg(
            skrub.choose_from(("mean", "max", "count"))).reset_index()
        features = basket_IDs.merge(aggregated_products, on="basket_ID")
        predictions = features.skb.apply(
            ExtraTreesClassifier(), y=fraud_flags)
        search = predictions.skb.make_grid_search()
        search.plot_results()
  28. skrub DataOps: cross-validate, tune this wrangling (same code as above). Play with the demo code: GaelVaroquaux/skrub_presentation_2025 — 04_assembling_simple.py
  29. Addressing the validation bottleneck: skore (new).
        proj = skore.Project("project")
        proj.put(skore.EstimatorReport(
            RandomForestClassifier(),
            X_train=X_train, X_test=X_test,
            y_train=y_train, y_test=y_test))
        proj.put(skore.EstimatorReport(
            LogisticRegression(),
            X_train=X_train, X_test=X_test,
            y_train=y_train, y_test=y_test))
        proj.summarize()
  30. Great ML tools: scikit-learn — democratizing predictive models; skrub (new) — data wrangling meets ML; skore (new) — domain evaluation.
  31. Spinning off: Probabl. Mission: a non-captured data-science stack. A "start-up"-like structure: flexible hiring and organization... answering the needs of industry. Model: open-source core, services & integration.
  32. skrub: machine learning with dataframes.
      Visualizing dataframes: skrub.TableReport(df)
      Vectorizing dataframes for machine learning: skrub.TableVectorizer().fit_transform(df), skrub.Cleaner().fit_transform(df)
      All kinds of column transformers (strings, dates, types)
      DataOps: arbitrary data wrangling — wrap any data transformation code
  33. References: P. Cerda and G. Varoquaux. Encoding high-cardinality string categorical variables. IEEE Transactions on Knowledge and Data Engineering, 2020.