
Skrub: machine-learning with dataframes

Gael Varoquaux
September 03, 2025


While data science talks a lot about machine learning, much of the actual work is preparing and assembling dataframes, and that work is largely manual. Here I introduce skrub, a young package that eases machine learning with dataframes. It provides a variety of tools to plug scikit-learn-style machine learning into complex and messy dataframes with no manual work.

I also cover the exciting "DataOps" plans introduced in the new release, which wrap and record any data-assembly or wrangling pipeline and apply machine-learning workflows to it: applying the plan to new data, cross-validating it, or tuning it to maximize prediction accuracy on a task.



Transcript

  1. Data preparation ≫ machine learning. Download numbers⋆ don't lie: PyTorch 34M, scikit-learn 80M, pandas 282M. ⋆ from pypistats.org
  2. Tabular data: statistical properties. Columns of different nature, different distributions: categorical data, strings, dates...
       Sex  Experience  Age  Employee Position Title
       M    10 yrs      42   Master Police Officer
       F    23 yrs      NA   Social Worker IV
       M    3 yrs       28   Police Officer III
       F    16 yrs      45   Police Aide
       M    13 yrs      48   Electrician I
       M    6 yrs       36   Bus Operator
  3. Real tables are too messy for sklearn (and most ML tools): columns of different types (strings, dates...)
       Sex  Age  Position Title
       M    42   Police Officer
       F    NA   Social Worker IV
       M    28   Police Officer III
       F    45   Police Aide
       M    48   Electrician I
       M    36   Bus Operator
       M    62   Bus Operator
  4. Real tables are too messy for sklearn – skrub preprocesses them. Column-specific preprocessing, casting to numbers:
       tab_vec = skrub.TableVectorizer()
       X = tab_vec.fit_transform(df)
       learner = sklearn.pipeline.make_pipeline(
           skrub.TableVectorizer(),
           HistGradientBoostingClassifier())
     A strong baseline.
  5. It's just a matter of data preparation: Gaussianization, encoding strings, removing outliers... converting datetimes, escaping unseen categories. [table residue: experience / age / size columns with missing values]
  6. Data transformation: separating out specification/fitting.
       df = pd.read_csv('employee_salary.csv')
       Gender  Date Hired  Employee Position Title
       M       09/12/1988  Master Police Officer
       F       06/26/2006  Social Worker III
       M       07/16/2007  Police Officer III
       F       01/26/2000  Library Assistant I
     Vectorize: transform to a numerical matrix.
  7. Data transformation: separating out specification/fitting.
       df = pd.read_csv('employee_salary.csv')
     Gender: categorical encoding
       df = pd.get_dummies(df)
     Apply to new data? pd.get_dummies again?
     - Columns in the same order?
     - What if new / different categories?
       Gender (M)  Gender (F)  ...
       1           0
       0           1
       1           0
       0           1
  8. Data transformation: separating out specification/fitting.
     Gender: categorical encoding
       ohe = OneHotEncoder()
       ohe.fit(df)
       X_test = ohe.transform(df_test)
     sklearn's separation of fit & transform enables putting in production, and model evaluation without data leakage.
  9. Transformations adapted to column types. Date Hired:
     - time in seconds, day of week, month
     - periodic splines? ...
       Date Hired  Position Title
       09/12/1988  Librarian
       06/26/2006  Social Worker IV
       07/16/2007  Police Officer III
       01/26/2000  Police Aide
       dte = skrub.DatetimeEncoder()
       dte.fit(df['Date Hired'])
       X = dte.transform(df['Date Hired'])
  10. Transformations adapted to column types.
      Date Hired: time in seconds, day of week, month; periodic splines? ...
        dte = skrub.DatetimeEncoder()
        dte.fit(df['Date Hired'])
        X = dte.transform(df['Date Hired'])
      Position Title: string categorical encoders
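The kind of features a datetime encoder derives can be sketched in plain pandas (a simplified stand-in to show the idea, not skrub's actual implementation):

```python
import pandas as pd

dates = pd.to_datetime(pd.Series(["09/12/1988", "06/26/2006"],
                                 name="Date Hired"), format="%m/%d/%Y")

# Expand one datetime column into several numeric feature columns
features = pd.DataFrame({
    "year": dates.dt.year,
    "month": dates.dt.month,
    "day_of_week": dates.dt.dayofweek,          # Monday = 0
    "epoch_seconds": dates.astype("int64") // 10**9,
})
print(features["year"].tolist())  # [1988, 2006]
```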
  11. Modeling strings, rather than categories. The notion of category ⇔ entity normalization.
        Drug Name: alcohol, ethyl alcohol, isopropyl alcohol, polyvinyl alcohol, isopropyl alcohol swab, 62% ethyl alcohol, alcohol 68%, alcohol denat, benzyl alcohol, dehydrated alcohol
        Employee Position Title: Police Aide, Master Police Officer, Mechanic Technician II, Police Officer III, Senior Architect, Senior Engineer Technician, Social Worker III
  12. String encoding: via substrings. Sub-string count matrices: each entry ('police officer', 'polis', 'policeman', 'policier'...) becomes a row of counts over character 3-grams ('pol', 'lic', 'ice', 'off', 'cer', ...). Then reduce it: lighter representations. [Cerda and Varoquaux 2020]
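A bare-bones version of such a 3-gram count matrix, in pure Python (illustration of the idea only, not skrub's implementation):

```python
from collections import Counter

def char_ngrams(s, n=3):
    """All overlapping character n-grams of a string."""
    return [s[i:i + n] for i in range(len(s) - n + 1)]

entries = ["police officer", "policeman", "policier"]
vocab = sorted({g for e in entries for g in char_ngrams(e)})

# One row per entry, one column per 3-gram in the vocabulary
matrix = [[Counter(char_ngrams(e))[g] for g in vocab] for e in entries]

# Similar strings share columns: 'pol' appears in every row
print(len(matrix), len(vocab))
```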
  13. String encoding: via substrings. MinHashEncoder: fast, stateless (for distributed computing)
      - collisions approximate the Jaccard index
      - great for tree-based models
      - not interpretable features
      [plot residue: min-hash encodings of 'Senior Technician Supply' vs 'Senior Supply Technician' on the employee salaries data] [Cerda and Varoquaux 2020]
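Why collisions approximate the Jaccard index can be shown with a toy MinHash over 3-gram sets (a simplified sketch, not skrub's MinHashEncoder):

```python
import hashlib

def ngrams(s, n=3):
    return {s[i:i + n] for i in range(len(s) - n + 1)}

def minhash(items, n_hashes=200):
    """Min over each of n_hashes salted hashes: stateless, no fit needed."""
    return [min(int(hashlib.md5(f"{seed}:{x}".encode()).hexdigest(), 16)
                for x in items)
            for seed in range(n_hashes)]

a, b = ngrams("police officer"), ngrams("policeman")
sig_a, sig_b = minhash(a), minhash(b)

# Fraction of matching signature slots ≈ Jaccard similarity of the sets
approx = sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)
exact = len(a & b) / len(a | b)
print(round(approx, 2), round(exact, 2))
```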
  14. String encoding: via substrings. MinHashEncoder: fast, stateless (for distributed computing). Matrix factorizations: the substring count matrix factors into (entries × topics) · (topics × substrings) — what substrings are in a latent category, what latent categories are in an entry. [Cerda and Varoquaux 2020]
  15. String encoding: via substrings. MinHashEncoder: fast, stateless (for distributed computing). GapEncoder: Gamma-Poisson model, for latent categories — e.g. job titles ('Legislative Analyst II', 'Bus Operator', 'Master Police Officer', 'Police Sergeant'...) loading onto interpretable latent categories ('...operator', '...manager', '...officer'). [Cerda and Varoquaux 2020]
  16. String encoding: via substrings. MinHashEncoder: fast, stateless (for distributed computing). GapEncoder: Gamma-Poisson model, for latent categories. StringEncoder: randomized linear algebra on the substring count matrix.
  17. String encoding. MinHashEncoder: fast, stateless (for distributed computing). GapEncoder: Gamma-Poisson model, for latent categories. StringEncoder: randomized linear algebra. TextEncoder: pre-trained language models, downloaded from the Hugging Face hub.
  18. String encoding, in summary: MinHashEncoder (fast, stateless, for distributed computing), GapEncoder (Gamma-Poisson model, for latent categories), StringEncoder (randomized linear algebra), TextEncoder (pre-trained language models).
  19. Tables, not columns: the TableVectorizer.
        tab_vec = skrub.TableVectorizer()
        X = tab_vec.fit_transform(df)
      Heuristics for different columns:
        strings with ≥ 30 categories ⇒ StringEncoder
        date/time ⇒ DatetimeEncoder
        non-string discrete ⇒ TargetEncoder
        ...
        learner = sklearn.pipeline.make_pipeline(
            skrub.TableVectorizer(),
            sklearn.ensemble.HistGradientBoostingClassifier())
      A strong baseline.
  20. Tables, not columns: the TableVectorizer, with the same per-column heuristics. Play with the demo code: GaelVaroquaux/skrub_presentation_2025 — 02_preparation_simple.py
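The per-column dispatch can be pictured as a simple rule per column. This is a toy illustration of the idea; the function name and thresholds here are invented and are not skrub's actual internals:

```python
import pandas as pd

def pick_encoder(col: pd.Series, high_cardinality=30):
    """Toy heuristic: choose an encoder name from a column's type/content."""
    if pd.api.types.is_datetime64_any_dtype(col):
        return "DatetimeEncoder"
    if col.dtype == object:
        return ("StringEncoder" if col.nunique() >= high_cardinality
                else "OneHotEncoder")
    return "passthrough"  # numeric columns go straight to the model

df = pd.DataFrame({"hired": pd.to_datetime(["1988-09-12", "2006-06-26"]),
                   "title": ["Librarian", "Police Aide"],
                   "age": [42, 45]})
print({c: pick_encoder(df[c]) for c in df.columns})
```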
  21. Tables are messy: the Cleaner.
        cleaner = skrub.Cleaner()
        X = cleaner.fit_transform(df)
      Sanitizing types: detecting & converting datetimes, detecting & converting NA... Included in TableVectorizer; also useful as a pipeline component.
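What "detecting & converting" means can be sketched in plain pandas (a hand-rolled simplification of the idea, not skrub's Cleaner):

```python
import pandas as pd

df = pd.DataFrame({"hired": ["09/12/1988", "06/26/2006", "N/A"],
                   "age": ["42", "45", ""]})

# Detect & convert datetimes: unparsable values become proper NaT
hired = pd.to_datetime(df["hired"], format="%m/%d/%Y", errors="coerce")

# Detect & convert NA: non-numeric strings become real missing values
age = pd.to_numeric(df["age"], errors="coerce")

print(hired.isna().sum(), age.isna().sum())  # 1 1
```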
  22. Users want simple "actions" on data, seeing immediate modifications — not an sklearn pipeline. Pandas is way more expressive (download numbers again: PyTorch 34M, scikit-learn 80M, pandas 282M). And real assembly is across tables.
  23. But pandas code cannot be put in production, cannot be applied to new data, cannot be tuned.
  24. A real analysis isn't like a sklearn pipeline:
        aggregated_products = (products
            .groupby("basket_ID").agg("mean").reset_index())
        features = basket_IDs.merge(
            aggregated_products, on="basket_ID")
        from sklearn.ensemble import ExtraTreesClassifier
        ExtraTreesClassifier().fit(features, fraud_flags)
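A self-contained toy version of this wrangling (basket/fraud data invented for illustration):

```python
import pandas as pd
from sklearn.ensemble import ExtraTreesClassifier

products = pd.DataFrame({"basket_ID": [1, 1, 2, 2, 3],
                         "price": [10.0, 20.0, 5.0, 7.0, 100.0]})
baskets = pd.DataFrame({"basket_ID": [1, 2, 3],
                        "fraud_flag": [0, 0, 1]})

# Aggregate the per-product table down to one row per basket
aggregated = products.groupby("basket_ID").agg("mean").reset_index()

# Join the aggregate onto the table carrying the target
features = baskets[["basket_ID"]].merge(aggregated, on="basket_ID")

clf = ExtraTreesClassifier(random_state=0).fit(features,
                                               baskets["fraud_flag"])
print(features.shape)  # (3, 2)
```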
  25. skrub DataOps: wrap this wrangling code.
        # Wrap the inputs of our skrub pipeline
        products = skrub.var("products", products_df)
        baskets = skrub.var("baskets", baskets_df)
        # Specify "X" and "y" variables for machine learning
        basket_IDs = baskets[["basket_ID"]].skb.mark_as_X()
        fraud_flags = baskets["fraud_flag"].skb.mark_as_y()
        aggregated_products = (products.groupby("basket_ID")
            .agg("mean").reset_index())
        features = basket_IDs.merge(aggregated_products, on="basket_ID")
        predictions = features.skb.apply(
            ExtraTreesClassifier(), y=fraud_flags)
      [computation-graph residue: VAR 'baskets' → X: GETITEM ['basket_ID'], y: GETITEM 'fraud_flag' → merge with (VAR 'products' → groupby → agg → reset_index) → APPLY ExtraTreesClassifier]
  26. skrub DataOps: put this wrangling in production.
        products = skrub.var("products", products_df)
        baskets = skrub.var("baskets", baskets_df)
        # ...
        aggregated_products = (products.groupby("basket_ID")
            .agg("mean").reset_index())
        features = basket_IDs.merge(aggregated_products, on="basket_ID")
        predictions = features.skb.apply(
            ExtraTreesClassifier(), y=fraud_flags)
        # Apply a predictor to new data
        predictor = predictions.skb.make_learner(fitted=True)
        y_pred = predictor.predict(
            {"baskets": baskets_df_test,
             "products": products_df_test})
  27. skrub DataOps: cross-validate, tune this wrangling.
        aggregated_products = products.groupby("basket_ID").agg(
            skrub.choose_from(("mean", "max", "count"))).reset_index()
        features = basket_IDs.merge(aggregated_products, on="basket_ID")
        predictions = features.skb.apply(
            ExtraTreesClassifier(), y=fraud_flags)
        search = predictions.skb.make_grid_search()
        search.plot_results()
  28. skrub DataOps: cross-validate, tune this wrangling (same code as above). Play with the demo code: GaelVaroquaux/skrub_presentation_2025 — 04_assembling_simple.py
  29. Addressing the validation bottleneck: skore (new).
        proj = skore.Project("project")
        proj.put(skore.EstimatorReport(
            RandomForestClassifier(),
            X_train=X_train, X_test=X_test,
            y_train=y_train, y_test=y_test))
        proj.put(skore.EstimatorReport(
            LogisticRegression(),
            X_train=X_train, X_test=X_test,
            y_train=y_train, y_test=y_test))
        proj.summarize()
  30. Great ML tools: scikit-learn — democratizing predictive models; skrub (new) — data wrangling meets ML; skore (new) — domain evaluation.
  31. Spinning off: Probabl. Mission: a non-captured data-science stack. A "start-up"-like structure: flexible hiring and organization... answering the needs of industry. Model: open-source core, services & integration.
  32. skrub: machine learning with dataframes.
      Visualizing dataframes: skrub.TableReport(df)
      Vectorizing dataframes for machine learning: skrub.TableVectorizer().fit_transform(df), skrub.Cleaner().fit_transform(df)
      All kinds of column transformers (strings, dates, types)
      DataOps: arbitrary data wrangling — wrap any data transformation code
  33. References: P. Cerda and G. Varoquaux. Encoding high-cardinality string categorical variables. IEEE Transactions on Knowledge and Data Engineering, 2020.