Applied machine learning with tidymodels

useR! 2022 keynote

Julia Silge

June 22, 2022

Transcript

  1. Applied machine learning with tidymodels

    Julia Silge @j l
  2. What's the hardest part about machine learning in practice?
  3. library(tidymodels)

        #> ── Attaching packages ────────────────────────────────────────────── tidymodels 0.2.0 ──
        #> ✔ broom        0.8.0     ✔ rsample      0.1.1
        #> ✔ dials        1.0.0     ✔ tibble       3.1.7
        #> ✔ dplyr        1.0.9     ✔ tidyr        1.2.0
        #> ✔ infer        1.0.2     ✔ tune         0.2.0
        #> ✔ modeldata    0.1.1     ✔ workflows    0.2.6
        #> ✔ parsnip      1.0.0     ✔ workflowsets 0.2.1
        #> ✔ purrr        0.3.4     ✔ yardstick    1.0.0
        #> ✔ recipes      0.2.0
        #> ── Conflicts ───────────────────────────────────────────────── tidymodels_conflicts() ──
        #> ✖ purrr::discard() masks scales::discard()
        #> ✖ dplyr::filter()  masks stats::filter()
        #> ✖ dplyr::lag()     masks stats::lag()
        #> ✖ recipes::step()  masks stats::step()
        #> • Dig deeper into tidy modeling with R at https://www.tmwr.org
  4. Three topics for today

    - Spend your data budget
    - Where does your model start and end?
    - Get your model off your laptop
  5. initial_split()

    S t r y t a t n t g s

        penguins_split <- initial_split(penguins, prop = 0.75)
        penguins_split
        #> <Training/Testing/Total>
        #> <249/84/333>
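
The split on this slide is purely random. A minimal sketch of a stratified variant follows. It assumes the 333-row `penguins` data is `palmerpenguins::penguins` with incomplete rows dropped, which the slides do not show. The `strata = species` argument is an addition for illustration, not part of the original slide.

```r
library(rsample)

# Assumption: the deck's 333-row data set is palmerpenguins::penguins
# with incomplete rows removed.
penguins <- tidyr::drop_na(palmerpenguins::penguins)

set.seed(123)
# Stratify on the outcome so each species keeps roughly the same
# share in the training and testing sets.
penguins_split <- initial_split(penguins, prop = 0.75, strata = species)

penguins_train <- training(penguins_split)
penguins_test  <- testing(penguins_split)

# Every row lands in exactly one of the two sets.
nrow(penguins_train) + nrow(penguins_test) == nrow(penguins)
```

Stratifying matters most when one class is rare; with three reasonably balanced species the random split on the slide is likely fine as well.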
  6. training() and testing()

    C t g n t t o rsplit

        penguins_train <- training(penguins_split)
        penguins_train
        #> # A tibble: 249 × 8
        #>   species island bill_length_mm bill_depth_mm flipper_length_… body_mass_g sex
        #>   <fct>   <fct>           <dbl>         <dbl>            <int>       <int> <fct>
        #> 1 Chinst… Dream            47.6          18.3              195        3850 fema…
        #> 2 Adelie  Torge…           35.7          17                189        3350 fema…
        #> 3 Gentoo  Biscoe           45.5          15                220        5000 male
        #> 4 Gentoo  Biscoe           48.7          15.7              208        5350 male
        #> 5 Gentoo  Biscoe           46.5          13.5              210        4550 fema…
        #> # … with 244 more rows, and 1 more variable: year <int>
  7. training() and testing()

    C t g n t t o rsplit

        penguins_test <- testing(penguins_split)
        penguins_test
        #> # A tibble: 84 × 8
        #>   species island bill_length_mm bill_depth_mm flipper_length_… body_mass_g sex
        #>   <fct>   <fct>           <dbl>         <dbl>            <int>       <int> <fct>
        #> 1 Adelie  Torge…           40.3          18                195        3250 fema…
        #> 2 Adelie  Torge…           36.7          19.3              193        3450 fema…
        #> 3 Adelie  Torge…           36.6          17.8              185        3700 fema…
        #> 4 Adelie  Torge…           34.4          18.4              184        3325 fema…
        #> 5 Adelie  Torge…           46            21.5              194        4200 male
        #> # … with 79 more rows, and 1 more variable: year <int>
  8. How can we use the training set to compare, evaluate, and tune models?
  9. Cross-validation

    [Diagram: 30 numbered data points in shuffled order]
  10. Cross-validation

    [Diagram of 30 numbered data points. Labels: Model Fit Using; Estimate Performance Using; Fold 1, Fold 2, Fold 3; Iteration]
  11. Cross-validation

        set.seed(123)
        vfold_cv(penguins_train, strata = species)
        #> #  10-fold cross-validation using stratification
        #> # A tibble: 10 × 2
        #>    splits           id
        #>    <list>           <chr>
        #>  1 <split [223/26]> Fold01
        #>  2 <split [223/26]> Fold02
        #>  3 <split [223/26]> Fold03
        #>  4 <split [224/25]> Fold04
        #>  5 <split [224/25]> Fold05
        #>  6 <split [224/25]> Fold06
        #>  7 <split [225/24]> Fold07
        #>  8 <split [225/24]> Fold08
        #>  9 <split [225/24]> Fold09
        #> 10 <split [225/24]> Fold10
  12. Bootstrapping

    [Diagram of 30 numbered data points sampled with replacement. Labels: Model Fit Using; Estimate Performance Using; Bootstrap Iteration 1, Bootstrap Iteration 2, Bootstrap Iteration 3]
  13. Bootstrapping

        set.seed(123)
        bootstraps(penguins_train, strata = species)
        #> # Bootstrap sampling using stratification
        #> # A tibble: 25 × 2
        #>    splits           id
        #>    <list>           <chr>
        #>  1 <split [249/91]> Bootstrap01
        #>  2 <split [249/93]> Bootstrap02
        #>  3 <split [249/96]> Bootstrap03
        #>  4 <split [249/88]> Bootstrap04
        #>  5 <split [249/89]> Bootstrap05
        #>  6 <split [249/82]> Bootstrap06
        #>  7 <split [249/87]> Bootstrap07
        #>  8 <split [249/87]> Bootstrap08
        #>  9 <split [249/85]> Bootstrap09
        #> 10 <split [249/95]> Bootstrap10
        #> # … with 15 more rows
  14. Resampling methods

    Spend your data wisely to create simulated validation set(s)

        vfold_cv()
        loo_cv()
        mc_cv()
        bootstraps()
        validation_split()
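
Each resample pairs a set of rows used for fitting with a set held out for estimating performance. The sketch below shows how those two pieces might be pulled out of a single cross-validation fold and a single bootstrap sample with `analysis()` and `assessment()`. The data source (`palmerpenguins::penguins` with incomplete rows dropped) is an assumption, since the slides do not show how the data was prepared.

```r
library(rsample)

# Assumption: palmerpenguins::penguins with incomplete rows removed.
penguins <- tidyr::drop_na(palmerpenguins::penguins)

set.seed(123)
penguins_train <- training(initial_split(penguins, prop = 0.75))

folds <- vfold_cv(penguins_train, v = 10, strata = species)
boots <- bootstraps(penguins_train, times = 25, strata = species)

# analysis() returns the rows used to fit the model,
# assessment() the rows held out to estimate performance.
fold_fit  <- analysis(folds$splits[[1]])
fold_eval <- assessment(folds$splits[[1]])
boot_fit  <- analysis(boots$splits[[1]])
boot_eval <- assessment(boots$splits[[1]])

# A fold partitions the training set: the two pieces add up to all rows.
nrow(fold_fit) + nrow(fold_eval) == nrow(penguins_train)

# A bootstrap sample is drawn with replacement and is as large as the
# training set; its assessment set holds the rows that were never drawn.
nrow(boot_fit) == nrow(penguins_train)
```

This is why the bootstrap splits print with a constant first number and a varying second one, while the fold sizes always sum to the training set size.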
  15. workflows

    https://workflows.tidymodels.org/
  16. Where does your model start and end?

        rf_spec <- rand_forest(mode = "classification")
        penguin_formula <- species ~ bill_length_mm + bill_depth_mm + sex
  17. Where does your model start and end?

        workflow(penguin_formula, rf_spec)
        #> ══ Workflow ════════════════════════════════════════════════════════════════════════════
        #> Preprocessor: Formula
        #> Model: rand_forest()
        #>
        #> ── Preprocessor ────────────────────────────────────────────────────────────────────────
        #> species ~ bill_length_mm + bill_depth_mm + sex
        #>
        #> ── Model ───────────────────────────────────────────────────────────────────────────────
        #> Random Forest Model Specification (classification)
        #>
        #> Computational engine: ranger
  18. Where does your model start and end?

        workflow(penguin_formula, rf_spec) %>%
          fit(data = penguins_train)
        #> ══ Workflow [trained] ══════════════════════════════════════════════════════════════════
        #> Preprocessor: Formula
        #> Model: rand_forest()
        #>
        #> ── Preprocessor ────────────────────────────────────────────────────────────────────────
        #> species ~ bill_length_mm + bill_depth_mm + sex
        #>
        #> ── Model ───────────────────────────────────────────────────────────────────────────────
        #> Ranger result
        #>
        #> Call:
        #>  ranger::ranger(x = maybe_data_frame(x), y = y, num.threads = 1,
        #>      verbose = FALSE, seed = sample.int(10^5, 1), probability = TRUE)
        #>
        #> Type:                             Probability estimation
        #> Number of trees:                  500
        #> Sample size:                      249
        #> Number of independent variables:  3
        #> Mtry:                             1
        #> Target node size:                 10
        #> Variable importance mode:         none
        #> Splitrule:                        gini
        #> OOB prediction error (Brier s.):  0.05585744
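
A workflow can be fitted to resamples as well as to the full training set. The sketch below combines the random forest workflow with 10-fold cross-validation using `fit_resamples()` and `collect_metrics()` from the tune package. Neither call appears on the slides, and the data source (`palmerpenguins::penguins` with incomplete rows dropped) and the object names `penguin_folds`, `rf_wf`, `rf_res`, and `rf_metrics` are assumptions for illustration.

```r
library(tidymodels)

# Assumption: palmerpenguins::penguins with incomplete rows removed.
penguins <- tidyr::drop_na(palmerpenguins::penguins)

set.seed(123)
penguins_split <- initial_split(penguins, prop = 0.75)
penguins_train <- training(penguins_split)
penguin_folds  <- vfold_cv(penguins_train, strata = species)

rf_spec <- rand_forest(mode = "classification")  # default engine: ranger
penguin_formula <- species ~ bill_length_mm + bill_depth_mm + sex
rf_wf <- workflow(penguin_formula, rf_spec)

# Fit the workflow once per fold and score it on that fold's held-out rows.
rf_res <- fit_resamples(rf_wf, resamples = penguin_folds)

# One row per metric, averaged over the folds.
rf_metrics <- collect_metrics(rf_res)
rf_metrics
```

Because the testing set is never touched here, the resampled estimates can be used to compare candidate models without spending the final holdout.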
  19. Where does your model start and end?

        penguin_rec <- recipe(species ~ bill_length_mm + bill_depth_mm + sex,
                              data = penguins_train) %>%
          step_dummy(sex) %>%
          step_normalize(all_numeric_predictors())

        penguin_rec
        #> Recipe
        #>
        #> Inputs:
        #>
        #>       role #variables
        #>    outcome          1
        #>  predictor          3
        #>
        #> Operations:
        #>
        #> Dummy variables from sex
        #> Centering and scaling for all_numeric_predictors()
  20. Where does your model start and end?

        svm_spec <- svm_linear(mode = "classification")
        workflow(penguin_rec, svm_spec)
        #> ══ Workflow ════════════════════════════════════════════════════════════════════════════
        #> Preprocessor: Recipe
        #> Model: svm_linear()
        #>
        #> ── Preprocessor ────────────────────────────────────────────────────────────────────────
        #> 2 Recipe Steps
        #>
        #> • step_dummy()
        #> • step_normalize()
        #>
        #> ── Model ───────────────────────────────────────────────────────────────────────────────
        #> Linear Support Vector Machine Specification (classification)
        #>
        #> Computational engine: LiblineaR
  21. Where does your model start and end?

        penguin_fit <- workflow(penguin_rec, svm_spec) %>%
          fit(data = penguins_train)
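
Because the recipe and the model are fitted together, the trained workflow applies the same preprocessing at prediction time. The sketch below predicts on the testing set and computes accuracy with yardstick. The data source (`palmerpenguins::penguins` with incomplete rows dropped) and the names `test_pred` and `test_acc` are assumptions, and this evaluation step is not shown on the slides.

```r
library(tidymodels)

# Assumption: palmerpenguins::penguins with incomplete rows removed.
penguins <- tidyr::drop_na(palmerpenguins::penguins)

set.seed(123)
penguins_split <- initial_split(penguins, prop = 0.75)
penguins_train <- training(penguins_split)
penguins_test  <- testing(penguins_split)

penguin_rec <- recipe(species ~ bill_length_mm + bill_depth_mm + sex,
                      data = penguins_train) %>%
  step_dummy(sex) %>%
  step_normalize(all_numeric_predictors())

svm_spec <- svm_linear(mode = "classification")  # default engine: LiblineaR

penguin_fit <- workflow(penguin_rec, svm_spec) %>%
  fit(data = penguins_train)

# predict() runs the trained recipe on the new rows before calling the
# model, so the raw testing data can be passed in directly.
test_pred <- predict(penguin_fit, new_data = penguins_test) %>%
  bind_cols(penguins_test %>% select(species))

test_acc <- accuracy(test_pred, truth = species, estimate = .pred_class)
test_acc
```

The normalization statistics come from the training set only, so no information from the testing rows leaks into the preprocessing.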
  22. Get your model off your laptop

        library(vetiver)
        v <- vetiver_model(penguin_fit, "svm_penguins")
        v
        #>
        #> ── svm_penguins ─ <butchered_workflow> model for deployment
        #> A LiblineaR classification modeling workflow using 3 features
  23. Get your model off your laptop

        library(plumber)
        pr() %>%
          vetiver_api(v)
        #> # Plumber router with 2 endpoints, 4 filters, and 1 sub-router.
        #> # Use `pr_run()` on this object to start the API.
        #> ├──[queryString]
        #> ├──[body]
        #> ├──[cookieParser]
        #> ├──[sharedSecret]
        #> ├──/logo
        #> ├──/ping (GET)
        #> └──/predict (POST)
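
Once the router is running, the `/predict` endpoint can be called from another R session. The sketch below uses `vetiver_endpoint()` for this. The URL and port are hypothetical (they assume the API was started locally with something like `pr_run(port = 8080)`), the `api_is_running` flag is an illustrative guard rather than part of vetiver, and the two penguins are made-up inputs.

```r
library(vetiver)

# Hypothetical address: assumes the router was started locally,
# for example with `pr_run(port = 8080)`.
endpoint <- vetiver_endpoint("http://127.0.0.1:8080/predict")

# New observations need the same predictor columns the workflow
# was trained on.
new_penguins <- data.frame(
  bill_length_mm = c(40.3, 46.5),
  bill_depth_mm  = c(18.0, 13.5),
  sex            = c("female", "female")
)

# Illustrative guard: flip this only when the API is actually running,
# otherwise the request has nothing to connect to.
api_is_running <- FALSE
if (api_is_running) {
  predict(endpoint, new_penguins)
}
```

The caller sends raw predictor values; the dummy encoding and normalization happen inside the deployed workflow.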
  24. Get your model off your laptop

    - P -b d R C
    - G e D l o o d e t
  25. Get your model off your laptop

        # Generated by the vetiver package; edit with care
        FROM rocker/r-ver:4.2.0
        ENV RENV_CONFIG_REPOS_OVERRIDE https://packagemanager.rstudio.com/cran/latest
        RUN apt-get update -qq && apt-get install -y --no-install-recommends \
          libcurl4-openssl-dev \
          libicu-dev \
          libsodium-dev \
          libssl-dev \
          make
        COPY vetiver_renv.lock renv.lock
        RUN Rscript -e "install.packages('renv')"
        RUN Rscript -e "renv::restore()"
        COPY plumber.R /opt/ml/plumber.R
        EXPOSE 8000
        ENTRYPOINT ["R", "-e", "pr <- plumber::plumb('/opt/ml/plumber.R'); pr$run(host = '0.0.0.0', port = 8000)"]
  26. Thank you!

    h ://y .c /j l /
    h ://j l .c /
    h ://t e .o /
    h ://t .o /
    P a M U h