a matrix (except when it needs to be a data.frame)
- Must use formula or x/y (or both)
- Inconsistent naming of arguments (ntree in randomForest, num.trees in ranger)
- na.omit explicitly or silently
- May or may not accept factors
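Two of the inconsistencies listed above can be reproduced with base R alone. This is a small sketch rather than an example from the slides: it contrasts `lm()` (formula plus data.frame, missing rows dropped silently) with `lm.fit()` (x/y matrices, missing values rejected). The `dat` object and the injected `NA` are made up for illustration.

```r
# Base R only: lm() uses a formula and a data.frame, lm.fit() uses x/y matrices.
dat <- mtcars[, c("mpg", "wt", "hp")]
dat$wt[1] <- NA  # illustrative: inject one missing value

# Formula interface: the row with the NA is dropped silently,
# because the default na.action is na.omit.
fit_formula <- lm(mpg ~ wt + hp, data = dat)
n_used <- nobs(fit_formula)

# x/y interface: needs a numeric matrix (intercept column added by hand),
# and missing values in x are an error instead of being dropped.
x <- cbind(intercept = 1, as.matrix(dat[, c("wt", "hp")]))
fit_xy <- tryCatch(lm.fit(x, dat$mpg), error = function(e) e)
xy_failed <- inherits(fit_xy, "error")
```

Same model, same data, two interfaces, two different behaviours for missing values.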
cation from the computational engine
- Separate the definition of a model from its evaluation
- Harmonize argument names
- Make consistent predictions (always tibbles with na.omit=FALSE)
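A sketch of what these goals can look like in code, assuming the package being described is parsnip (the package name is not stated in the text above). `rand_forest()`, `set_engine()` and `translate()` are parsnip functions; the object names are made up for illustration.

```r
library(parsnip)

# The specification is defined once, with no data attached.
# `trees` is the harmonized argument name.
rf_spec <- rand_forest(mode = "classification", trees = 500)

# The same specification can be pointed at different computational engines.
rf_ranger <- set_engine(rf_spec, "ranger")
rf_randomforest <- set_engine(rf_spec, "randomForest")

# translate() prints the call template each engine would receive,
# which is where `trees` is mapped to the engine's own argument name.
translate(rf_ranger)
translate(rf_randomforest)
```

Nothing is estimated until `fit()` is called on one of these objects, and that step needs the corresponding engine package to be installed.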
relationship between variables

    fit_glm <- model_glm %>%
      fit(factor(am) ~ poly(mpg, 3) + pca(disp:wt)[1] + pca(disp:wt)[2] + pca(disp:wt)[3],
          data = mtcars)

- Not all inline functions can be used with formulas
- The same calculations have to be run many times
- Calculations are tied to the model and are not saved between models

Post by Max Kuhn about the bad parts of the formula method: https://rviews.rstudio.com/2017/03/01/the-r-formula-method-the-bad-parts/
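The "run many times" and "not saved between models" points can be seen with base R. This is a small sketch, not code from the slides: `std()` is a hypothetical helper that counts its own calls, and `lm()` stands in for any formula-based model function.

```r
# Hypothetical helper: standardize a vector and count how often it is called.
calls <- 0
std <- function(x) {
  calls <<- calls + 1
  as.numeric(scale(x))
}

# Inline transformation: it is recomputed for every model that uses it.
fit_a <- lm(mpg ~ std(wt), data = mtcars)
fit_b <- lm(qsec ~ std(wt), data = mtcars)
calls_inline <- calls

# Alternative: compute the transformation once, store it, and reuse it.
cars_std <- transform(mtcars, wt_std = as.numeric(scale(wt)))
fit_c <- lm(mpg ~ wt_std, data = cars_std)
fit_d <- lm(qsec ~ wt_std, data = cars_std)
```

`fit_a` and `fit_c` have the same coefficients. The difference is where the preprocessing lives: inside each model's formula, or in a separate step that can be stored and shared.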
with before you can start modeling
- Same unit (center and scale)
- Remove correlation (filter and PCA extraction)
- Missing data (imputation)
- Dummy variables
- Zero variance
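One possible way to write the checklist above as a preprocessing specification, assuming the recipes package in a reasonably recent version (`step_impute_mean()` and the `all_*_predictors()` selectors are not available in older releases). `example_rec` and the `cars` data are made up for illustration and are not the recipe used elsewhere in the slides.

```r
library(recipes)

# Illustrative data: turn cyl into a factor so there is something to dummy-code.
cars <- transform(mtcars, cyl = factor(cyl))

example_rec <- recipe(mpg ~ ., data = cars) %>%
  step_impute_mean(all_numeric_predictors()) %>%            # missing data
  step_zv(all_predictors()) %>%                             # zero variance
  step_normalize(all_numeric_predictors()) %>%              # center and scale
  step_corr(all_numeric_predictors(), threshold = 0.9) %>%  # correlation filter
  step_dummy(all_nominal_predictors())                      # dummy variables
```

`step_pca()` could replace `step_corr()` if PCA extraction is preferred over filtering. The recipe only describes the steps; nothing is estimated until it is prepped on training data.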
almost finished and I didn't want to change the data in all the other slides

    big_mtcars <- rerun(10, mtcars) %>% bind_rows()
    data_split <- initial_split(big_mtcars, strata = "mpg", prop = 0.80)

    # Training and test data
    cars_train <- training(data_split)
    cars_test <- testing(data_split)

    car_prep <- prep(car_rec, training = cars_train)

    # Preprocessed data
    cars_train_p <- juice(car_prep)
    cars_test_p <- bake(car_prep, new_data = cars_test)
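The idea behind preparing on the training set and then applying to the test set can be sketched with base R. This is an illustration of the principle only, not what the recipe in the slides does internally: centering and scaling parameters are estimated on the training rows and then reused unchanged on the test rows. The seed and object names are arbitrary.

```r
set.seed(123)  # arbitrary seed, only for reproducibility

# 80/20 split of the original mtcars rows
train_idx <- sample(nrow(mtcars), size = floor(0.8 * nrow(mtcars)))
train <- mtcars[train_idx, ]
test <- mtcars[-train_idx, ]

# "Prep": estimate the preprocessing parameters from the training data only.
centers <- colMeans(train)
scales <- apply(train, 2, sd)

# "Bake": apply the stored parameters to any data set, including the test data.
train_p <- scale(train, center = centers, scale = scales)
test_p <- scale(test, center = centers, scale = scales)
```

The test set never influences `centers` or `scales`, so information from the held-out rows cannot leak into the preprocessing.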