class: title-slide, center, bottom # tidymodels ## LA RUG June meeting ### Julia Silge --- name: clouds2 background-image: url(images/Clouds2.jpg) background-size: cover --- template: clouds2 class: middle, center # <i class="fas fa-heart"></i> Many thanks to Alison Hill and Desirée De Leon for their contributions to this talk --- name: clouds class: center, middle background-image: url(images/Clouds.jpg) background-size: cover --- template: clouds ## .big-text[Hello] --- template: clouds class: middle, center ### Julia Silge <img style="border-radius: 50%;" src="https://github.com/juliasilge.png" width="150px"/> [
@juliasilge](https://github.com/juliasilge) [
@juliasilge](https://twitter.com/juliasilge) [
youtube.com/juliasilge](https://youtube.com/juliasilge) [
juliasilge.com](https://juliasilge.com) --- class: inverse, middle, center # Machine learning with tidymodels --- class: top, center background-image: url(images/intro.002.jpeg) background-size: cover --- class: top, center background-image: url(images/intro.003.jpeg) background-size: cover --- background-image: url(images/tm-org.png) background-size: contain --- ```r library(tidymodels) ## ── Attaching packages ───────────────────────────────────────────────────────────────────── tidymodels 0.1.0 ── ## ✓ broom 0.5.6 ✓ rsample 0.0.7 ## ✓ dials 0.0.7 ✓ tune 0.1.0 ## ✓ infer 0.5.2 ✓ workflows 0.1.1 ## ✓ parsnip 0.1.1 ✓ yardstick 0.0.6 ## ✓ recipes 0.1.12 ## ── Conflicts ──────────────────────────────────────────────────────────────────────── tidymodels_conflicts() ── ## x scales::discard() masks purrr::discard() ## x dplyr::filter() masks stats::filter() ## x recipes::fixed() masks stringr::fixed() ## x dplyr::lag() masks stats::lag() ## x yardstick::spec() masks readr::spec() ## x recipes::step() masks stats::step() ``` --- class: middle, center, frame # Three topics for today What makes a model? Spend your data budget wisely Feature engineering --- class: title-slide, center, bottom # What makes a model? ## tidymodels --- name: train-love background-image: url(images/train.jpg) background-size: contain background-color: #f6f6f6 class: bottom Modeling in R has heterogeneous practices around model interfaces, fitting, and execution. --- class: middle, center, frame # parsnip <iframe src="https://parsnip.tidymodels.org" width="100%" height="400px"></iframe> --- class: middle, frame # .center[To specify a model in tidymodels] .right-column[ 1\. Pick a .display[model] 2\. Set the .display[mode] (if needed) 3\. Set the .display[engine] ] --- class: middle, frame .fade[ # .center[To specify a model in tidymodels] ] .right-column[ 1\. Pick a .display[model] .fade[ 2\. Set the .display[mode] (if needed) 3\. Set the .display[engine] ] ] --- class: middle, center, frame # 1\. 
Pick a .display[model] All available models are listed at <https://tidymodels.org/find/parsnip> <iframe src="https://tidymodels.org/find/parsnip" width="100%" height="400px"></iframe> --- class: middle .center[ # `linear_reg()` Specifies a linear regression model ] ```r linear_reg(penalty = NULL, mixture = NULL) ``` --- class: middle .center[ # `decision_tree()` Specifies a decision tree model ] ```r decision_tree(cost_complexity = NULL, tree_depth = NULL, min_n = NULL) ``` --- class: middle .center[ # `rand_forest()` Specifies a random forest model ] ```r rand_forest(mtry = NULL, trees = NULL, min_n = NULL) ``` --- class: middle, frame .fade[ # .center[To specify a model in tidymodels] ] .right-column[ .fade[ 1\. Pick a .display[model] ] 2\. Set the .display[mode] (if needed) .fade[ 3\. Set the .display[engine] ] ] --- class: middle # `set_mode()` Some models can solve multiple types of problems ```r linear_reg() %>% set_mode(mode = "regression") ## Linear Regression Model Specification (regression) ``` --- class: middle # `set_mode()` Some models can solve multiple types of problems ```r logistic_reg() %>% set_mode(mode = "classification") ## Logistic Regression Model Specification (classification) ``` --- class: middle # `set_mode()` Some models can solve multiple types of problems ```r decision_tree() %>% set_mode(mode = "classification") ## Decision Tree Model Specification (classification) ``` --- class: middle # `set_mode()` Some models can solve multiple types of problems ```r decision_tree() %>% set_mode(mode = "regression") ## Decision Tree Model Specification (regression) ``` --- class: middle, frame .fade[ # .center[To specify a model in tidymodels] ] .right-column[ .fade[ 1\. Pick a .display[model] 2\. Set the .display[mode] (if needed) ] 3\. 
Set the .display[engine] ] --- class: middle # `set_engine()` The same model type can be implemented by multiple computational engines ```r rand_forest() %>% set_engine("randomForest") ## Random Forest Model Specification (unknown) ## ## Computational engine: randomForest ``` --- class: middle # `set_engine()` The same model type can be implemented by multiple computational engines ```r rand_forest() %>% set_engine("ranger") ## Random Forest Model Specification (unknown) ## ## Computational engine: ranger ``` --- class: middle # `set_engine()` The same model type can be implemented by multiple computational engines ```r linear_reg() %>% set_engine("lm") ## Linear Regression Model Specification (regression) ## ## Computational engine: lm ``` --- class: middle # `set_engine()` The same model type can be implemented by multiple computational engines ```r linear_reg() %>% set_engine("spark") ## Linear Regression Model Specification (regression) ## ## Computational engine: spark ``` --- class: middle, frame # .center[What makes a model?] 
```r nearest_neighbor() %>% set_engine("kknn") %>% set_mode("regression") ## K-Nearest Neighbor Model Specification (regression) ## ## Computational engine: kknn ``` --- class: middle, frame # .center[Harmonize heterogeneous interfaces] |**parsnip** |**xgboost** |**C5.0** |**spark** | |:--------------|:--------------------|:------------|:-----------------------------------| |tree_depth |max_depth (6) |NA |max_depth (5) | |trees |nrounds (15) |trials (15) |max_iter (20) | |learn_rate |eta (0.3) |NA |step_size (0.1) | |mtry |colsample_bytree (1) |NA |feature_subset_strategy (1 or 5) | |min_n |min_child_weight (1) |minCases (2) |min_instances_per_node (1) | |loss_reduction |gamma (0) |NA |min_info_gain (0) | |sample_size |subsample (1) |sample (0) |subsampling_rate (1) | |stop_iter |early_stop |NA |NA | .footnote[https://parsnip.tidymodels.org/reference/boost_tree.html] --- class: title-slide, center, bottom # Spending your data budget ## tidymodels --- class: middle, center, frame # rsample <iframe src="https://rsample.tidymodels.org" width="100%" height="400px"></iframe> --- class: middle, center, frame # Data splitting <img src="index_files/figure-html/all-split-1.png" width="864" /> --- # `initial_split()` Splits data randomly into a single testing and a single training set ```r ames_split <- initial_split(ames, prop = 0.75) ames_split ## <Analysis/Assess/Total> ## <2198/732/2930> ``` --- # `training()` and `testing()` Create training and testing sets from an `rsplit` ```r ames_train <- training(ames_split) ames_train ## # A tibble: 2,198 x 81 ## MS_SubClass MS_Zoning Lot_Frontage Lot_Area Street Alley ## <fct> <fct> <dbl> <int> <fct> <fct> ## 1 One_Story_… Resident… 141 31770 Pave No_A… ## 2 One_Story_… Resident… 80 11622 Pave No_A… ## 3 One_Story_… Resident… 81 14267 Pave No_A… ## 4 One_Story_… Resident… 93 11160 Pave No_A… ## 5 Two_Story_… Resident… 74 13830 Pave No_A… ## 6 Two_Story_… Resident… 78 9978 Pave No_A… ## 7 One_Story_… Resident… 41 4920 Pave No_A… 
## 8 One_Story_… Resident… 43 5005 Pave No_A… ## 9 One_Story_… Resident… 39 5389 Pave No_A… ## 10 Two_Story_… Resident… 60 7500 Pave No_A… ## # … with 2,188 more rows, and 75 more variables: ## # Lot_Shape <fct>, Land_Contour <fct>, Utilities <fct>, ## # Lot_Config <fct>, Land_Slope <fct>, … ``` --- # `training()` and `testing()` Create training and testing sets from an `rsplit` ```r ames_test <- testing(ames_split) ames_test ## # A tibble: 732 x 81 ## MS_SubClass MS_Zoning Lot_Frontage Lot_Area Street Alley ## <fct> <fct> <dbl> <int> <fct> <fct> ## 1 Two_Story_… Resident… 75 10000 Pave No_A… ## 2 Two_Story_… Resident… 63 8402 Pave No_A… ## 3 One_Story_… Resident… 0 6820 Pave No_A… ## 4 One_Story_… Resident… 88 11394 Pave No_A… ## 5 One_Story_… Resident… 0 12537 Pave No_A… ## 6 One_Story_… Resident… 70 10500 Pave No_A… ## 7 One_Story_… Resident… 26 5858 Pave No_A… ## 8 Two_Story_… Resident… 21 1680 Pave No_A… ## 9 One_Story_… Resident… 98 11478 Pave No_A… ## 10 One_Story_… Resident… 95 12182 Pave No_A… ## # … with 722 more rows, and 75 more variables: ## # Lot_Shape <fct>, Land_Contour <fct>, Utilities <fct>, ## # Lot_Config <fct>, Land_Slope <fct>, … ``` --- background-image: url(images/diamonds.jpg) background-size: contain background-position: left class: middle, center background-color: #f5f5f5 .pull-right[ ## The .display[testing set] is precious ## We can only use it once! ] --- template: clouds ## How can we use the training set to compare, evaluate, and tune models? 
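---

class: middle

The slides that follow answer this with resampling. As a minimal sketch (not from the original deck; `lm_spec` and the two-predictor formula are illustrative choices), a model specification can be fit to each resample with `fit_resamples()` from tune and summarized with `collect_metrics()`:

```r
library(tidymodels)
data(ames, package = "modeldata")

set.seed(123)
ames_split <- initial_split(ames, prop = 0.75)
ames_train <- training(ames_split)

# An illustrative linear regression specification
lm_spec <- linear_reg() %>%
  set_engine("lm") %>%
  set_mode("regression")

# Resample the training set only; the test set stays untouched
set.seed(234)
ames_folds <- vfold_cv(ames_train, strata = Sale_Price)

# Fit on each fold's analysis set, assess on its assessment set
lm_res <- fit_resamples(
  lm_spec,
  Sale_Price ~ Gr_Liv_Area + Year_Built,
  resamples = ames_folds
)

collect_metrics(lm_res)  # mean RMSE and R-squared across folds
```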
--- background-image: url(https://www.tidymodels.org/start/resampling/img/resampling.svg) background-size: 60% --- ```r set.seed(123) vfold_cv(ames_train, strata = Sale_Price) ## # 10-fold cross-validation using stratification ## # A tibble: 10 x 2 ## splits id ## <list> <chr> ## 1 <split [2K/221]> Fold01 ## 2 <split [2K/220]> Fold02 ## 3 <split [2K/220]> Fold03 ## 4 <split [2K/220]> Fold04 ## 5 <split [2K/220]> Fold05 ## 6 <split [2K/220]> Fold06 ## 7 <split [2K/220]> Fold07 ## 8 <split [2K/220]> Fold08 ## 9 <split [2K/219]> Fold09 ## 10 <split [2K/218]> Fold10 ``` --- class: middle, center, inverse # Cross-validation --- background-image: url(images/cross-validation/Slide2.png) background-size: contain --- background-image: url(images/cross-validation/Slide3.png) background-size: contain --- background-image: url(images/cross-validation/Slide4.png) background-size: contain --- background-image: url(images/cross-validation/Slide5.png) background-size: contain --- background-image: url(images/cross-validation/Slide6.png) background-size: contain --- background-image: url(images/cross-validation/Slide7.png) background-size: contain --- background-image: url(images/cross-validation/Slide8.png) background-size: contain --- background-image: url(images/cross-validation/Slide9.png) background-size: contain --- background-image: url(images/cross-validation/Slide10.png) background-size: contain --- background-image: url(images/cross-validation/Slide11.png) background-size: contain --- ```r set.seed(123) vfold_cv(ames_train, strata = Sale_Price) ## # 10-fold cross-validation using stratification ## # A tibble: 10 x 2 ## splits id ## <list> <chr> ## 1 <split [2K/221]> Fold01 ## 2 <split [2K/220]> Fold02 ## 3 <split [2K/220]> Fold03 ## 4 <split [2K/220]> Fold04 ## 5 <split [2K/220]> Fold05 ## 6 <split [2K/220]> Fold06 ## 7 <split [2K/220]> Fold07 ## 8 <split [2K/220]> Fold08 ## 9 <split [2K/219]> Fold09 ## 10 <split [2K/218]> Fold10 ``` --- class: middle, center .center[ # 
Resampling methods .display[Spend your data wisely] to create simulated validation sets ] ```r vfold_cv() loo_cv() mc_cv() bootstraps() ``` --- class: title-slide, center, bottom # Feature engineering ## tidymodels --- background-image: url(images/two-birds2-alpha.png) background-size: contain background-position: left class: middle, center background-color: #f5f5f5 .pull-right[ # Let's go back to the beginning ] --- class: middle, center, frame # recipes <iframe src="https://recipes.tidymodels.org" width="100%" height="400px"></iframe> --- background-image: url(images/garbage.jpg) background-size: contain background-position: right class: middle, center background-color: #f5f5f5 .pull-left[ # Build better predictors ] --- class: middle, center, frame # To build a recipe 1\. Start the `recipe()` 2\. Define the .display[variables] involved 3\. Describe preprocessing .display[step-by-step] --- class: middle, center # `recipe()` Creates a recipe for a set of variables ```r recipe(Sale_Price ~ ., data = ames) ``` --- class: middle, center # .center[`step_*()`] Complete list at <https://recipes.tidymodels.org/reference/> <iframe src="https://recipes.tidymodels.org/reference/" width="100%" height="400px"></iframe> --- background-image: url(images/cranes.jpg) background-position: left background-size: contain class: middle .right-column[ # Preprocessing options + Encode categorical predictors + Center and scale variables + Handle class imbalance + Impute missing data + Perform dimensionality reduction + *A lot more!* ] --- Estimate parameters for preprocessing using the .display[training data] ```r pca_rec <- recipe(Sale_Price ~ ., data = ames_train) %>% step_novel(all_nominal()) %>% step_dummy(all_nominal()) %>% step_zv(all_predictors()) %>% step_normalize(all_predictors()) %>% step_pca(all_predictors(), num_comp = 5) ``` --- ```r prep(pca_rec) ## Data Recipe ## ## Inputs: ## ## role #variables ## outcome 1 ## predictor 80 ## ## Training data contained 2198 data points
and no missing data. ## ## Operations: ## ## Novel factor level assignment for MS_SubClass, MS_Zoning, Street, Alley, Lot_Shape, Land_Contour, ... [trained] ## Dummy variables from MS_SubClass, MS_Zoning, Street, Alley, Lot_Shape, Land_Contour, Utilities, ... [trained] ## Zero variance filter removed MS_SubClass_new, MS_Zoning_new, Street_new, Alley_new, ... [trained] ## Centering and scaling for Lot_Frontage, Lot_Area, Year_Built, Year_Remod_Add, Mas_Vnr_Area, ... [trained] ## PCA extraction with Lot_Frontage, Lot_Area, Year_Built, Year_Remod_Add, Mas_Vnr_Area, ... [trained] ``` --- background-image: url(images/workflows/workflows.008.jpeg) background-size: contain --- template: clouds ## .big-text[More to learn!] --- template: clouds2 class: middle, right ## Thanks! [
tidymodels.org](https://tidymodels.org) [
supervised-ml-course.netlify.app](https://supervised-ml-course.netlify.app) [
www.feat.engineering](http://www.feat.engineering/) [
@tidymodels](https://github.com/tidymodels) [
@juliasilge](https://twitter.com/juliasilge) [
youtube.com/juliasilge](https://youtube.com/juliasilge)