59.271-155.699.001-214.959z"></path></svg>](

---
class: inverse, middle, center

# Machine learning with tidymodels

---
class: top, center
background-image: url(images/intro.002.jpeg)
background-size: cover

---
class: top, center
background-image: url(images/intro.003.jpeg)
background-size: cover

---
class: top, center
background-image: url(images/all-of-ml.jpg)
background-size: contain

.footnote[Credit: <>]

---
background-image: url(images/tm-org.png)
background-size: contain

---

```r
library(tidymodels)
## ── Attaching packages ────────────────────────── tidymodels 0.1.4 ──
## ✓ broom        0.7.12     ✓ rsample      0.1.1
## ✓ dials        0.1.0      ✓ tune         0.1.6
## ✓ infer        1.0.0      ✓ workflows    0.2.4
## ✓ modeldata    0.1.1      ✓ workflowsets 0.1.0
## ✓ parsnip      0.1.7      ✓ yardstick    0.0.9
## ✓ recipes      0.1.17
## ── Conflicts ───────────────────────────── tidymodels_conflicts() ──
## x scales::discard() masks purrr::discard()
## x dplyr::filter()   masks stats::filter()
## x recipes::fixed()  masks stringr::fixed()
## x dplyr::lag()      masks stats::lag()
## x yardstick::spec() masks readr::spec()
## x recipes::step()   masks stats::step()
## • Search for functions across packages at
```

---
class: middle, center, frame

# Three topics for today

What makes a model?

Spend your data budget wisely

Feature engineering

---
class: title-slide, center, bottom

# What makes a model?

## tidymodels

---
name: train-love
background-image: url(images/train.jpg)
background-size: contain
background-color: #f6f6f6
class: bottom

Modeling in R has heterogeneous practices around model interfaces, fitting, and execution.

---
class: middle, center, frame

# parsnip

<iframe src="" width="100%" height="400px" data-external="1"></iframe>

---
class: middle, frame

# .center[To specify a model in tidymodels]

.right-column[

1\. Pick a .display[model]

2\. Set the .display[mode] (if needed)

3\. Set the .display[engine]

]

---
class: middle, frame

.fade[
# .center[To specify a model in tidymodels]
]

.right-column[

1\.
Pick a .display[model]

.fade[
2\. Set the .display[mode] (if needed)

3\. Set the .display[engine]
]

]

---
class: middle, center, frame

# 1\. Pick a .display[model]

All available models are listed at <>

<iframe src="" width="100%" height="400px" data-external="1"></iframe>

---
class: middle

.center[
# `linear_reg()`

Specifies a linear regression model
]

```r
linear_reg(penalty = NULL, mixture = NULL)
```

---
class: middle

.center[
# `decision_tree()`

Specifies a decision tree model
]

```r
decision_tree(cost_complexity = NULL, tree_depth = NULL, min_n = NULL)
```

---
class: middle

.center[
# `rand_forest()`

Specifies a random forest model
]

```r
rand_forest(mtry = NULL, trees = NULL, min_n = NULL)
```

---
class: middle, frame

.fade[
# .center[To specify a model in tidymodels]
]

.right-column[

.fade[
1\. Pick a .display[model]
]

2\. Set the .display[mode] (if needed)

.fade[
3\. Set the .display[engine]
]

]

---
class: middle

# `set_mode()`

Some models can solve multiple types of problems

```r
linear_reg() %>% set_mode(mode = "regression")
## Linear Regression Model Specification (regression)
## 
## Computational engine: lm
```

---
class: middle

# `set_mode()`

Some models can solve multiple types of problems

```r
logistic_reg() %>% set_mode(mode = "classification")
## Logistic Regression Model Specification (classification)
## 
## Computational engine: glm
```

---
class: middle

# `set_mode()`

Some models can solve multiple types of problems

```r
decision_tree() %>% set_mode(mode = "classification")
## Decision Tree Model Specification (classification)
## 
## Computational engine: rpart
```

---
class: middle

# `set_mode()`

Some models can solve multiple types of problems

```r
decision_tree() %>% set_mode(mode = "regression")
## Decision Tree Model Specification (regression)
## 
## Computational engine: rpart
```

---
class: middle, frame

.fade[
# .center[To specify a model in tidymodels]
]

.right-column[

.fade[
1\. Pick a .display[model]

2\.
Set the .display[mode] (if needed)
]

3\. Set the .display[engine]

]

---
class: middle

# `set_engine()`

The same model type can be implemented by multiple computational engines

```r
rand_forest() %>% set_engine("randomForest")
## Random Forest Model Specification (unknown)
## 
## Computational engine: randomForest
```

---
class: middle

# `set_engine()`

The same model type can be implemented by multiple computational engines

```r
rand_forest() %>% set_engine("ranger")
## Random Forest Model Specification (unknown)
## 
## Computational engine: ranger
```

---
class: middle

# `set_engine()`

The same model type can be implemented by multiple computational engines

```r
linear_reg() %>% set_engine("lm")
## Linear Regression Model Specification (regression)
## 
## Computational engine: lm
```

---
class: middle

# `set_engine()`

The same model type can be implemented by multiple computational engines

```r
linear_reg() %>% set_engine("spark")
## Linear Regression Model Specification (regression)
## 
## Computational engine: spark
```

---
class: middle, frame

# .center[What makes a model?]
```r
nearest_neighbor() %>%
  set_engine("kknn") %>%
  set_mode("regression")
## K-Nearest Neighbor Model Specification (regression)
## 
## Computational engine: kknn
```

---
class: middle, frame

# .center[Harmonize heterogeneous interfaces]

|**parsnip**    |**xgboost**          |**C5.0**     |**spark**                           |
|:--------------|:--------------------|:------------|:-----------------------------------|
|tree_depth     |max_depth (6)        |NA           |max_depth (5)                       |
|trees          |nrounds (15)         |trials (15)  |max_iter (20)                       |
|learn_rate     |eta (0.3)            |NA           |step_size (0.1)                     |
|mtry           |colsample_bytree (1) |NA           |feature_subset_strategy (1 or 5)    |
|min_n          |min_child_weight (1) |minCases (2) |min_instances_per_node (1)          |
|loss_reduction |gamma (0)            |NA           |min_info_gain (0)                   |
|sample_size    |subsample (1)        |sample (0)   |subsampling_rate (1)                |
|stop_iter      |early_stop           |NA           |NA                                  |

---
class: title-slide, center, bottom

# Spending your data budget

## tidymodels

---
class: middle, center, frame

# rsample

<iframe src="" width="100%" height="400px" data-external="1"></iframe>

---
class: middle, center, frame

# Data splitting

<img src="index_files/figure-html/all-split-1.png" width="864" />

---

# `initial_split()`

Splits data randomly into a single testing and a single training set

```r
ames_split <- initial_split(ames, prop = 0.75)
ames_split
## <Analysis/Assess/Total>
## <2197/733/2930>
```

---

# `training()` and `testing()`

Create training and testing sets from an `rsplit`

```r
ames_train <- training(ames_split)
ames_train
## # A tibble: 2,197 × 81
##   MS_SubClass   MS_Zoning Lot_Frontage Lot_Area Street Alley
##   <fct>         <fct>            <dbl>    <int> <fct>  <fct>
## 1 One_Story_19… Resident…           80    11600 Pave   No_A…
## 2 One_Story_19… Resident…            0     9500 Pave   No_A…
## 3 One_and_Half… Resident…           60     8400 Pave   No_A…
## 4 One_Story_19… Resident…           75     9000 Pave   No_A…
## 5 Two_Story_19… Resident…           80    10791 Pave   No_A…
## # … with 2,192 more rows, and 75 more variables:
## #   Lot_Shape <fct>, Land_Contour <fct>, Utilities <fct>,
## #   Lot_Config <fct>, Land_Slope <fct>, …
```

---

# `training()` and `testing()`

Create training and testing sets from an `rsplit`

```r
ames_test <- testing(ames_split)
ames_test
## # A tibble: 733 × 81
##   MS_SubClass   MS_Zoning Lot_Frontage Lot_Area Street Alley
##   <fct>         <fct>            <dbl>    <int> <fct>  <fct>
## 1 One_Story_19… Resident…           80    11622 Pave   No_A…
## 2 One_Story_19… Resident…           81    14267 Pave   No_A…
## 3 Two_Story_19… Resident…           74    13830 Pave   No_A…
## 4 One_Story_PU… Resident…           41     4920 Pave   No_A…
## 5 One_Story_PU… Resident…           43     5005 Pave   No_A…
## # … with 728 more rows, and 75 more variables:
## #   Lot_Shape <fct>, Land_Contour <fct>, Utilities <fct>,
## #   Lot_Config <fct>, Land_Slope <fct>, …
```

---
background-image: url(images/diamonds.jpg)
background-size: contain
background-position: left
class: middle, center
background-color: #f5f5f5

.pull-right[
## The .display[testing set] is precious

## We can only use it once!
]

---
template: clouds

## How can we use the training set to compare, evaluate, and tune models?

---
background-image: url(
background-size: 60%

---

```r
set.seed(123)
vfold_cv(ames_train, strata = Sale_Price)
## #  10-fold cross-validation using stratification
## # A tibble: 10 × 2
##    splits             id
##    <list>             <chr>
##  1 <split [1975/222]> Fold01
##  2 <split [1976/221]> Fold02
##  3 <split [1976/221]> Fold03
##  4 <split [1977/220]> Fold04
##  5 <split [1977/220]> Fold05
##  6 <split [1978/219]> Fold06
##  7 <split [1978/219]> Fold07
##  8 <split [1978/219]> Fold08
##  9 <split [1979/218]> Fold09
## 10 <split [1979/218]> Fold10
```

---
class: middle, center, inverse

# Cross-validation

---
background-image: url(images/cross-validation/Slide2.png)
background-size: contain

---
background-image: url(images/cross-validation/Slide3.png)
background-size: contain

---
background-image: url(images/cross-validation/Slide4.png)
background-size: contain

---
background-image: url(images/cross-validation/Slide5.png)
background-size: contain

---
background-image: url(images/cross-validation/Slide6.png)
background-size: contain

---
background-image: url(images/cross-validation/Slide7.png)
background-size: contain

---
background-image: url(images/cross-validation/Slide8.png)
background-size: contain

---
background-image: url(images/cross-validation/Slide9.png)
background-size: contain

---
background-image: url(images/cross-validation/Slide10.png)
background-size: contain

---
background-image: url(images/cross-validation/Slide11.png)
background-size: contain

---

```r
set.seed(123)
vfold_cv(ames_train, strata = Sale_Price)
## #  10-fold cross-validation using stratification
## # A tibble: 10 × 2
##    splits             id
##    <list>             <chr>
##  1 <split [1975/222]> Fold01
##  2 <split [1976/221]> Fold02
##  3 <split [1976/221]> Fold03
##  4 <split [1977/220]> Fold04
##  5 <split [1977/220]> Fold05
##  6 <split [1978/219]> Fold06
##  7 <split [1978/219]> Fold07
##  8 <split [1978/219]> Fold08
##  9 <split [1979/218]> Fold09
## 10 <split [1979/218]> Fold10
```

---
class: middle, center

.center[
# Resampling methods

.display[Spend your data wisely] to create simulated validation sets
]

```r
vfold_cv()
loo_cv()
mc_cv()
bootstraps()
```

---
class: title-slide, center, bottom

# Feature engineering

## tidymodels

---
background-image: url(images/two-birds2-alpha.png)
background-size: contain
background-position: left
class: middle, center
background-color: #f5f5f5

.pull-right[
# Let's go back to the beginning
]

---
class: middle, center, frame

# recipes

<iframe src="" width="100%" height="400px" data-external="1"></iframe>

---
background-image: url(images/garbage.jpg)
background-size: contain
background-position: right
class: middle, center
background-color: #f5f5f5

.pull-left[
# Build better predictors
]

---
class: middle, center, frame

# To build a recipe

1\. Start the `recipe()`

2\. Define the .display[variables] involved

3\.
Describe preprocessing .display[step-by-step]

---
class: middle, center

# `recipe()`

Creates a recipe for a set of variables

```r
recipe(Sale_Price ~ ., data = ames)
```

---
class: middle, center

# .center[`step_*()`]

Complete list at <>

<iframe src="" width="100%" height="400px" data-external="1"></iframe>

---
background-image: url(images/cranes.jpg)
background-position: left
background-size: contain
class: middle

.right-column[
# Preprocessing options

+ Encode categorical predictors
+ Center and scale variables
+ Handle class imbalance
+ Impute missing data
+ Perform dimensionality reduction
+ *A lot more!*
]

---

Estimate parameters for preprocessing using the .display[training data]

```r
pca_rec <-
  recipe(Sale_Price ~ ., data = ames_train) %>%
  step_novel(all_nominal()) %>%
  step_dummy(all_nominal()) %>%
  step_zv(all_predictors()) %>%
  step_normalize(all_predictors()) %>%
  step_pca(all_predictors(), num_comp = 5)
```

---

```r
prep(pca_rec)
## Recipe
## 
## Inputs:
## 
##       role #variables
##    outcome          1
##  predictor         80
## 
## Training data contained 2197 data points and no missing data.
## 
## Operations:
## 
## Novel factor level assignment for MS_SubClass, MS_Zoning, Street, Alley, Lot_Shape, Land_Contour, Utilities,... [trained]
## Dummy variables from MS_SubClass, MS_Zoning, Street, Alley, Lot_Shape, Land_Contour, Utilities, Lot_Confi... [trained]
## Zero variance filter removed MS_SubClass_new, MS_Zoning_new, Street_new, Alley_new, Lot_Shape_n... [trained]
## Centering and scaling for Lot_Frontage, Lot_Area, Year_Built, Year_Remod_Add, Mas_Vnr_Area, BsmtFin_... [trained]
## PCA extraction with Lot_Frontage, Lot_Area, Year_Built, Year_Remod_Add, Mas_Vnr_Area, BsmtFin_S... [trained]
```

---
background-image: url(images/workflows/workflows.008.jpeg)
background-size: contain

---
template: clouds

## .big-text[More to learn!]

---
template: clouds2
class: middle, right

## Thanks!
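---

# Putting the three topics together

The sketch below is one possible way to combine the pieces from this deck: a parsnip model specification, an rsample split with cross-validation folds, and the PCA recipe. It assumes the `ames` data from modeldata. `workflow()`, `add_recipe()`, `add_model()`, `fit_resamples()`, and `collect_metrics()` come from the workflows and tune packages and are not covered in the slides above. The object names (`lm_spec`, `ames_wf`, `ames_folds`, `cv_results`) are illustrative only.

```r
library(tidymodels)

# Assumption: `ames` is the housing data shipped with modeldata
data(ames, package = "modeldata")

# Spend the data budget: hold out a test set, resample the training set
set.seed(123)
ames_split <- initial_split(ames, prop = 0.75)
ames_train <- training(ames_split)
ames_folds <- vfold_cv(ames_train, strata = Sale_Price)

# What makes a model: pick a model, set the engine, set the mode
lm_spec <- linear_reg() %>%
  set_engine("lm") %>%
  set_mode("regression")

# Feature engineering: the same steps as the PCA recipe in the deck
pca_rec <- recipe(Sale_Price ~ ., data = ames_train) %>%
  step_novel(all_nominal()) %>%
  step_dummy(all_nominal()) %>%
  step_zv(all_predictors()) %>%
  step_normalize(all_predictors()) %>%
  step_pca(all_predictors(), num_comp = 5)

# Bundle recipe and model so the recipe is re-estimated inside each fold
ames_wf <- workflow() %>%
  add_recipe(pca_rec) %>%
  add_model(lm_spec)

# Fit on each analysis set, evaluate on each assessment set
cv_results <- fit_resamples(ames_wf, resamples = ames_folds)
collect_metrics(cv_results)
```

Bundling the recipe with the model keeps the preprocessing parameters estimated from the analysis portion of each fold only, and the testing set stays untouched until a final model has been chosen.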