Machine learning with tidymodels
⚠️ DANGERS OF OVERFITTING ⚠️
tree_fit %>%
augment(ring_train)
#> # A tibble: 3,340 × 10
#> sex length diameter height whole_wei…¹ shuck…² visce…³ shell…⁴ rings .pred
#> <fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 male 0.35 0.265 0.09 0.226 0.0995 0.0485 0.07 7 8.70
#> 2 infant 0.33 0.255 0.08 0.205 0.0895 0.0395 0.055 7 7.04
#> 3 infant 0.355 0.28 0.085 0.290 0.095 0.0395 0.115 7 8.41
#> 4 male 0.365 0.295 0.08 0.256 0.097 0.043 0.1 7 8.70
#> 5 male 0.465 0.355 0.105 0.480 0.227 0.124 0.125 8 8.92
#> 6 female 0.45 0.355 0.105 0.522 0.237 0.116 0.145 8 8.92
#> 7 infant 0.24 0.175 0.045 0.07 0.0315 0.0235 0.02 5 4.5
#> 8 infant 0.205 0.15 0.055 0.042 0.0255 0.015 0.012 5 4.5
#> 9 infant 0.21 0.15 0.05 0.042 0.0175 0.0125 0.015 4 4.5
#> 10 infant 0.39 0.295 0.095 0.203 0.0875 0.045 0.075 7 7.04
#> # … with 3,330 more rows, and abbreviated variable names ¹whole_weight,
#> # ²shucked_weight, ³viscera_weight, ⁴shell_weight
#> # ℹ Use `print(n = ...)` to see more rows
We call this “resubstitution” or “repredicting the training set”; the resulting performance measure is a “resubstitution estimate”
⚠️ Remember that we’re demonstrating overfitting
⚠️ Don’t use the test set until the end of your modeling analysis
Use augment() to compute a regression metric like mae().
Compute the metrics for both training and testing data.
Notice the evidence of overfitting! ⚠️
05:00
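One possible answer, as a sketch (this assumes the test set from the initial split is named ring_test):
# resubstitution estimate: repredicting the training set (overly optimistic)
augment(tree_fit, new_data = ring_train) %>%
  mae(truth = rings, estimate = .pred)
# test set estimate
augment(tree_fit, new_data = ring_test) %>%
  mae(truth = rings, estimate = .pred)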
What if we want to compare more models?
And/or more model configurations?
And we want to understand if these are important differences?
If we use 10 folds, what percent of the training data is used to fit the model in each fold, and what percent is held out for assessment?
03:00
vfold_cv(ring_train) # v = 10 is default
#> # 10-fold cross-validation
#> # A tibble: 10 × 2
#> splits id
#> <list> <chr>
#> 1 <split [3006/334]> Fold01
#> 2 <split [3006/334]> Fold02
#> 3 <split [3006/334]> Fold03
#> 4 <split [3006/334]> Fold04
#> 5 <split [3006/334]> Fold05
#> 6 <split [3006/334]> Fold06
#> 7 <split [3006/334]> Fold07
#> 8 <split [3006/334]> Fold08
#> 9 <split [3006/334]> Fold09
#> 10 <split [3006/334]> Fold10
What is in this?
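Each element of the splits column is an rsplit object; a sketch of how to peek inside one fold (assuming we save the resamples first):
ring_rs <- vfold_cv(ring_train)
analysis(ring_rs$splits[[1]])    # data used to fit the model for this fold
assessment(ring_rs$splits[[1]])  # held-out data used to measure performance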
vfold_cv(ring_train, strata = rings)
#> # 10-fold cross-validation using stratification
#> # A tibble: 10 × 2
#> splits id
#> <list> <chr>
#> 1 <split [3004/336]> Fold01
#> 2 <split [3005/335]> Fold02
#> 3 <split [3005/335]> Fold03
#> 4 <split [3005/335]> Fold04
#> 5 <split [3005/335]> Fold05
#> 6 <split [3006/334]> Fold06
#> 7 <split [3007/333]> Fold07
#> 8 <split [3007/333]> Fold08
#> 9 <split [3008/332]> Fold09
#> 10 <split [3008/332]> Fold10
Stratification often helps, with very little downside
We’ll use this setup:
set.seed(234)
ring_folds <- vfold_cv(ring_train, v = 5, strata = rings)
ring_folds
#> # 5-fold cross-validation using stratification
#> # A tibble: 5 × 2
#> splits id
#> <list> <chr>
#> 1 <split [2670/670]> Fold1
#> 2 <split [2672/668]> Fold2
#> 3 <split [2672/668]> Fold3
#> 4 <split [2673/667]> Fold4
#> 5 <split [2673/667]> Fold5
Set the seed when creating resamples
tree_res <- fit_resamples(tree_wflow, ring_folds)
tree_res
#> # Resampling results
#> # 5-fold cross-validation using stratification
#> # A tibble: 5 × 4
#> splits id .metrics .notes
#> <list> <chr> <list> <list>
#> 1 <split [2670/670]> Fold1 <tibble [2 × 4]> <tibble [0 × 3]>
#> 2 <split [2672/668]> Fold2 <tibble [2 × 4]> <tibble [0 × 3]>
#> 3 <split [2672/668]> Fold3 <tibble [2 × 4]> <tibble [0 × 3]>
#> 4 <split [2673/667]> Fold4 <tibble [2 × 4]> <tibble [0 × 3]>
#> 5 <split [2673/667]> Fold5 <tibble [2 × 4]> <tibble [0 × 3]>
We can reliably measure performance using only the training data 🎉
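To see the performance estimates aggregated across the folds, a sketch (output omitted):
collect_metrics(tree_res)                     # metrics averaged over the 5 folds
collect_metrics(tree_res, summarize = FALSE)  # one row per fold and metric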
How do the metrics from resampling compare to the metrics from training and testing?
Recall the RMSE values computed earlier on the training and test sets.
Remember that:
⚠️ the training set gives you overly optimistic metrics
⚠️ the test set is precious
# save the assessment set results
ctrl_abalone <- control_resamples(save_pred = TRUE)
tree_res <- fit_resamples(tree_wflow, ring_folds, control = ctrl_abalone)
tree_preds <- collect_predictions(tree_res)
tree_preds
#> # A tibble: 3,340 × 5
#> id .pred .row rings .config
#> <chr> <dbl> <int> <dbl> <chr>
#> 1 Fold1 7.79 1 7 Preprocessor1_Model1
#> 2 Fold1 8.39 3 7 Preprocessor1_Model1
#> 3 Fold1 7.06 10 7 Preprocessor1_Model1
#> 4 Fold1 9.92 23 7 Preprocessor1_Model1
#> 5 Fold1 9.93 24 8 Preprocessor1_Model1
#> 6 Fold1 7.06 25 7 Preprocessor1_Model1
#> 7 Fold1 7.06 30 8 Preprocessor1_Model1
#> 8 Fold1 8.74 34 8 Preprocessor1_Model1
#> 9 Fold1 9.47 38 8 Preprocessor1_Model1
#> 10 Fold1 4.36 39 5 Preprocessor1_Model1
#> # … with 3,330 more rows
#> # ℹ Use `print(n = ...)` to see more rows
tree_res
#> # Resampling results
#> # 5-fold cross-validation using stratification
#> # A tibble: 5 × 5
#> splits id .metrics .notes .predictions
#> <list> <chr> <list> <list> <list>
#> 1 <split [2670/670]> Fold1 <tibble [2 × 4]> <tibble [0 × 3]> <tibble [670 × 4]>
#> 2 <split [2672/668]> Fold2 <tibble [2 × 4]> <tibble [0 × 3]> <tibble [668 × 4]>
#> 3 <split [2672/668]> Fold3 <tibble [2 × 4]> <tibble [0 × 3]> <tibble [668 × 4]>
#> 4 <split [2673/667]> Fold4 <tibble [2 × 4]> <tibble [0 × 3]> <tibble [667 × 4]>
#> 5 <split [2673/667]> Fold5 <tibble [2 × 4]> <tibble [0 × 3]> <tibble [667 × 4]>
Where are the fitted models??!?? 🗑️
By default, the fitted models are discarded after the metrics and predictions are computed. For more advanced use cases, you can extract and save them.
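A sketch of one way to keep them, using the extract argument of control_resamples() (the supplied function is applied to each fitted workflow):
ctrl_keep <- control_resamples(extract = function(fitted_wflow) fitted_wflow)
tree_res_keep <- fit_resamples(tree_wflow, ring_folds, control = ctrl_keep)
tree_res_keep$.extracts  # list column; each element holds what the extract function returned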
Resampling can involve fitting a lot of models!
These models don’t depend on one another and can be run in parallel
We can use a parallel backend to do this:
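One possible setup, as a sketch (assumes the doParallel package is installed):
library(doParallel)
cl <- parallel::makePSOCKcluster(parallel::detectCores(logical = FALSE))
registerDoParallel(cl)      # tune uses the registered backend when resampling
# ... run fit_resamples() and friends as usual ...
parallel::stopCluster(cl)   # shut the workers down when finished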
set.seed(123)
bootstraps(ring_train)
#> # Bootstrap sampling
#> # A tibble: 25 × 2
#> splits id
#> <list> <chr>
#> 1 <split [3340/1213]> Bootstrap01
#> 2 <split [3340/1230]> Bootstrap02
#> 3 <split [3340/1233]> Bootstrap03
#> 4 <split [3340/1232]> Bootstrap04
#> 5 <split [3340/1207]> Bootstrap05
#> 6 <split [3340/1219]> Bootstrap06
#> 7 <split [3340/1242]> Bootstrap07
#> 8 <split [3340/1218]> Bootstrap08
#> 9 <split [3340/1234]> Bootstrap09
#> 10 <split [3340/1237]> Bootstrap10
#> # … with 15 more rows
#> # ℹ Use `print(n = ...)` to see more rows
Create bootstraps of the training data, changing times from the default.
Don’t forget to set a seed when you resample!
05:00
set.seed(322)
bootstraps(ring_train, times = 10)
#> # Bootstrap sampling
#> # A tibble: 10 × 2
#> splits id
#> <list> <chr>
#> 1 <split [3340/1217]> Bootstrap01
#> 2 <split [3340/1230]> Bootstrap02
#> 3 <split [3340/1269]> Bootstrap03
#> 4 <split [3340/1221]> Bootstrap04
#> 5 <split [3340/1254]> Bootstrap05
#> 6 <split [3340/1224]> Bootstrap06
#> 7 <split [3340/1200]> Bootstrap07
#> 8 <split [3340/1200]> Bootstrap08
#> 9 <split [3340/1224]> Bootstrap09
#> 10 <split [3340/1220]> Bootstrap10
A validation set is just another type of resample
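For example, a sketch using rsample’s validation_split() (a single analysis/assessment split of the training data; the proportion here is an arbitrary choice):
set.seed(853)
ring_val <- validation_split(ring_train, prop = 0.8)
ring_val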
Ensemble many decision tree models
All the trees vote! 🗳️
Bootstrap aggregating + random predictor sampling
Random forest often works well without tuning hyperparameters (more on this later!), as long as there are enough trees
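The model specification isn’t shown above; a sketch consistent with the printed workflow below (1000 trees, regression mode, ranger as the default engine):
rf_spec <- rand_forest(trees = 1000) %>%
  set_mode("regression")
rf_spec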
rf_wflow <- workflow(rings ~ ., rf_spec)
rf_wflow
#> ══ Workflow ══════════════════════════════════════════════════════════
#> Preprocessor: Formula
#> Model: rand_forest()
#>
#> ── Preprocessor ──────────────────────────────────────────────────────
#> rings ~ .
#>
#> ── Model ─────────────────────────────────────────────────────────────
#> Random Forest Model Specification (regression)
#>
#> Main Arguments:
#> trees = 1000
#>
#> Computational engine: ranger
Use fit_resamples() and rf_wflow to keep the predictions and compute the resampling metrics.
08:00
ctrl_abalone <- control_resamples(save_pred = TRUE)
# random forest uses random numbers so set the seed first
set.seed(2)
rf_res <- fit_resamples(rf_wflow, ring_folds, control = ctrl_abalone)
collect_metrics(rf_res)
#> # A tibble: 2 × 6
#> .metric .estimator mean n std_err .config
#> <chr> <chr> <dbl> <int> <dbl> <chr>
#> 1 rmse standard 2.17 5 0.0622 Preprocessor1_Model1
#> 2 rsq standard 0.548 5 0.0153 Preprocessor1_Model1
workflow_set(list(rings ~ .), list(tree_spec, rf_spec)) %>%
workflow_map("fit_resamples", resamples = ring_folds)
#> # A workflow set/tibble: 2 × 4
#> wflow_id info option result
#> <chr> <list> <list> <list>
#> 1 formula_decision_tree <tibble [1 × 4]> <opts[1]> <rsmp[+]>
#> 2 formula_rand_forest <tibble [1 × 4]> <opts[1]> <rsmp[+]>
workflow_set(list(rings ~ .), list(tree_spec, rf_spec)) %>%
workflow_map("fit_resamples", resamples = ring_folds) %>%
rank_results()
#> # A tibble: 4 × 9
#> wflow_id .config .metric mean std_err n prepr…¹ model rank
#> <chr> <chr> <chr> <dbl> <dbl> <int> <chr> <chr> <int>
#> 1 formula_rand_forest Preproc… rmse 2.17 0.0642 5 formula rand… 1
#> 2 formula_rand_forest Preproc… rsq 0.546 0.0158 5 formula rand… 1
#> 3 formula_decision_tree Preproc… rmse 2.43 0.0582 5 formula deci… 2
#> 4 formula_decision_tree Preproc… rsq 0.452 0.0158 5 formula deci… 2
#> # … with abbreviated variable name ¹preprocessor
Change the metric used for ranking with the rank_metric argument.
Lots more is available with workflow sets, like collect_metrics(), autoplot() methods, and more!
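For example, a sketch (plot and output not shown; the results are saved first):
wf_set_res <-
  workflow_set(list(rings ~ .), list(tree_spec, rf_spec)) %>%
  workflow_map("fit_resamples", resamples = ring_folds)
collect_metrics(wf_set_res)  # metrics for every workflow in the set
autoplot(wf_set_res)         # compare the workflows visually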
When do you think a workflow set would be useful?
03:00
Suppose that we choose to use our random forest model.
Let’s fit the model on the training set and verify our performance using the test set.
We’ve shown you fit() and predict() (+ augment()), but there is a shortcut:
# ring_split has train + test info
final_fit <- last_fit(rf_wflow, ring_split)
final_fit
#> # Resampling results
#> # Manual resampling
#> # A tibble: 1 × 6
#> splits id .metrics .notes .predictions .workflow
#> <list> <chr> <list> <list> <list> <list>
#> 1 <split [3340/837]> train/test split <tibble> <tibble> <tibble> <workflow>
What is in final_fit?
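To see the metrics, a sketch (output omitted):
collect_metrics(final_fit)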
These are metrics computed with the test set
What is in final_fit?
collect_predictions(final_fit)
#> # A tibble: 837 × 5
#> id .pred .row rings .config
#> <chr> <dbl> <int> <dbl> <chr>
#> 1 train/test split 10.5 3 9 Preprocessor1_Model1
#> 2 train/test split 8.46 6 8 Preprocessor1_Model1
#> 3 train/test split 8.78 9 9 Preprocessor1_Model1
#> 4 train/test split 10.9 13 11 Preprocessor1_Model1
#> 5 train/test split 8.32 20 9 Preprocessor1_Model1
#> 6 train/test split 10.5 25 10 Preprocessor1_Model1
#> 7 train/test split 11.0 28 12 Preprocessor1_Model1
#> 8 train/test split 11.3 29 15 Preprocessor1_Model1
#> 9 train/test split 10.7 33 18 Preprocessor1_Model1
#> 10 train/test split 10.5 39 11 Preprocessor1_Model1
#> # … with 827 more rows
#> # ℹ Use `print(n = ...)` to see more rows
These are predictions for the test set
What is in final_fit?
extract_workflow(final_fit)
#> ══ Workflow [trained] ════════════════════════════════════════════════
#> Preprocessor: Formula
#> Model: rand_forest()
#>
#> ── Preprocessor ──────────────────────────────────────────────────────
#> rings ~ .
#>
#> ── Model ─────────────────────────────────────────────────────────────
#> Ranger result
#>
#> Call:
#> ranger::ranger(x = maybe_data_frame(x), y = y, num.trees = ~1000, num.threads = 1, verbose = FALSE, seed = sample.int(10^5, 1))
#>
#> Type: Regression
#> Number of trees: 1000
#> Sample size: 3340
#> Number of independent variables: 8
#> Mtry: 2
#> Target node size: 5
#> Variable importance mode: none
#> Splitrule: variance
#> OOB prediction error (MSE): 4.681506
#> R squared (OOB): 0.5492882
Use this fitted workflow for prediction on new data, like when deploying the model
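For example, a sketch (new_abalone is a hypothetical data frame of new observations):
fitted_wflow <- extract_workflow(final_fit)
predict(fitted_wflow, new_data = new_abalone)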
Working with a classification model?
Classification metrics are different, and may be more complicated
Different classification metrics are appropriate depending on your use case
Before lunch discussion!
Which model do you think you would decide to use?
What surprised you the most?
What is one thing you are looking forward to next?
05:00