4 - Evaluating models

Machine learning with tidymodels

Metrics for model performance

augment(tree_fit, new_data = ring_test) %>%
  metrics(rings, .pred)
#> # A tibble: 3 × 3
#>   .metric .estimator .estimate
#>   <chr>   <chr>          <dbl>
#> 1 rmse    standard       2.31 
#> 2 rsq     standard       0.508
#> 3 mae     standard       1.62
  • RMSE: root mean squared error of the differences between the predicted and observed values ⬇️ (smaller is better)
  • R²: squared correlation between the predicted and observed values ⬆️ (larger is better)
  • MAE: mean absolute error; similar to RMSE, but less sensitive to large errors ⬇️ (smaller is better)
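
As a gut check, here is a minimal by-hand sketch of what yardstick computes, reusing the same test-set predictions as above (the preds name is just for illustration):

preds <- augment(tree_fit, new_data = ring_test)

preds %>%
  summarize(
    rmse = sqrt(mean((rings - .pred)^2)),  # root mean squared error
    rsq  = cor(rings, .pred)^2,            # squared correlation
    mae  = mean(abs(rings - .pred))        # mean absolute error
  )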

Metrics for model performance

augment(tree_fit, new_data = ring_test) %>%
  rmse(rings, .pred)
#> # A tibble: 1 × 3
#>   .metric .estimator .estimate
#>   <chr>   <chr>          <dbl>
#> 1 rmse    standard        2.31

Metrics for model performance

augment(tree_fit, new_data = ring_test) %>%
  group_by(sex) %>%
  rmse(rings, .pred)
#> # A tibble: 3 × 4
#>   sex    .metric .estimator .estimate
#>   <fct>  <chr>   <chr>          <dbl>
#> 1 female rmse    standard        2.56
#> 2 infant rmse    standard        1.96
#> 3 male   rmse    standard        2.39

Metrics for model performance

abalone_metrics <- metric_set(rmse, mape)
augment(tree_fit, new_data = ring_test) %>%
  abalone_metrics(rings, .pred)
#> # A tibble: 2 × 3
#>   .metric .estimator .estimate
#>   <chr>   <chr>          <dbl>
#> 1 rmse    standard        2.31
#> 2 mape    standard       16.3

⚠️ DANGERS OF OVERFITTING ⚠️

Dangers of overfitting ⚠️

tree_fit %>%
  augment(ring_train)
#> # A tibble: 3,340 × 10
#>    sex    length diameter height whole_wei…¹ shuck…² visce…³ shell…⁴ rings .pred
#>    <fct>   <dbl>    <dbl>  <dbl>       <dbl>   <dbl>   <dbl>   <dbl> <dbl> <dbl>
#>  1 male    0.35     0.265  0.09        0.226  0.0995  0.0485   0.07      7  8.70
#>  2 infant  0.33     0.255  0.08        0.205  0.0895  0.0395   0.055     7  7.04
#>  3 infant  0.355    0.28   0.085       0.290  0.095   0.0395   0.115     7  8.41
#>  4 male    0.365    0.295  0.08        0.256  0.097   0.043    0.1       7  8.70
#>  5 male    0.465    0.355  0.105       0.480  0.227   0.124    0.125     8  8.92
#>  6 female  0.45     0.355  0.105       0.522  0.237   0.116    0.145     8  8.92
#>  7 infant  0.24     0.175  0.045       0.07   0.0315  0.0235   0.02      5  4.5 
#>  8 infant  0.205    0.15   0.055       0.042  0.0255  0.015    0.012     5  4.5 
#>  9 infant  0.21     0.15   0.05        0.042  0.0175  0.0125   0.015     4  4.5 
#> 10 infant  0.39     0.295  0.095       0.203  0.0875  0.045    0.075     7  7.04
#> # … with 3,330 more rows, and abbreviated variable names ¹​whole_weight,
#> #   ²​shucked_weight, ³​viscera_weight, ⁴​shell_weight
#> # ℹ Use `print(n = ...)` to see more rows

We call this “resubstitution” or “repredicting the training set”

Dangers of overfitting ⚠️

tree_fit %>%
  augment(ring_train) %>%
  rmse(rings, .pred)
#> # A tibble: 1 × 3
#>   .metric .estimator .estimate
#>   <chr>   <chr>          <dbl>
#> 1 rmse    standard        1.91

We call this a “resubstitution estimate”

Dangers of overfitting ⚠️

tree_fit %>%
  augment(ring_train) %>%
  rmse(rings, .pred)
#> # A tibble: 1 × 3
#>   .metric .estimator .estimate
#>   <chr>   <chr>          <dbl>
#> 1 rmse    standard        1.91
tree_fit %>%
  augment(ring_test) %>%
  rmse(rings, .pred)
#> # A tibble: 1 × 3
#>   .metric .estimator .estimate
#>   <chr>   <chr>          <dbl>
#> 1 rmse    standard        2.31

⚠️ Remember that we’re demonstrating overfitting

⚠️ Don’t use the test set until the end of your modeling analysis

Your turn

Use augment() to compute a regression metric like mae().

Compute the metrics for both training and testing data.

Notice the evidence of overfitting! ⚠️

05:00

Dangers of overfitting ⚠️

tree_fit %>%
  augment(ring_train) %>%
  mae(rings, .pred)
#> # A tibble: 1 × 3
#>   .metric .estimator .estimate
#>   <chr>   <chr>          <dbl>
#> 1 mae     standard        1.37
tree_fit %>%
  augment(ring_test) %>%
  mae(rings, .pred)
#> # A tibble: 1 × 3
#>   .metric .estimator .estimate
#>   <chr>   <chr>          <dbl>
#> 1 mae     standard        1.62
  • What if we want to compare more models?

  • And/or more model configurations?

  • And what if we want to understand whether these differences are important?

The testing data are precious 💎

How can we use the training data to compare and evaluate different models? 🤔

Cross-validation

Your turn

If we use 10 folds, what percent of the training data

  • ends up in analysis
  • ends up in assessment

for each fold?

03:00

Cross-validation

vfold_cv(ring_train) # v = 10 is default
#> #  10-fold cross-validation 
#> # A tibble: 10 × 2
#>    splits             id    
#>    <list>             <chr> 
#>  1 <split [3006/334]> Fold01
#>  2 <split [3006/334]> Fold02
#>  3 <split [3006/334]> Fold03
#>  4 <split [3006/334]> Fold04
#>  5 <split [3006/334]> Fold05
#>  6 <split [3006/334]> Fold06
#>  7 <split [3006/334]> Fold07
#>  8 <split [3006/334]> Fold08
#>  9 <split [3006/334]> Fold09
#> 10 <split [3006/334]> Fold10

Cross-validation

What is in this?

ring_folds <- vfold_cv(ring_train)
ring_folds$splits[1:3]
#> [[1]]
#> <Analysis/Assess/Total>
#> <3006/334/3340>
#> 
#> [[2]]
#> <Analysis/Assess/Total>
#> <3006/334/3340>
#> 
#> [[3]]
#> <Analysis/Assess/Total>
#> <3006/334/3340>
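
To peek inside a single split, rsample's analysis() and assessment() accessors return the underlying data frames (a small sketch; first_split is just an illustrative name):

first_split <- ring_folds$splits[[1]]
dim(analysis(first_split))    # the rows used for fitting (~3006 here)
dim(assessment(first_split))  # the rows held out for assessment (~334 here)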

Cross-validation

vfold_cv(ring_train, v = 5)
#> #  5-fold cross-validation 
#> # A tibble: 5 × 2
#>   splits             id   
#>   <list>             <chr>
#> 1 <split [2672/668]> Fold1
#> 2 <split [2672/668]> Fold2
#> 3 <split [2672/668]> Fold3
#> 4 <split [2672/668]> Fold4
#> 5 <split [2672/668]> Fold5

Cross-validation

vfold_cv(ring_train, strata = rings)
#> #  10-fold cross-validation using stratification 
#> # A tibble: 10 × 2
#>    splits             id    
#>    <list>             <chr> 
#>  1 <split [3004/336]> Fold01
#>  2 <split [3005/335]> Fold02
#>  3 <split [3005/335]> Fold03
#>  4 <split [3005/335]> Fold04
#>  5 <split [3005/335]> Fold05
#>  6 <split [3006/334]> Fold06
#>  7 <split [3007/333]> Fold07
#>  8 <split [3007/333]> Fold08
#>  9 <split [3008/332]> Fold09
#> 10 <split [3008/332]> Fold10

Stratification often helps, with very little downside

Cross-validation

We’ll use this setup:

set.seed(234)
ring_folds <- vfold_cv(ring_train, v = 5, strata = rings)
ring_folds
#> #  5-fold cross-validation using stratification 
#> # A tibble: 5 × 2
#>   splits             id   
#>   <list>             <chr>
#> 1 <split [2670/670]> Fold1
#> 2 <split [2672/668]> Fold2
#> 3 <split [2672/668]> Fold3
#> 4 <split [2673/667]> Fold4
#> 5 <split [2673/667]> Fold5

Set the seed when creating resamples

We are equipped with metrics and resamples!

Fit our model to the resamples

tree_res <- fit_resamples(tree_wflow, ring_folds)
tree_res
#> # Resampling results
#> # 5-fold cross-validation using stratification 
#> # A tibble: 5 × 4
#>   splits             id    .metrics         .notes          
#>   <list>             <chr> <list>           <list>          
#> 1 <split [2670/670]> Fold1 <tibble [2 × 4]> <tibble [0 × 3]>
#> 2 <split [2672/668]> Fold2 <tibble [2 × 4]> <tibble [0 × 3]>
#> 3 <split [2672/668]> Fold3 <tibble [2 × 4]> <tibble [0 × 3]>
#> 4 <split [2673/667]> Fold4 <tibble [2 × 4]> <tibble [0 × 3]>
#> 5 <split [2673/667]> Fold5 <tibble [2 × 4]> <tibble [0 × 3]>

Evaluating model performance

tree_res %>%
  collect_metrics()
#> # A tibble: 2 × 6
#>   .metric .estimator  mean     n std_err .config             
#>   <chr>   <chr>      <dbl> <int>   <dbl> <chr>               
#> 1 rmse    standard   2.43      5  0.0582 Preprocessor1_Model1
#> 2 rsq     standard   0.452     5  0.0158 Preprocessor1_Model1

We can reliably measure performance using only the training data 🎉
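
If you want metrics other than the defaults, fit_resamples() also accepts a metrics argument; here is a sketch reusing metric_set() from earlier (tree_res_mae is just an illustrative name):

tree_res_mae <- fit_resamples(
  tree_wflow,
  ring_folds,
  metrics = metric_set(rmse, mae)
)
collect_metrics(tree_res_mae)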

Comparing metrics

How do the metrics from resampling compare to the metrics from training and testing?

tree_res %>%
  collect_metrics() %>% 
  select(.metric, mean, std_err)
#> # A tibble: 2 × 3
#>   .metric  mean std_err
#>   <chr>   <dbl>   <dbl>
#> 1 rmse    2.43   0.0582
#> 2 rsq     0.452  0.0158

The RMSE previously was

  • 1.91 for the training set
  • 2.31 for the test set

Remember that:

⚠️ the training set gives you overly optimistic metrics

⚠️ the test set is precious

Evaluating model performance

# save the assessment set results
ctrl_abalone <- control_resamples(save_pred = TRUE)
tree_res <- fit_resamples(tree_wflow, ring_folds, control = ctrl_abalone)

tree_preds <- collect_predictions(tree_res)
tree_preds
#> # A tibble: 3,340 × 5
#>    id    .pred  .row rings .config             
#>    <chr> <dbl> <int> <dbl> <chr>               
#>  1 Fold1  7.79     1     7 Preprocessor1_Model1
#>  2 Fold1  8.39     3     7 Preprocessor1_Model1
#>  3 Fold1  7.06    10     7 Preprocessor1_Model1
#>  4 Fold1  9.92    23     7 Preprocessor1_Model1
#>  5 Fold1  9.93    24     8 Preprocessor1_Model1
#>  6 Fold1  7.06    25     7 Preprocessor1_Model1
#>  7 Fold1  7.06    30     8 Preprocessor1_Model1
#>  8 Fold1  8.74    34     8 Preprocessor1_Model1
#>  9 Fold1  9.47    38     8 Preprocessor1_Model1
#> 10 Fold1  4.36    39     5 Preprocessor1_Model1
#> # … with 3,330 more rows
#> # ℹ Use `print(n = ...)` to see more rows

tree_preds %>% 
  ggplot(aes(rings, .pred, color = id)) + 
  geom_abline(lty = 2, col = "gray", size = 1.5) +
  geom_point(alpha = 0.5) +
  coord_obs_pred()

Evaluating model performance

tree_res
#> # Resampling results
#> # 5-fold cross-validation using stratification 
#> # A tibble: 5 × 5
#>   splits             id    .metrics         .notes           .predictions      
#>   <list>             <chr> <list>           <list>           <list>            
#> 1 <split [2670/670]> Fold1 <tibble [2 × 4]> <tibble [0 × 3]> <tibble [670 × 4]>
#> 2 <split [2672/668]> Fold2 <tibble [2 × 4]> <tibble [0 × 3]> <tibble [668 × 4]>
#> 3 <split [2672/668]> Fold3 <tibble [2 × 4]> <tibble [0 × 3]> <tibble [668 × 4]>
#> 4 <split [2673/667]> Fold4 <tibble [2 × 4]> <tibble [0 × 3]> <tibble [667 × 4]>
#> 5 <split [2673/667]> Fold5 <tibble [2 × 4]> <tibble [0 × 3]> <tibble [667 × 4]>

Where are the fitted models??!?? 🗑️

They are discarded by default to keep the results lightweight. For more advanced use cases, you can extract and save them.
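
A sketch of how that extraction works, via the extract option of control_resamples() (ctrl_extract and tree_res_fits are illustrative names):

ctrl_extract <- control_resamples(extract = function(x) extract_fit_parsnip(x))
tree_res_fits <- fit_resamples(tree_wflow, ring_folds, control = ctrl_extract)
tree_res_fits$.extracts[[1]]  # extracted results for the first fold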

Parallel processing

  • Resampling can involve fitting a lot of models!

  • These models don’t depend on one another and can be run in parallel

We can use a parallel backend to do this:

cores <- 
  parallel::detectCores(logical = FALSE)
cl <- parallel::makePSOCKcluster(cores)
doParallel::registerDoParallel(cl)

# Now call `fit_resamples()`!

# Shut it down with:
foreach::registerDoSEQ()
parallel::stopCluster(cl)

Alternate resampling schemes

Bootstrapping

set.seed(123)
bootstraps(ring_train)
#> # Bootstrap sampling 
#> # A tibble: 25 × 2
#>    splits              id         
#>    <list>              <chr>      
#>  1 <split [3340/1213]> Bootstrap01
#>  2 <split [3340/1230]> Bootstrap02
#>  3 <split [3340/1233]> Bootstrap03
#>  4 <split [3340/1232]> Bootstrap04
#>  5 <split [3340/1207]> Bootstrap05
#>  6 <split [3340/1219]> Bootstrap06
#>  7 <split [3340/1242]> Bootstrap07
#>  8 <split [3340/1218]> Bootstrap08
#>  9 <split [3340/1234]> Bootstrap09
#> 10 <split [3340/1237]> Bootstrap10
#> # … with 15 more rows
#> # ℹ Use `print(n = ...)` to see more rows

Your turn

Create:

  • bootstrap folds (change times from the default)
  • validation set (use the reference guide to find the function)

Don’t forget to set a seed when you resample!

05:00

Bootstrapping

set.seed(322)
bootstraps(ring_train, times = 10)
#> # Bootstrap sampling 
#> # A tibble: 10 × 2
#>    splits              id         
#>    <list>              <chr>      
#>  1 <split [3340/1217]> Bootstrap01
#>  2 <split [3340/1230]> Bootstrap02
#>  3 <split [3340/1269]> Bootstrap03
#>  4 <split [3340/1221]> Bootstrap04
#>  5 <split [3340/1254]> Bootstrap05
#>  6 <split [3340/1224]> Bootstrap06
#>  7 <split [3340/1200]> Bootstrap07
#>  8 <split [3340/1200]> Bootstrap08
#>  9 <split [3340/1224]> Bootstrap09
#> 10 <split [3340/1220]> Bootstrap10
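
These bootstrap resamples drop into fit_resamples() just like the cross-validation folds did (a sketch; ring_boots is an illustrative name):

set.seed(322)
ring_boots <- bootstraps(ring_train, times = 10)
fit_resamples(tree_wflow, ring_boots)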

Validation set

set.seed(853)
validation_split(ring_train, strata = rings)
#> # Validation Set Split (0.75/0.25)  using stratification 
#> # A tibble: 1 × 2
#>   splits             id        
#>   <list>             <chr>     
#> 1 <split [2504/836]> validation

A validation set is just another type of resample
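
Because it is just another resample, it works with fit_resamples() too (a sketch; ring_val is an illustrative name):

set.seed(853)
ring_val <- validation_split(ring_train, strata = rings)
fit_resamples(tree_wflow, ring_val)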

Decision tree 🌳

Random forest 🌳🌲🌴🌵🌴🌳🌳🌴🌲🌵🌴🌲🌳🌴🌳🌵🌵🌴🌲🌲🌳🌴🌳🌴🌲🌴🌵🌴🌲🌴🌵🌲🌵🌴🌲🌳🌴🌵🌳🌴🌳🌴

Random forest 🌳🌲🌴🌵🌳🌳🌴🌲🌵🌴🌳🌵

  • Ensemble many decision tree models

  • All the trees vote! 🗳️

  • Bootstrap aggregating + random predictor sampling

Random forest often works well without tuning hyperparameters (more on this later!), as long as there are enough trees

Create a random forest model

rf_spec <- rand_forest(trees = 1000, mode = "regression")
rf_spec
#> Random Forest Model Specification (regression)
#> 
#> Main Arguments:
#>   trees = 1000
#> 
#> Computational engine: ranger
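
This spec leans on parsnip's defaults; an equivalent, more explicit way to write it (a sketch):

rf_spec <- rand_forest(trees = 1000) %>%
  set_mode("regression") %>%
  set_engine("ranger")  # ranger is the default engine for rand_forest()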

Create a random forest model

rf_wflow <- workflow(rings ~ ., rf_spec)
rf_wflow
#> ══ Workflow ══════════════════════════════════════════════════════════
#> Preprocessor: Formula
#> Model: rand_forest()
#> 
#> ── Preprocessor ──────────────────────────────────────────────────────
#> rings ~ .
#> 
#> ── Model ─────────────────────────────────────────────────────────────
#> Random Forest Model Specification (regression)
#> 
#> Main Arguments:
#>   trees = 1000
#> 
#> Computational engine: ranger

Your turn

Use fit_resamples() and rf_wflow to:

  • keep predictions
  • compute metrics
  • plot true vs. predicted values

08:00

Evaluating model performance

ctrl_abalone <- control_resamples(save_pred = TRUE)

# random forest uses random numbers so set the seed first

set.seed(2)
rf_res <- fit_resamples(rf_wflow, ring_folds, control = ctrl_abalone)
collect_metrics(rf_res)
#> # A tibble: 2 × 6
#>   .metric .estimator  mean     n std_err .config             
#>   <chr>   <chr>      <dbl> <int>   <dbl> <chr>               
#> 1 rmse    standard   2.17      5  0.0622 Preprocessor1_Model1
#> 2 rsq     standard   0.548     5  0.0153 Preprocessor1_Model1

collect_predictions(rf_res) %>% 
  ggplot(aes(rings, .pred, color = id)) + 
  geom_abline(lty = 2, col = "gray", size = 1.5) +
  geom_point(alpha = 0.5) +
  coord_obs_pred()

How can we compare multiple model workflows at once? 🧐

Evaluate a workflow set

workflow_set(list(rings ~ .), list(tree_spec, rf_spec))
#> # A workflow set/tibble: 2 × 4
#>   wflow_id              info             option    result    
#>   <chr>                 <list>           <list>    <list>    
#> 1 formula_decision_tree <tibble [1 × 4]> <opts[0]> <list [0]>
#> 2 formula_rand_forest   <tibble [1 × 4]> <opts[0]> <list [0]>

Evaluate a workflow set

workflow_set(list(rings ~ .), list(tree_spec, rf_spec)) %>%
  workflow_map("fit_resamples", resamples = ring_folds)
#> # A workflow set/tibble: 2 × 4
#>   wflow_id              info             option    result   
#>   <chr>                 <list>           <list>    <list>   
#> 1 formula_decision_tree <tibble [1 × 4]> <opts[1]> <rsmp[+]>
#> 2 formula_rand_forest   <tibble [1 × 4]> <opts[1]> <rsmp[+]>

Evaluate a workflow set

workflow_set(list(rings ~ .), list(tree_spec, rf_spec)) %>%
  workflow_map("fit_resamples", resamples = ring_folds) %>%
  rank_results()
#> # A tibble: 4 × 9
#>   wflow_id              .config  .metric  mean std_err     n prepr…¹ model  rank
#>   <chr>                 <chr>    <chr>   <dbl>   <dbl> <int> <chr>   <chr> <int>
#> 1 formula_rand_forest   Preproc… rmse    2.17   0.0642     5 formula rand…     1
#> 2 formula_rand_forest   Preproc… rsq     0.546  0.0158     5 formula rand…     1
#> 3 formula_decision_tree Preproc… rmse    2.43   0.0582     5 formula deci…     2
#> 4 formula_decision_tree Preproc… rsq     0.452  0.0158     5 formula deci…     2
#> # … with abbreviated variable name ¹​preprocessor
  • Change the metric used for ranking with the rank_metric argument (see the sketch below)

  • There's lots more available with workflow sets, like collect_metrics() and autoplot() methods!
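
A sketch covering both points, ranking by RMSE explicitly and plotting the comparison (wf_set_res is just an illustrative name):

wf_set_res <- workflow_set(list(rings ~ .), list(tree_spec, rf_spec)) %>%
  workflow_map("fit_resamples", resamples = ring_folds)

rank_results(wf_set_res, rank_metric = "rmse")
autoplot(wf_set_res)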

Your turn

When do you think a workflow set would be useful?

03:00

The final fit

Suppose that we choose to use our random forest model.

Let’s fit the model on the training set and verify our performance using the test set.

We’ve shown you fit() and predict() (+ augment()) but there is a shortcut:

# ring_split has train + test info
final_fit <- last_fit(rf_wflow, ring_split) 

final_fit
#> # Resampling results
#> # Manual resampling 
#> # A tibble: 1 × 6
#>   splits             id               .metrics .notes   .predictions .workflow 
#>   <list>             <chr>            <list>   <list>   <list>       <list>    
#> 1 <split [3340/837]> train/test split <tibble> <tibble> <tibble>     <workflow>

What is in final_fit?

collect_metrics(final_fit)
#> # A tibble: 2 × 4
#>   .metric .estimator .estimate .config             
#>   <chr>   <chr>          <dbl> <chr>               
#> 1 rmse    standard       2.09  Preprocessor1_Model1
#> 2 rsq     standard       0.584 Preprocessor1_Model1

These are metrics computed with the test set
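
If you want something other than the default metrics here, last_fit() also takes a metrics argument (a sketch; final_fit_mae is an illustrative name):

final_fit_mae <- last_fit(rf_wflow, ring_split, metrics = metric_set(rmse, mae))
collect_metrics(final_fit_mae)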

What is in final_fit?

collect_predictions(final_fit)
#> # A tibble: 837 × 5
#>    id               .pred  .row rings .config             
#>    <chr>            <dbl> <int> <dbl> <chr>               
#>  1 train/test split 10.5      3     9 Preprocessor1_Model1
#>  2 train/test split  8.46     6     8 Preprocessor1_Model1
#>  3 train/test split  8.78     9     9 Preprocessor1_Model1
#>  4 train/test split 10.9     13    11 Preprocessor1_Model1
#>  5 train/test split  8.32    20     9 Preprocessor1_Model1
#>  6 train/test split 10.5     25    10 Preprocessor1_Model1
#>  7 train/test split 11.0     28    12 Preprocessor1_Model1
#>  8 train/test split 11.3     29    15 Preprocessor1_Model1
#>  9 train/test split 10.7     33    18 Preprocessor1_Model1
#> 10 train/test split 10.5     39    11 Preprocessor1_Model1
#> # … with 827 more rows
#> # ℹ Use `print(n = ...)` to see more rows

These are predictions for the test set

collect_predictions(final_fit) %>%
  ggplot(aes(rings, .pred)) + 
  geom_abline(lty = 2, col = "deeppink4", size = 1.5) +
  geom_point(alpha = 0.5) +
  coord_obs_pred()

What is in final_fit?

extract_workflow(final_fit)
#> ══ Workflow [trained] ════════════════════════════════════════════════
#> Preprocessor: Formula
#> Model: rand_forest()
#> 
#> ── Preprocessor ──────────────────────────────────────────────────────
#> rings ~ .
#> 
#> ── Model ─────────────────────────────────────────────────────────────
#> Ranger result
#> 
#> Call:
#>  ranger::ranger(x = maybe_data_frame(x), y = y, num.trees = ~1000,      num.threads = 1, verbose = FALSE, seed = sample.int(10^5,          1)) 
#> 
#> Type:                             Regression 
#> Number of trees:                  1000 
#> Sample size:                      3340 
#> Number of independent variables:  8 
#> Mtry:                             2 
#> Target node size:                 5 
#> Variable importance mode:         none 
#> Splitrule:                        variance 
#> OOB prediction error (MSE):       4.681506 
#> R squared (OOB):                  0.5492882

Use this fitted workflow for prediction on new data, like when deploying your model
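
A sketch of that prediction step, where new_abalone stands in for a hypothetical data frame with the same predictor columns:

fitted_wflow <- extract_workflow(final_fit)
predict(fitted_wflow, new_data = new_abalone)
# or augment(fitted_wflow, new_abalone) to keep the predictors alongside .pred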

Going farther

decision_tree(mode = "classification")
#> Decision Tree Model Specification (classification)
#> 
#> Computational engine: rpart

Working with a classification model?

  • Classification metrics are different, and may be more complicated

  • Different classification metrics are appropriate depending on your use case
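
For example, a binary classification analysis might pair a class metric with a probability metric in one metric set. A sketch, where class_fit, class_test, the outcome class, and the probability column .pred_yes are all hypothetical:

class_metrics <- metric_set(accuracy, roc_auc)

augment(class_fit, new_data = class_test) %>%
  class_metrics(truth = class, estimate = .pred_class, .pred_yes)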

Your turn

Before lunch discussion!

Which model do you think you would decide to use?

What surprised you the most?

What is one thing you are looking forward to next?

05:00