Machine learning with tidymodels
Some model or preprocessing parameters cannot be estimated directly from your data
How do we know that 4️⃣ is a good value?
tune_*() functions to tune models
The two main strategies for optimization are:
Grid search 💠 which tests a pre-defined set of candidate values
Iterative search 🌀 which suggests/estimates new values of candidate parameters to evaluate
ring_rec <-
recipe(rings ~ ., data = ring_train) %>%
step_dummy(all_nominal_predictors()) %>%
step_ns(shucked_weight, deg_free = tune())
spline_wf <- workflow(ring_rec, linear_reg())
spline_wf
#> ══ Workflow ══════════════════════════════════════════════════════════
#> Preprocessor: Recipe
#> Model: linear_reg()
#>
#> ── Preprocessor ──────────────────────────────────────────────────────
#> 2 Recipe Steps
#>
#> • step_dummy()
#> • step_ns()
#>
#> ── Model ─────────────────────────────────────────────────────────────
#> Linear Regression Model Specification (regression)
#>
#> Computational engine: lm
set.seed(123)
spline_res <- tune_grid(spline_wf, ring_folds)
spline_res
#> # Tuning results
#> # 5-fold cross-validation using stratification
#> # A tibble: 5 × 4
#> splits id .metrics .notes
#> <list> <chr> <list> <list>
#> 1 <split [2670/670]> Fold1 <tibble [18 × 5]> <tibble [0 × 3]>
#> 2 <split [2672/668]> Fold2 <tibble [18 × 5]> <tibble [0 × 3]>
#> 3 <split [2672/668]> Fold3 <tibble [18 × 5]> <tibble [0 × 3]>
#> 4 <split [2673/667]> Fold4 <tibble [18 × 5]> <tibble [0 × 3]>
#> 5 <split [2673/667]> Fold5 <tibble [18 × 5]> <tibble [0 × 3]>
Use tune_grid() to tune your workflow with a recipe.
Collect the metrics from the results.
Use autoplot() to visualize the results.
Try show_best() to understand which parameter values are best.
05:00
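The metrics and best candidates are shown below; for the visualization step, autoplot() can be called directly on the tuning results (a minimal sketch, the plot itself is not reproduced here):
autoplot(spline_res)  # plots each metric (rmse, rsq) against the deg_free candidates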
collect_metrics(spline_res)
#> # A tibble: 18 × 7
#> deg_free .metric .estimator mean n std_err .config
#> <int> <chr> <chr> <dbl> <int> <dbl> <chr>
#> 1 13 rmse standard 2.19 5 0.0397 Preprocessor1_Model1
#> 2 13 rsq standard 0.540 5 0.00888 Preprocessor1_Model1
#> 3 8 rmse standard 2.18 5 0.0395 Preprocessor2_Model1
#> 4 8 rsq standard 0.541 5 0.00836 Preprocessor2_Model1
#> 5 11 rmse standard 2.18 5 0.0402 Preprocessor3_Model1
#> 6 11 rsq standard 0.541 5 0.00895 Preprocessor3_Model1
#> 7 4 rmse standard 2.18 5 0.0403 Preprocessor4_Model1
#> 8 4 rsq standard 0.542 5 0.00790 Preprocessor4_Model1
#> 9 7 rmse standard 2.18 5 0.0398 Preprocessor5_Model1
#> 10 7 rsq standard 0.542 5 0.00836 Preprocessor5_Model1
#> 11 14 rmse standard 2.19 5 0.0409 Preprocessor6_Model1
#> 12 14 rsq standard 0.540 5 0.00921 Preprocessor6_Model1
#> 13 2 rmse standard 2.20 5 0.0428 Preprocessor7_Model1
#> 14 2 rsq standard 0.535 5 0.00820 Preprocessor7_Model1
#> 15 6 rmse standard 2.18 5 0.0406 Preprocessor8_Model1
#> 16 6 rsq standard 0.542 5 0.00805 Preprocessor8_Model1
#> 17 3 rmse standard 2.18 5 0.0411 Preprocessor9_Model1
#> 18 3 rsq standard 0.542 5 0.00843 Preprocessor9_Model1
collect_metrics(spline_res, summarize = FALSE)
#> # A tibble: 90 × 6
#> id deg_free .metric .estimator .estimate .config
#> <chr> <int> <chr> <chr> <dbl> <chr>
#> 1 Fold1 13 rmse standard 2.11 Preprocessor1_Model1
#> 2 Fold1 13 rsq standard 0.513 Preprocessor1_Model1
#> 3 Fold2 13 rmse standard 2.24 Preprocessor1_Model1
#> 4 Fold2 13 rsq standard 0.537 Preprocessor1_Model1
#> 5 Fold3 13 rmse standard 2.31 Preprocessor1_Model1
#> 6 Fold3 13 rsq standard 0.544 Preprocessor1_Model1
#> 7 Fold4 13 rmse standard 2.11 Preprocessor1_Model1
#> 8 Fold4 13 rsq standard 0.569 Preprocessor1_Model1
#> 9 Fold5 13 rmse standard 2.15 Preprocessor1_Model1
#> 10 Fold5 13 rsq standard 0.540 Preprocessor1_Model1
#> # … with 80 more rows
#> # ℹ Use `print(n = ...)` to see more rows
show_best(spline_res)
#> # A tibble: 5 × 7
#> deg_free .metric .estimator mean n std_err .config
#> <int> <chr> <chr> <dbl> <int> <dbl> <chr>
#> 1 3 rmse standard 2.18 5 0.0411 Preprocessor9_Model1
#> 2 6 rmse standard 2.18 5 0.0406 Preprocessor8_Model1
#> 3 4 rmse standard 2.18 5 0.0403 Preprocessor4_Model1
#> 4 7 rmse standard 2.18 5 0.0398 Preprocessor5_Model1
#> 5 11 rmse standard 2.18 5 0.0402 Preprocessor3_Model1
Try different values and measure their performance
Find good values for these parameters
Finalize the model by fitting it with these parameters to the entire training set
You can control the grid used to search the parameter space.
Use the grid_*() functions, or create your own tibble.
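For example, a grid for the spline workflow can be a hand-made tibble or come from the dials grid_*() helpers (a sketch; the candidate values and the spline_res_custom name are just for illustration):
# A hand-made grid: a tibble whose column names match the tune() labels
spline_grid <- tibble(deg_free = c(2, 4, 6, 8, 10))

# Or a regular grid from dials; deg_free() is the dials parameter for spline degrees of freedom
spline_grid <- grid_regular(deg_free(range = c(2, 14)), levels = 5)

set.seed(123)
spline_res_custom <- tune_grid(spline_wf, ring_folds, grid = spline_grid)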
Series of splits or if/then statements based on predictors
First the tree grows until some condition is met (maximum depth, no more data)
Then the tree is pruned to reduce its complexity
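In parsnip terms, the growing and pruning choices map onto arguments of decision_tree(); a minimal sketch of a tunable single-tree specification for this regression problem (not part of the original code) might look like:
tree_spec <-
  decision_tree(cost_complexity = tune(), tree_depth = tune(), min_n = tune()) %>%
  set_mode("regression") %>%
  set_engine("rpart")  # rpart handles pruning via the cost_complexity penalty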
Boosting methods fit a sequence of tree-based models:
Each tree is dependent on the one before and tries to compensate for any poor results in the previous trees
This is like gradient ascent/descent methods
Most modern boosting methods have a lot of tuning parameters!
For tree growth and pruning (min_n, tree_depth, etc.)
For boosting (trees, stop_iter, learn_rate)
We’ll use early stopping to stop boosting when a few iterations produce consecutively worse results.
xgb_spec <-
boost_tree(
trees = 500, min_n = tune(), stop_iter = tune(), tree_depth = tune(),
learn_rate = tune(), loss_reduction = tune()
) %>%
set_mode("regression") %>%
set_engine("xgboost", validation = 0.1)
xgb_rec <-
recipe(rings ~ ., data = ring_train) %>%
step_dummy(all_nominal_predictors())
xgb_wf <- workflow(xgb_rec, xgb_spec)
Create your boosted tree workflow.
03:00
This will take some time to run ⏳
Start tuning the boosted tree model!
We won’t wait for everyone’s tuning to finish, but take this time to get it started before we move on.
03:00
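The call that produced xgb_res is not shown above; judging from the output that follows (a .predictions column and 30 metric rows per fold), it was something along these lines (the seed and grid size are assumptions):
set.seed(9)
xgb_res <-
  tune_grid(
    xgb_wf,
    ring_folds,
    grid = 15,                                # assumed: 15 candidates x 2 metrics = 30 rows per fold
    control = control_grid(save_pred = TRUE)  # needed to keep the .predictions column
  )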
xgb_res
#> # Tuning results
#> # 5-fold cross-validation using stratification
#> # A tibble: 5 × 5
#> splits id .metrics .notes .predictions
#> <list> <chr> <list> <list> <list>
#> 1 <split [2670/670]> Fold1 <tibble [30 × 9]> <tibble [0 × 3]> <tibble>
#> 2 <split [2672/668]> Fold2 <tibble [30 × 9]> <tibble [0 × 3]> <tibble>
#> 3 <split [2672/668]> Fold3 <tibble [30 × 9]> <tibble [0 × 3]> <tibble>
#> 4 <split [2673/667]> Fold4 <tibble [30 × 9]> <tibble [0 × 3]> <tibble>
#> 5 <split [2673/667]> Fold5 <tibble [30 × 9]> <tibble [0 × 3]> <tibble>
Best linear regression results:
Can you get better RMSE results with xgboost?
Try increasing learn_rate beyond the original range.
20:00
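One way to push learn_rate beyond its default range is to pass a modified parameter set to tune_grid() (a sketch; the widened range and the object names are assumptions):
xgb_param <-
  xgb_wf %>%
  extract_parameter_set_dials() %>%
  update(learn_rate = learn_rate(range = c(-3, 0)))  # log10 scale, so rates up to 1.0

set.seed(9)
xgb_res_wide <-
  tune_grid(
    xgb_wf, ring_folds, grid = 15, param_info = xgb_param,
    control = control_grid(save_pred = TRUE)
  )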
best_rmse <- select_best(spline_res, metric = "rmse")
final_res <-
spline_wf %>%
finalize_workflow(best_rmse) %>%
last_fit(ring_split)
final_res
#> # Resampling results
#> # Manual resampling
#> # A tibble: 1 × 6
#> splits id .metrics .notes .predictions .workflow
#> <list> <chr> <list> <list> <list> <list>
#> 1 <split [3340/837]> train/test split <tibble> <tibble> <tibble> <workflow>
Remember that last_fit() fits one time with the training set, then evaluates one time with the testing set.
Finalize your workflow with the best parameters.
You could use either the spline or xgboost workflow.
Create a final fit.
08:00
Holdout results from tuning:
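These come from the resampling object and can be compared against the test set metrics from the final fit (a sketch; the metric choice is an assumption, and both functions are from the tune package):
show_best(spline_res, metric = "rmse", n = 1)  # best holdout (assessment set) estimate from tuning
collect_metrics(final_res)                     # metrics computed one time on the test set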
Extract the final fitted workflow (fit using the training set):
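A sketch of the extraction (the predict() call is only there to show the object is ready to use; testing() is from rsample):
fitted_wf <- extract_workflow(final_res)
fitted_wf

predict(fitted_wf, testing(ring_split)[1:3, ])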
Use explainers to characterize the model and the predictions
Create an applicability domain model to help monitor our data over time