class: title-slide, center, bottom # tidymodels ## LA RUG June meeting ### Julia Silge --- name: clouds2 background-image: url(images/Clouds2.jpg) background-size: cover --- template: clouds2 class: middle, center # <i class="fas fa-heart"></i> Many thanks to Alison Hill and Desirée De Leon for their contributions to this talk --- name: clouds class: center, middle background-image: url(images/Clouds.jpg) background-size: cover --- template: clouds ## .big-text[Hello] --- template: clouds class: middle, center ### Julia Silge <img style="border-radius: 50%;" src="https://github.com/juliasilge.png" width="150px"/> [
@juliasilge](https://github.com/juliasilge) [
@juliasilge](https://twitter.com/juliasilge) [
youtube.com/juliasilge](https://youtube.com/juliasilge) [
juliasilge.com](https://juliasilge.com) --- class: inverse, middle, center # Machine learning with tidymodels --- class: top, center background-image: url(images/intro.002.jpeg) background-size: cover --- class: top, center background-image: url(images/intro.003.jpeg) background-size: cover --- background-image: url(images/tm-org.png) background-size: contain --- ```r library(tidymodels) ## ── Attaching packages ───────────────────────────────────────────────────────────────────── tidymodels 0.1.0 ── ## ✓ broom 0.5.6 ✓ rsample 0.0.7 ## ✓ dials 0.0.7 ✓ tune 0.1.0 ## ✓ infer 0.5.2 ✓ workflows 0.1.1 ## ✓ parsnip 0.1.1 ✓ yardstick 0.0.6 ## ✓ recipes 0.1.12 ## ── Conflicts ──────────────────────────────────────────────────────────────────────── tidymodels_conflicts() ── ## x scales::discard() masks purrr::discard() ## x dplyr::filter() masks stats::filter() ## x recipes::fixed() masks stringr::fixed() ## x dplyr::lag() masks stats::lag() ## x yardstick::spec() masks readr::spec() ## x recipes::step() masks stats::step() ``` --- class: middle, center, frame # Three topics for today What makes a model? Spend your data budget wisely Feature engineering --- class: title-slide, center, bottom # What makes a model? ## tidymodels --- name: train-love background-image: url(images/train.jpg) background-size: contain background-color: #f6f6f6 class: bottom Modeling in R has heterogeneous practices around model interfaces, fitting, and execution. --- class: middle, center, frame # parsnip <iframe src="https://parsnip.tidymodels.org" width="100%" height="400px"></iframe> --- class: middle, frame # .center[To specify a model in tidymodels] .right-column[ 1\. Pick a .display[model] 2\. Set the .display[mode] (if needed) 3\. Set the .display[engine] ] --- class: middle, frame .fade[ # .center[To specify a model in tidymodels] ] .right-column[ 1\. Pick a .display[model] .fade[ 2\. Set the .display[mode] (if needed) 3\. Set the .display[engine] ] ] --- class: middle, center, frame # 1\. 
Pick a .display[model] All available models are listed at <https://tidymodels.org/find/parsnip> <iframe src="https://tidymodels.org/find/parsnip" width="100%" height="400px"></iframe> --- class: middle .center[ # `linear_reg()` Specifies a linear regression model ] ```r linear_reg(penalty = NULL, mixture = NULL) ``` --- class: middle .center[ # `decision_tree()` Specifies a decision tree model ] ```r decision_tree(cost_complexity = NULL, tree_depth = NULL, min_n = NULL) ``` --- class: middle .center[ # `rand_forest()` Specifies a random forest model ] ```r rand_forest(mtry = NULL, trees = NULL, min_n = NULL) ``` --- class: middle, frame .fade[ # .center[To specify a model in tidymodels] ] .right-column[ .fade[ 1\. Pick a .display[model] ] 2\. Set the .display[mode] (if needed) .fade[ 3\. Set the .display[engine] ] ] --- class: middle # `set_mode()` Some models can solve multiple types of problems ```r linear_reg() %>% set_mode(mode = "regression") ## Linear Regression Model Specification (regression) ``` --- class: middle # `set_mode()` Some models can solve multiple types of problems ```r logistic_reg() %>% set_mode(mode = "classification") ## Logistic Regression Model Specification (classification) ``` --- class: middle # `set_mode()` Some models can solve multiple types of problems ```r decision_tree() %>% set_mode(mode = "classification") ## Decision Tree Model Specification (classification) ``` --- class: middle # `set_mode()` Some models can solve multiple types of problems ```r decision_tree() %>% set_mode(mode = "regression") ## Decision Tree Model Specification (regression) ``` --- class: middle, frame .fade[ # .center[To specify a model in tidymodels] ] .right-column[ .fade[ 1\. Pick a .display[model] 2\. Set the .display[mode] (if needed) ] 3\. 
Set the .display[engine] ] --- class: middle # `set_engine()` The same model type can be implemented by multiple computational engines ```r rand_forest() %>% set_engine("randomForest") ## Random Forest Model Specification (unknown) ## ## Computational engine: randomForest ``` --- class: middle # `set_engine()` The same model type can be implemented by multiple computational engines ```r rand_forest() %>% set_engine("ranger") ## Random Forest Model Specification (unknown) ## ## Computational engine: ranger ``` --- class: middle # `set_engine()` The same model type can be implemented by multiple computational engines ```r linear_reg() %>% set_engine("lm") ## Linear Regression Model Specification (regression) ## ## Computational engine: lm ``` --- class: middle # `set_engine()` The same model type can be implemented by multiple computational engines ```r linear_reg() %>% set_engine("spark") ## Linear Regression Model Specification (regression) ## ## Computational engine: spark ``` --- class: middle, frame # .center[What makes a model?] 
```r nearest_neighbor() %>% set_engine("kknn") %>% set_mode("regression") ## K-Nearest Neighbor Model Specification (regression) ## ## Computational engine: kknn ``` --- class: middle, frame # .center[Harmonize heterogeneous interfaces] |**parsnip** |**xgboost** |**C5.0** |**spark** | |:--------------|:--------------------|:------------|:-----------------------------------| |tree_depth |max_depth (6) |NA |max_depth (5) | |trees |nrounds (15) |trials (15) |max_iter (20) | |learn_rate |eta (0.3) |NA |step_size (0.1) | |mtry |colsample_bytree (1) |NA |feature_subset_strategy (1 or 5) | |min_n |min_child_weight (1) |minCases (2) |min_instances_per_node (1) | |loss_reduction |gamma (0) |NA |min_info_gain (0) | |sample_size |subsample (1) |sample (0) |subsampling_rate (1) | |stop_iter |early_stop |NA |NA | .footnote[https://parsnip.tidymodels.org/reference/boost_tree.html] --- class: title-slide, center, bottom # Spending your data budget ## tidymodels --- class: middle, center, frame # rsample <iframe src="https://rsample.tidymodels.org" width="100%" height="400px"></iframe> --- class: middle, center, frame # Data splitting <img src="index_files/figure-html/all-split-1.png" width="864" /> --- # `initial_split()` Splits data randomly into a single testing and a single training set ```r ames_split <- initial_split(ames, prop = 0.75) ames_split ## <Analysis/Assess/Total> ## <2198/732/2930> ``` --- # `training()` and `testing()` Create training and testing sets from an `rsplit` ```r ames_train <- training(ames_split) ames_train ## # A tibble: 2,198 x 81 ## MS_SubClass MS_Zoning Lot_Frontage Lot_Area Street Alley ## <fct> <fct> <dbl> <int> <fct> <fct> ## 1 One_Story_… Resident… 141 31770 Pave No_A… ## 2 One_Story_… Resident… 80 11622 Pave No_A… ## 3 One_Story_… Resident… 81 14267 Pave No_A… ## 4 One_Story_… Resident… 93 11160 Pave No_A… ## 5 Two_Story_… Resident… 74 13830 Pave No_A… ## 6 Two_Story_… Resident… 78 9978 Pave No_A… ## 7 One_Story_… Resident… 41 4920 Pave No_A… 
## 8 One_Story_… Resident… 43 5005 Pave No_A… ## 9 One_Story_… Resident… 39 5389 Pave No_A… ## 10 Two_Story_… Resident… 60 7500 Pave No_A… ## # … with 2,188 more rows, and 75 more variables: ## # Lot_Shape <fct>, Land_Contour <fct>, Utilities <fct>, ## # Lot_Config <fct>, Land_Slope <fct>, … ``` --- # `training()` and `testing()` Create training and testing sets from an `rsplit` ```r ames_test <- testing(ames_split) ames_test ## # A tibble: 732 x 81 ## MS_SubClass MS_Zoning Lot_Frontage Lot_Area Street Alley ## <fct> <fct> <dbl> <int> <fct> <fct> ## 1 Two_Story_… Resident… 75 10000 Pave No_A… ## 2 Two_Story_… Resident… 63 8402 Pave No_A… ## 3 One_Story_… Resident… 0 6820 Pave No_A… ## 4 One_Story_… Resident… 88 11394 Pave No_A… ## 5 One_Story_… Resident… 0 12537 Pave No_A… ## 6 One_Story_… Resident… 70 10500 Pave No_A… ## 7 One_Story_… Resident… 26 5858 Pave No_A… ## 8 Two_Story_… Resident… 21 1680 Pave No_A… ## 9 One_Story_… Resident… 98 11478 Pave No_A… ## 10 One_Story_… Resident… 95 12182 Pave No_A… ## # … with 722 more rows, and 75 more variables: ## # Lot_Shape <fct>, Land_Contour <fct>, Utilities <fct>, ## # Lot_Config <fct>, Land_Slope <fct>, … ``` --- background-image: url(images/diamonds.jpg) background-size: contain background-position: left class: middle, center background-color: #f5f5f5 .pull-right[ ## The .display[testing set] is precious ## We can only use it once! ] --- template: clouds ## How can we use the training set to compare, evaluate, and tune models? 
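---

class: middle

The slides that follow answer this with resampling. As a minimal sketch (not from the original deck; `lm_spec` and the two-predictor formula are illustrative choices), a model specification can be fit to each resample with `fit_resamples()` from tune and summarized with `collect_metrics()`:

```r
library(tidymodels)
data(ames, package = "modeldata")

set.seed(123)
ames_split <- initial_split(ames, prop = 0.75)
ames_train <- training(ames_split)

# An illustrative linear regression specification
lm_spec <- linear_reg() %>%
  set_engine("lm") %>%
  set_mode("regression")

# Resample the training set only; the test set stays untouched
set.seed(234)
ames_folds <- vfold_cv(ames_train, strata = Sale_Price)

# Fit on each fold's analysis set, assess on its assessment set
lm_res <- fit_resamples(
  lm_spec,
  Sale_Price ~ Gr_Liv_Area + Year_Built,
  resamples = ames_folds
)

collect_metrics(lm_res)  # mean RMSE and R-squared across folds
```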
--- background-image: url(https://www.tidymodels.org/start/resampling/img/resampling.svg) background-size: 60% --- ```r set.seed(123) vfold_cv(ames_train, strata = Sale_Price) ## # 10-fold cross-validation using stratification ## # A tibble: 10 x 2 ## splits id ## <list> <chr> ## 1 <split [2K/221]> Fold01 ## 2 <split [2K/220]> Fold02 ## 3 <split [2K/220]> Fold03 ## 4 <split [2K/220]> Fold04 ## 5 <split [2K/220]> Fold05 ## 6 <split [2K/220]> Fold06 ## 7 <split [2K/220]> Fold07 ## 8 <split [2K/220]> Fold08 ## 9 <split [2K/219]> Fold09 ## 10 <split [2K/218]> Fold10 ``` --- class: middle, center, inverse # Cross-validation --- background-image: url(images/cross-validation/Slide2.png) background-size: contain --- background-image: url(images/cross-validation/Slide3.png) background-size: contain --- background-image: url(images/cross-validation/Slide4.png) background-size: contain --- background-image: url(images/cross-validation/Slide5.png) background-size: contain --- background-image: url(images/cross-validation/Slide6.png) background-size: contain --- background-image: url(images/cross-validation/Slide7.png) background-size: contain --- background-image: url(images/cross-validation/Slide8.png) background-size: contain --- background-image: url(images/cross-validation/Slide9.png) background-size: contain --- background-image: url(images/cross-validation/Slide10.png) background-size: contain --- background-image: url(images/cross-validation/Slide11.png) background-size: contain --- ```r set.seed(123) vfold_cv(ames_train, strata = Sale_Price) ## # 10-fold cross-validation using stratification ## # A tibble: 10 x 2 ## splits id ## <list> <chr> ## 1 <split [2K/221]> Fold01 ## 2 <split [2K/220]> Fold02 ## 3 <split [2K/220]> Fold03 ## 4 <split [2K/220]> Fold04 ## 5 <split [2K/220]> Fold05 ## 6 <split [2K/220]> Fold06 ## 7 <split [2K/220]> Fold07 ## 8 <split [2K/220]> Fold08 ## 9 <split [2K/219]> Fold09 ## 10 <split [2K/218]> Fold10 ``` --- class: middle, center .center[ # 
Resampling methods .display[Spend your data wisely] to create simulated validation sets ] ```r vfold_cv() loo_cv() mc_cv() bootstraps() ``` --- class: title-slide, center, bottom # Feature engineering ## tidymodels --- background-image: url(images/two-birds2-alpha.png) background-size: contain background-position: left class: middle, center background-color: #f5f5f5 .pull-right[ # Let's go back to the beginning ] --- class: middle, center, frame # recipes <iframe src="https://recipes.tidymodels.org" width="100%" height="400px"></iframe> --- background-image: url(images/garbage.jpg) background-size: contain background-position: right class: middle, center background-color: #f5f5f5 .pull-left[ # Build better predictors ] --- class: middle, center, frame # To build a recipe 1\. Start the `recipe()` 2\. Define the .display[variables] involved 3\. Describe preprocessing .display[step-by-step] --- class: middle, center # `recipe()` Creates a recipe for a set of variables ```r recipe(Sale_Price ~ ., data = ames) ``` --- class: middle, center # .center[`step_*()`] Complete list at <https://recipes.tidymodels.org/reference/> <iframe src="https://recipes.tidymodels.org/reference/" width="100%" height="400px"></iframe> --- background-image: url(images/cranes.jpg) background-position: left background-size: contain class: middle .right-column[ # Preprocessing options + Encode categorical predictors + Center and scale variables + Handle class imbalance + Impute missing data + Perform dimensionality reduction + *A lot more!* ] --- Estimate parameters for preprocessing using the .display[training data] ```r pca_rec <- recipe(Sale_Price ~ ., data = ames_train) %>% step_novel(all_nominal()) %>% step_dummy(all_nominal()) %>% step_zv(all_predictors()) %>% step_normalize(all_predictors()) %>% step_pca(all_predictors(), num_comp = 5) ``` --- ```r prep(pca_rec) ## Data Recipe ## ## Inputs: ## ## role #variables ## outcome 1 ## predictor 80 ## ## Training data contained 2198 data points
and no missing data. ## ## Operations: ## ## Novel factor level assignment for MS_SubClass, MS_Zoning, Street, Alley, Lot_Shape, Land_Contour, ... [trained] ## Dummy variables from MS_SubClass, MS_Zoning, Street, Alley, Lot_Shape, Land_Contour, Utilities, ... [trained] ## Zero variance filter removed MS_SubClass_new, MS_Zoning_new, Street_new, Alley_new, ... [trained] ## Centering and scaling for Lot_Frontage, Lot_Area, Year_Built, Year_Remod_Add, Mas_Vnr_Area, ... [trained] ## PCA extraction with Lot_Frontage, Lot_Area, Year_Built, Year_Remod_Add, Mas_Vnr_Area, ... [trained] ``` --- background-image: url(images/workflows/workflows.008.jpeg) background-size: contain --- template: clouds ## .big-text[More to learn!] --- template: clouds2 class: middle, right ## Thanks! [
tidymodels.org](https://tidymodels.org) [
supervised-ml-course.netlify.app](https://supervised-ml-course.netlify.app) [
www.feat.engineering](http://www.feat.engineering/) [
@tidymodels](https://github.com/tidymodels) [
@juliasilge](https://twitter.com/juliasilge) [
youtube.com/juliasilge](https://youtube.com/juliasilge)