Machine learning with tidymodels
N = 4177
rings is the outcome
sex is a nominal predictor
shucked_weight and diameter are numeric predictors

abalone
#> # A tibble: 4,177 × 9
#> sex length diameter height whole_weight shucked_we…¹ visce…² shell…³ rings
#> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 male 0.455 0.365 0.095 0.514 0.224 0.101 0.15 15
#> 2 male 0.35 0.265 0.09 0.226 0.0995 0.0485 0.07 7
#> 3 female 0.53 0.42 0.135 0.677 0.256 0.142 0.21 9
#> 4 male 0.44 0.365 0.125 0.516 0.216 0.114 0.155 10
#> 5 infant 0.33 0.255 0.08 0.205 0.0895 0.0395 0.055 7
#> 6 infant 0.425 0.3 0.095 0.352 0.141 0.0775 0.12 8
#> 7 female 0.53 0.415 0.15 0.778 0.237 0.142 0.33 20
#> 8 female 0.545 0.425 0.125 0.768 0.294 0.150 0.26 16
#> 9 male 0.475 0.37 0.125 0.509 0.216 0.112 0.165 9
#> 10 female 0.55 0.44 0.15 0.894 0.314 0.151 0.32 19
#> # … with 4,167 more rows, and abbreviated variable names ¹shucked_weight,
#> # ²viscera_weight, ³shell_weight
#> # ℹ Use `print(n = ...)` to see more rows
For machine learning, we typically split data into training and test sets:
Do not 🚫 use the test set during training.
The more data we spend 🤑, the better estimates we'll get.
Spending too much data in training prevents us from computing a good assessment of predictive performance.
Spending too much data in testing prevents us from computing a good estimate of model parameters.
When is a good time to split your data?
03:00
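The ring_train and ring_test printouts below come from an initial train/test split. Here is a minimal sketch of how such a split can be made with rsample; the seed value and the split object name ring_split are placeholders, and the row counts shown below are consistent with the default proportion of 3/4:

```r
library(tidymodels)  # loads rsample (and parsnip, recipes, etc.)

set.seed(123)  # assumed seed; any fixed value makes the split reproducible
ring_split <- initial_split(abalone)   # default prop = 3/4
ring_train <- training(ring_split)     # training portion
ring_test  <- testing(ring_split)      # held-out test portion
```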
ring_train
#> # A tibble: 3,132 × 9
#> sex length diameter height whole_weight shucked_we…¹ visce…² shell…³ rings
#> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 male 0.44 0.325 0.08 0.413 0.144 0.102 0.13 8
#> 2 infant 0.42 0.32 0.1 0.34 0.174 0.05 0.0945 8
#> 3 infant 0.355 0.28 0.11 0.224 0.0815 0.0525 0.08 7
#> 4 male 0.175 0.125 0.04 0.024 0.0095 0.006 0.005 4
#> 5 infant 0.535 0.4 0.135 0.775 0.368 0.208 0.206 8
#> 6 infant 0.435 0.335 0.1 0.324 0.135 0.0785 0.098 7
#> 7 infant 0.575 0.435 0.13 0.805 0.316 0.216 0.245 10
#> 8 male 0.455 0.345 0.125 0.44 0.169 0.106 0.135 12
#> 9 infant 0.495 0.4 0.145 0.578 0.254 0.130 0.164 8
#> 10 female 0.57 0.45 0.135 0.780 0.334 0.185 0.21 8
#> # … with 3,122 more rows, and abbreviated variable names ¹shucked_weight,
#> # ²viscera_weight, ³shell_weight
#> # ℹ Use `print(n = ...)` to see more rows
ring_test
#> # A tibble: 1,045 × 9
#> sex length diameter height whole_weight shucked_we…¹ visce…² shell…³ rings
#> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 female 0.53 0.42 0.135 0.677 0.256 0.142 0.21 9
#> 2 infant 0.425 0.3 0.095 0.352 0.141 0.0775 0.12 8
#> 3 female 0.535 0.405 0.145 0.684 0.272 0.171 0.205 10
#> 4 infant 0.38 0.275 0.1 0.226 0.08 0.049 0.085 10
#> 5 female 0.68 0.55 0.175 1.80 0.815 0.392 0.455 19
#> 6 infant 0.24 0.175 0.045 0.07 0.0315 0.0235 0.02 5
#> 7 male 0.47 0.37 0.12 0.580 0.293 0.227 0.14 9
#> 8 female 0.525 0.425 0.16 0.836 0.354 0.214 0.245 9
#> 9 male 0.485 0.36 0.13 0.542 0.260 0.096 0.16 10
#> 10 male 0.445 0.35 0.12 0.442 0.192 0.0955 0.135 8
#> # … with 1,035 more rows, and abbreviated variable names ¹shucked_weight,
#> # ²viscera_weight, ³shell_weight
#> # ℹ Use `print(n = ...)` to see more rows
Split your data so 20% is held out for the test set.
Try out different values in set.seed() to see how the results change.
05:00
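One possible sketch for this exercise (the object name and seed are placeholders; changing the seed changes which rows are held out):

```r
set.seed(2024)  # try a few different values here and compare
ring_split_80 <- initial_split(abalone, prop = 0.8)  # 80% training, 20% testing
training(ring_split_80)
testing(ring_split_80)
```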
Explore the ring_train data on your own!
08:00
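Some illustrative starting points for exploration (not from the original material; any summaries or plots you like will do):

```r
glimpse(ring_train)              # column types and a preview of the values
count(ring_train, sex)           # how many abalones in each sex category
ggplot(ring_train, aes(x = rings)) +
  geom_histogram(bins = 30)      # distribution of the outcome
```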
We can transform our outcome before splitting.
Stratified sampling splits within each quartile of the outcome.
Use strata = rings
Stratification often helps, with very little downside.
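A sketch of a stratified split on the outcome (the seed and proportion are assumptions; with a numeric strata variable, rsample bins it into quartiles by default):

```r
set.seed(123)  # assumed seed
ring_split <- initial_split(abalone, prop = 0.8, strata = rings)
ring_train <- training(ring_split)
ring_test  <- testing(ring_split)
```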