2 - Your data budget

Machine learning with tidymodels

Abalone ages

  • Age of abalone can be determined by cutting the shell and counting the number of rings through a microscope
  • Can other measurements be used to determine age?
  • Data from "The Population Biology of Abalone (Haliotis species) in Tasmania. I. Blacklip Abalone (H. rubra) from the North Coast and the Islands of Bass Strait" by Nash et al. (1994)

library(tidymodels)
library(tidyverse)

abalone <- read_csv("abalone.csv")

Abalone ages

  • N = 4177
  • A numeric outcome, rings
  • Other variables to use for prediction:
    • sex is a nominal predictor
    • shucked_weight and diameter are numeric predictors

Abalone ages

abalone
#> # A tibble: 4,177 × 9
#>    sex    length diameter height whole_weight shucked_we…¹ visce…² shell…³ rings
#>    <chr>   <dbl>    <dbl>  <dbl>        <dbl>        <dbl>   <dbl>   <dbl> <dbl>
#>  1 male    0.455    0.365  0.095        0.514       0.224   0.101    0.15     15
#>  2 male    0.35     0.265  0.09         0.226       0.0995  0.0485   0.07      7
#>  3 female  0.53     0.42   0.135        0.677       0.256   0.142    0.21      9
#>  4 male    0.44     0.365  0.125        0.516       0.216   0.114    0.155    10
#>  5 infant  0.33     0.255  0.08         0.205       0.0895  0.0395   0.055     7
#>  6 infant  0.425    0.3    0.095        0.352       0.141   0.0775   0.12      8
#>  7 female  0.53     0.415  0.15         0.778       0.237   0.142    0.33     20
#>  8 female  0.545    0.425  0.125        0.768       0.294   0.150    0.26     16
#>  9 male    0.475    0.37   0.125        0.509       0.216   0.112    0.165     9
#> 10 female  0.55     0.44   0.15         0.894       0.314   0.151    0.32     19
#> # … with 4,167 more rows, and abbreviated variable names ¹​shucked_weight,
#> #   ²​viscera_weight, ³​shell_weight
#> # ℹ Use `print(n = ...)` to see more rows

Data splitting and spending

For machine learning, we typically split data into training and test sets:

  • The training set is used to estimate model parameters.
  • The test set is used to get an independent assessment of model performance.

Do not 🚫 use the test set during training.

Data splitting and spending

The more data we spend 🤑, the better estimates we’ll get.

Data splitting and spending

  • Spending too much data in training prevents us from computing a good assessment of predictive performance.

  • Spending too much data in testing prevents us from computing a good estimate of model parameters.

Your turn

When is a good time to split your data?

The testing data is precious 💎

Data splitting and spending

set.seed(123)
ring_split <- initial_split(abalone)
ring_split
#> <Training/Testing/Total>
#> <3132/1045/4177>
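
By default, initial_split() spends about three quarters of the rows on training. The prop argument sets that budget explicitly; a minimal sketch (the row counts in the comments are approximate, just prop times the 4,177 rows):

set.seed(123)
initial_split(abalone, prop = 0.5)  # roughly 2088 training / 2089 testing rows
initial_split(abalone, prop = 0.9)  # roughly 3759 training / 418 testing rows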

Accessing the data

ring_train <- training(ring_split)
ring_test <- testing(ring_split)

The training set

ring_train
#> # A tibble: 3,132 × 9
#>    sex    length diameter height whole_weight shucked_we…¹ visce…² shell…³ rings
#>    <chr>   <dbl>    <dbl>  <dbl>        <dbl>        <dbl>   <dbl>   <dbl> <dbl>
#>  1 male    0.44     0.325  0.08         0.413       0.144   0.102   0.13       8
#>  2 infant  0.42     0.32   0.1          0.34        0.174   0.05    0.0945     8
#>  3 infant  0.355    0.28   0.11         0.224       0.0815  0.0525  0.08       7
#>  4 male    0.175    0.125  0.04         0.024       0.0095  0.006   0.005      4
#>  5 infant  0.535    0.4    0.135        0.775       0.368   0.208   0.206      8
#>  6 infant  0.435    0.335  0.1          0.324       0.135   0.0785  0.098      7
#>  7 infant  0.575    0.435  0.13         0.805       0.316   0.216   0.245     10
#>  8 male    0.455    0.345  0.125        0.44        0.169   0.106   0.135     12
#>  9 infant  0.495    0.4    0.145        0.578       0.254   0.130   0.164      8
#> 10 female  0.57     0.45   0.135        0.780       0.334   0.185   0.21       8
#> # … with 3,122 more rows, and abbreviated variable names ¹​shucked_weight,
#> #   ²​viscera_weight, ³​shell_weight
#> # ℹ Use `print(n = ...)` to see more rows

The test set

ring_test
#> # A tibble: 1,045 × 9
#>    sex    length diameter height whole_weight shucked_we…¹ visce…² shell…³ rings
#>    <chr>   <dbl>    <dbl>  <dbl>        <dbl>        <dbl>   <dbl>   <dbl> <dbl>
#>  1 female  0.53     0.42   0.135        0.677       0.256   0.142    0.21      9
#>  2 infant  0.425    0.3    0.095        0.352       0.141   0.0775   0.12      8
#>  3 female  0.535    0.405  0.145        0.684       0.272   0.171    0.205    10
#>  4 infant  0.38     0.275  0.1          0.226       0.08    0.049    0.085    10
#>  5 female  0.68     0.55   0.175        1.80        0.815   0.392    0.455    19
#>  6 infant  0.24     0.175  0.045        0.07        0.0315  0.0235   0.02      5
#>  7 male    0.47     0.37   0.12         0.580       0.293   0.227    0.14      9
#>  8 female  0.525    0.425  0.16         0.836       0.354   0.214    0.245     9
#>  9 male    0.485    0.36   0.13         0.542       0.260   0.096    0.16     10
#> 10 male    0.445    0.35   0.12         0.442       0.192   0.0955   0.135     8
#> # … with 1,035 more rows, and abbreviated variable names ¹​shucked_weight,
#> #   ²​viscera_weight, ³​shell_weight
#> # ℹ Use `print(n = ...)` to see more rows

Your turn

Split your data so 20% is held out for the test set.

Try out different values in set.seed() to see how the results change.

Data splitting and spending

set.seed(123)
ring_split <- initial_split(abalone, prop = 0.8)
ring_train <- training(ring_split)
ring_test <- testing(ring_split)

nrow(ring_train)
#> [1] 3341
nrow(ring_test)
#> [1] 836
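
Changing the seed doesn’t change how many rows go to each set, only which rows. A quick way to see that (ring_split_2 is just an illustrative name):

set.seed(456)
ring_split_2 <- initial_split(abalone, prop = 0.8)

# Same sizes as before...
nrow(training(ring_split_2))  # still 3341

# ...but a different random selection of rows
identical(training(ring_split_2), ring_train)  # almost certainly FALSE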

What about a validation set?
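
One option is a three-way split. Newer versions of rsample provide initial_validation_split() for this; a minimal sketch, with the 60/20/20 proportions chosen just as an example:

set.seed(123)
ring_val_split <- initial_validation_split(abalone, prop = c(0.6, 0.2))
ring_val_split

# Accessors for each piece
training(ring_val_split)
validation(ring_val_split)
testing(ring_val_split)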

Exploratory data analysis for ML 🧐

Your turn

Explore the ring_train data on your own!

  • What’s the distribution of the outcome, rings?
  • What’s the distribution of numeric variables like weight?
  • How do rings differ across sex?

# Distribution of the outcome
ggplot(ring_train, aes(rings)) +
  geom_histogram(bins = 15)

# How rings differ across sex
ggplot(ring_train, aes(rings, sex, fill = sex)) +
  geom_boxplot(alpha = 0.5, show.legend = FALSE)

# Rings vs. shucked weight, colored by shell weight
ring_train %>%
  ggplot(aes(shucked_weight, rings, color = shell_weight)) +
  geom_point(alpha = 0.5) +
  scale_color_viridis_c()

We can transform the outcome before splitting: a fixed transformation like the log doesn’t estimate anything from the data, so it can’t leak information from the test set.
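
For example, if we decided rings made more sense on a log scale, one sketch of applying that before the split (whether to transform at all is a modeling decision, not something these slides settle):

# Transform the outcome first, then split; both sets end up on the same scale
abalone_log <- abalone %>%
  mutate(rings = log10(rings))

set.seed(123)
ring_split_log <- initial_split(abalone_log, prop = 0.8)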

Split smarter 🤓

Stratified sampling splits within each quartile of the outcome, so the training and test sets end up with similar distributions of rings.

Stratification

Use strata = rings

set.seed(123)
ring_split <- initial_split(abalone, prop = 0.8, strata = rings)
ring_train <- training(ring_split)
ring_test <- testing(ring_split)

Stratification often helps, with very little downside
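
One quick way to see the effect is to compare the outcome across the two sets; a small check (not part of the original code):

# With strata = rings, the quartiles of the outcome should look similar
quantile(ring_train$rings)
quantile(ring_test$rings)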