Let’s install some packages


Text as data


cheeses <- read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2024/2024-06-04/cheeses.csv') |>

Text as data

sample(cheeses$flavor, 5)
#> [1] "creamy, sharp, strong" "buttery, milky, sweet" "buttery, tangy"       
#> [4] "mild, nutty, sweet"    "acidic"

Cheese data from https://www.cheese.com/ via Tidy Tuesday

Text as data

What is a typical way to represent this text data for modeling?


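library(tidytext)

dtm <- cheeses |>
    mutate(id = row_number()) |>                # one document id per cheese
    unnest_tokens(word, flavor) |>              # one row per word
    anti_join(get_stopwords(), by = "word") |>  # drop stop words
    count(id, word) |>                          # word counts per cheese
    bind_tf_idf(word, id, n) |>                 # weight the counts by tf-idf
    cast_dfm(id, word, tf_idf)                  # sparse document-feature matrix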

This representation is extremely sparse and high-dimensional: for natural language, the number of features (one per distinct word) can be huge.
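To make the sparsity concrete, here is a minimal base R sketch using the five flavor strings from the sample output above. It assumes the common tf-idf definition (count divided by words in the document, times the natural log of N over document frequency); the names `flavors`, `counts`, and `tf_idf` are illustrative only and are not part of the original code.

```r
# Flavor strings taken from the sample output above, with commas removed
flavors <- c("creamy sharp strong", "buttery milky sweet", "buttery tangy",
             "mild nutty sweet", "acidic")

tokens <- strsplit(flavors, " ", fixed = TRUE)
vocab <- sort(unique(unlist(tokens)))

# Count matrix: one row per document, one column per distinct word
counts <- t(vapply(
    tokens,
    function(x) tabulate(match(x, vocab), nbins = length(vocab)),
    integer(length(vocab))
))
colnames(counts) <- vocab

# Share of cells that are zero
sparsity <- mean(counts == 0)

# tf-idf under the assumed definition: tf = count / words in document,
# idf = ln(number of documents / documents containing the word)
tf <- counts / rowSums(counts)
idf <- log(nrow(counts) / colSums(counts > 0))
tf_idf <- sweep(tf, 2, idf, `*`)

dim(counts)
sparsity
round(tf_idf[3, c("buttery", "tangy")], 3)
```

Even with only 5 documents and 10 distinct words, 38 of the 50 cells are zero. With a real vocabulary the share of zeros grows much larger.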

Word embeddings 📔

You shall know a word by the company it keeps. (J. R. Firth, 1957)

Word embeddings, then and now

  • word2vec
  • GloVe
  • fastText
  • OpenAI
  • Large language models (LLMs) in general!

Cheesy embeddings 🧀


flavor_embeddings <- 
    cheeses |>
    mutate(flavor = str_remove_all(flavor, ",")) |>
    pull(flavor) |>

Cheesy embeddings 🧀

Let’s create an overall embedding for each cheese (using mean()):

tidy_cheeses <-
    cheeses |>
    mutate(cheese_id = row_number()) |>
    unnest_tokens(word, flavor) |>
    left_join(flavor_embeddings, by = c("word" = "tokens")) |>
    group_by(cheese_id, cheese, milk, country, type) |>
    summarize(across(V1:V10, ~ mean(.x, na.rm = TRUE)), .groups = "drop")
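The averaging step can be sketched in base R on a toy example. The vectors below are made-up 2-dimensional values, not trained embeddings, and `toy_embeddings`, `toy_words`, and `mean_embedding()` are hypothetical names used only for illustration. Words without a vector are skipped, which plays the same role as `na.rm = TRUE` after the `left_join()` above.

```r
# Hypothetical 2-dimensional word vectors (made-up values, not trained embeddings)
toy_embeddings <- rbind(
    creamy = c(1, 0),
    sharp  = c(0, 1),
    mild   = c(-1, 0)
)

# Words for two hypothetical cheeses; "smoky" has no vector and is skipped
toy_words <- list(
    cheese_a = c("creamy", "sharp"),
    cheese_b = c("mild", "creamy", "smoky")
)

# Average the vectors of the words that have an embedding
mean_embedding <- function(words, embeddings) {
    known <- words[words %in% rownames(embeddings)]
    colMeans(embeddings[known, , drop = FALSE])
}

# One row per cheese, one column per embedding dimension
toy_cheese_embeddings <- t(sapply(toy_words, mean_embedding,
                                  embeddings = toy_embeddings))
toy_cheese_embeddings
```

Averaging is a simple design choice: it ignores word order and gives every word equal weight, which seems reasonable for short lists of flavor terms.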

Cheesy similarity 🧀

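embeddings_mat <- 
    tidy_cheeses |> 
    select(V1:V10)  |> 
    as.matrix()

# Label rows from tidy_cheeses so the names stay aligned with the matrix rows
row.names(embeddings_mat) <- tidy_cheeses$cheese
# Scale each row to unit length, so the cross product below is cosine similarity
embeddings_similarity <- embeddings_mat / sqrt(rowSums(embeddings_mat * embeddings_mat))
embeddings_similarity <- embeddings_similarity %*% t(embeddings_similarity)

This contains the cosine similarity of each cheese’s flavor embedding with every other cheese’s flavor embedding.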
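To see why normalizing the rows and multiplying by the transpose yields cosine similarity, here is a small sketch on a made-up matrix. The names `toy_mat`, `toy_unit`, and `toy_similarity` are illustrative only.

```r
# Four made-up 2-dimensional vectors
toy_mat <- rbind(
    a = c(1, 0),
    b = c(2, 0),   # same direction as a, different length
    c = c(0, 3),   # at a right angle to a
    d = c(-1, 0)   # opposite direction to a
)

# Divide each row by its Euclidean length
toy_unit <- toy_mat / sqrt(rowSums(toy_mat * toy_mat))

# Dot products of unit vectors are cosines of the angles between them
toy_similarity <- toy_unit %*% t(toy_unit)
toy_similarity["a", c("b", "c", "d")]
```

Here `a` and `b` score 1 (same direction), `a` and `c` score 0 (orthogonal), and `a` and `d` score -1 (opposite), regardless of vector length.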

Cheesy similarity 🧀

Let’s say we are most interested in this particular cheese:

cheeses |> 
  filter(cheese == "Manchego") |> 
  select(cheese, country, flavor)

Cheesy similarity 🧀

enframe(embeddings_similarity["Manchego",], name = "cheese", value = "similarity") |>
  arrange(desc(similarity))  # most similar cheeses first

Cheesy similarity 🧀

cheeses |> 
  filter(cheese %in% c("Beemster Classic", "Butternut", "Loma Alta")) |> 
  select(cheese, country, flavor)

Cheesy similarity 🧀

cheeses |> 
  filter(cheese %in% c("Bayley Hazen Blue", "Alpha Tolman", "Cuor di burrata")) |> 
  select(cheese, country, flavor)

Cheesy similarity 🧀

What about the least similar cheeses to Manchego?

enframe(embeddings_similarity["Manchego",], name = "cheese", value = "similarity") |>
  arrange(similarity)  # least similar cheeses first

Cheesy similarity 🧀

cheeses |> 
  filter(cheese %in% c("Bossa", "St Cera", "Minger")) |> 
  select(cheese, country, flavor)

How do people use word embeddings? 🤔

Fairness and word embeddings

  • Embeddings are trained or learned from a large corpus of text data

  • Human prejudice or bias in the corpus becomes imprinted into the embeddings

Fairness and word embeddings

  • African American first names are associated with more unpleasant feelings than European American first names

  • Women’s first names are more associated with family and men’s first names are more associated with career

  • Terms associated with women are more associated with the arts and terms associated with men are more associated with science

Bias is so ingrained in word embeddings that they can be used to quantify change in social attitudes over time

Biased training data

  • Embeddings are trained or learned from a large corpus of text data

  • For example, consider the case of Wikipedia

  • Wikipedia both reflects social/historical biases and generates bias

Can embeddings be debiased?

  • Embeddings can be reprojected to mitigate a specific bias (such as gender bias) using specific sets of words

  • Training data can be augmented with counterfactuals

  • Other researchers suggest that fairness corrections should occur at the decision stage, rather than in the embeddings themselves

  • Evidence indicates that debiasing still allows stereotypes to seep back in
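As a rough illustration of the reprojection idea in the first bullet, the sketch below removes the component of a word vector that lies along a bias direction defined by a pair of words. The 3-dimensional vectors are made up, and `toy_vectors` and `project_out()` are hypothetical names. Published debiasing methods use many word pairs and further steps, so this shows only the core projection.

```r
# Made-up 3-dimensional vectors (not trained embeddings)
toy_vectors <- rbind(
    he       = c(1, 0.2, 0.1),
    she      = c(-1, 0.2, 0.1),
    engineer = c(0.4, 0.9, 0.3)
)

# Bias direction from one definitional word pair, scaled to unit length
bias_direction <- toy_vectors["he", ] - toy_vectors["she", ]
bias_direction <- bias_direction / sqrt(sum(bias_direction^2))

# Remove the component of a vector that lies along a unit-length direction
project_out <- function(v, direction) {
    v - sum(v * direction) * direction
}

engineer_debiased <- project_out(toy_vectors["engineer", ], bias_direction)

# Component along the bias direction, before and after the projection
sum(toy_vectors["engineer", ] * bias_direction)
sum(engineer_debiased * bias_direction)
```

After the projection, the vector has no component left along that one direction. As the last bullet notes, stereotypes can still be recoverable from the remaining dimensions.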

Word embeddings in the REAL WORLD




