Bind the term frequency and inverse document frequency of a tidy text dataset to the dataset

Calculate and bind the term frequency and inverse document frequency of a tidy text dataset, along with their product, tf-idf, to the dataset. Each of these values is added as a column. This function supports non-standard evaluation through the tidyeval framework.

bind_tf_idf(tbl, term, document, n)

Arguments

tbl: A tidy text dataset with one-row-per-term-per-document
term: Column containing terms as string or symbol
document: Column containing document IDs as string or symbol
n: Column containing document-term counts as string or symbol

Details

The arguments term, document, and n are passed by expression and support quasiquotation; you can unquote strings and symbols.
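As a minimal sketch of what this means in practice, the call below is written twice: once with bare column names and once with column names stored in character variables and unquoted with `!!`. The `toy` tibble and the variables `term_col` and `doc_col` are hypothetical names introduced only for illustration.

```r
library(dplyr)
library(tidytext)

# Hypothetical toy data: one row per document-term combination
toy <- tibble(
  doc  = c("a", "a", "b", "b"),
  word = c("cat", "dog", "cat", "fish"),
  n    = c(2L, 1L, 1L, 3L)
)

# Bare column names (symbols)
by_symbol <- toy |>
  bind_tf_idf(word, doc, n)

# Column names held in variables (illustrative names), unquoted with !!
term_col <- "word"
doc_col  <- "doc"

by_string <- toy |>
  bind_tf_idf(!!term_col, !!doc_col, n)

# Both calls should add the same tf, idf, and tf_idf columns
identical(by_symbol, by_string)
```

Unquoting is mainly useful when the column names are not known until run time, for example inside a wrapper function.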

If the dataset is grouped, the groups are ignored in the calculation but retained in the output.

The dataset must have exactly one row per document-term combination for the results to be meaningful.
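To make the three columns concrete, here is a rough sketch that computes them by hand with dplyr, assuming the usual definitions: tf is a term's count divided by the total count in its document, and idf is the natural log of the number of documents divided by the number of documents containing the term. This is not the package's internal code, and the `toy` tibble is hypothetical. The natural-log assumption appears consistent with the example output on this page, where a word found in one of six books has an idf of about 1.79 (ln 6).

```r
library(dplyr)

# Hypothetical toy counts: one row per document-term combination
toy <- tibble(
  doc  = c("a", "a", "b", "b"),
  word = c("cat", "dog", "cat", "fish"),
  n    = c(2L, 1L, 1L, 3L)
)

# Total number of distinct documents in the dataset
n_docs <- n_distinct(toy$doc)

manual <- toy |>
  # tf: a term's count divided by the total count in its document
  group_by(doc) |>
  mutate(tf = n / sum(n)) |>
  # idf: natural log of (documents overall / documents containing the term)
  group_by(word) |>
  mutate(idf = log(n_docs / n_distinct(doc))) |>
  ungroup() |>
  # tf-idf: the product of the two
  mutate(tf_idf = tf * idf)

manual
```

Under these definitions, a term that appears in every document (here `cat`) gets an idf of zero, so its tf-idf is zero no matter how frequent it is.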

Examples


library(dplyr)
#> 
#> Attaching package: ‘dplyr’
#> The following objects are masked from ‘package:stats’:
#> 
#>     filter, lag
#> The following objects are masked from ‘package:base’:
#> 
#>     intersect, setdiff, setequal, union
library(janeaustenr)
library(tidytext)

book_words <- austen_books() |>
  unnest_tokens(word, text) |>
  count(book, word, sort = TRUE)

book_words
#> # A tibble: 40,378 × 3
#>    book              word      n
#>    <fct>             <chr> <int>
#>  1 Mansfield Park    the    6206
#>  2 Mansfield Park    to     5475
#>  3 Mansfield Park    and    5438
#>  4 Emma              to     5239
#>  5 Emma              the    5201
#>  6 Emma              and    4896
#>  7 Mansfield Park    of     4778
#>  8 Pride & Prejudice the    4331
#>  9 Emma              of     4291
#> 10 Pride & Prejudice to     4162
#> # ℹ 40,368 more rows

# find the words most distinctive to each document
book_words |>
  bind_tf_idf(word, book, n) |>
  arrange(desc(tf_idf))
#> # A tibble: 40,378 × 6
#>    book                word          n      tf   idf  tf_idf
#>    <fct>               <chr>     <int>   <dbl> <dbl>   <dbl>
#>  1 Sense & Sensibility elinor      623 0.00519  1.79 0.00931
#>  2 Sense & Sensibility marianne    492 0.00410  1.79 0.00735
#>  3 Mansfield Park      crawford    493 0.00307  1.79 0.00550
#>  4 Pride & Prejudice   darcy       373 0.00305  1.79 0.00547
#>  5 Persuasion          elliot      254 0.00304  1.79 0.00544
#>  6 Emma                emma        786 0.00488  1.10 0.00536
#>  7 Northanger Abbey    tilney      196 0.00252  1.79 0.00452
#>  8 Emma                weston      389 0.00242  1.79 0.00433
#>  9 Pride & Prejudice   bennet      294 0.00241  1.79 0.00431
#> 10 Persuasion          wentworth   191 0.00228  1.79 0.00409
#> # ℹ 40,368 more rows