R/bind_tf_idf.R
bind_tf_idf.Rd
Calculate and bind the term frequency and inverse document frequency of a tidy text dataset, along with their product, tf-idf, to the dataset. Each of these values is added as a column. This function supports non-standard evaluation through the tidyeval framework.
bind_tf_idf(tbl, term, document, n)
tbl: A tidy text dataset with one row per term per document
term: Column containing terms, as a string or symbol
document: Column containing document IDs, as a string or symbol
n: Column containing document-term counts, as a string or symbol
The arguments term, document, and n are passed by expression and support quasiquotation; you can unquote strings and symbols.
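As a minimal sketch of what unquoting looks like in practice, the snippet below passes column names held in character variables using `!!`. The data frame `counts` and the variables `term_col`, `doc_col`, and `n_col` are hypothetical names made up for this illustration.

```r
library(tidytext)

# Hypothetical toy counts: one row per document-term pair
counts <- data.frame(
  book = c("a", "a", "b"),
  word = c("x", "y", "x"),
  n    = c(3, 1, 2),
  stringsAsFactors = FALSE
)

# Column names held as strings, e.g. received as function arguments
term_col <- "word"
doc_col  <- "book"
n_col    <- "n"

# Unquote the strings with !! so the names they hold select the columns
by_string <- bind_tf_idf(counts, !!term_col, !!doc_col, !!n_col)

# Bare column names refer to the same columns
by_name <- bind_tf_idf(counts, word, book, n)
```

This pattern is mainly useful inside wrapper functions, where the column names arrive as arguments rather than being typed directly.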
If the dataset is grouped, the groups are ignored in the calculation but retained in the output.
The dataset must have exactly one row per document-term combination for this to work.
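One way to check this requirement before calling the function is sketched below in base R. The data frame `counts` and its columns are hypothetical; the idea is to detect repeated document-term pairs and, if any exist, sum their counts so that each pair occupies a single row.

```r
# Hypothetical counts in which the pair ("a", "x") appears twice
counts <- data.frame(
  book = c("a", "a", "a", "b"),
  word = c("x", "x", "y", "x"),
  n    = c(2, 1, 1, 4),
  stringsAsFactors = FALSE
)

# anyDuplicated() returns 0 when every (book, word) pair is unique
has_duplicates <- anyDuplicated(counts[c("book", "word")]) > 0

# If pairs repeat, sum their counts so each pair has a single row
if (has_duplicates) {
  counts <- aggregate(n ~ book + word, data = counts, FUN = sum)
}
```

Building the counts with `dplyr::count()`, as in the examples, produces one row per combination directly, so this check matters mostly for data assembled by other means.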
library(dplyr)
#>
#> Attaching package: ‘dplyr’
#> The following objects are masked from ‘package:stats’:
#>
#> filter, lag
#> The following objects are masked from ‘package:base’:
#>
#> intersect, setdiff, setequal, union
library(janeaustenr)
library(tidytext)  # provides unnest_tokens() and bind_tf_idf()
book_words <- austen_books() %>%
  unnest_tokens(word, text) %>%
  count(book, word, sort = TRUE)
book_words
#> # A tibble: 40,378 × 3
#>    book              word      n
#>    <fct>             <chr> <int>
#>  1 Mansfield Park    the    6206
#>  2 Mansfield Park    to     5475
#>  3 Mansfield Park    and    5438
#>  4 Emma              to     5239
#>  5 Emma              the    5201
#>  6 Emma              and    4896
#>  7 Mansfield Park    of     4778
#>  8 Pride & Prejudice the    4331
#>  9 Emma              of     4291
#> 10 Pride & Prejudice to     4162
#> # ℹ 40,368 more rows
# find the words most distinctive to each document
book_words %>%
  bind_tf_idf(word, book, n) %>%
  arrange(desc(tf_idf))
#> # A tibble: 40,378 × 6
#>    book                word          n      tf   idf  tf_idf
#>    <fct>               <chr>     <int>   <dbl> <dbl>   <dbl>
#>  1 Sense & Sensibility elinor      623 0.00519  1.79 0.00931
#>  2 Sense & Sensibility marianne    492 0.00410  1.79 0.00735
#>  3 Mansfield Park      crawford    493 0.00307  1.79 0.00550
#>  4 Pride & Prejudice   darcy       373 0.00305  1.79 0.00547
#>  5 Persuasion          elliot      254 0.00304  1.79 0.00544
#>  6 Emma                emma        786 0.00488  1.10 0.00536
#>  7 Northanger Abbey    tilney      196 0.00252  1.79 0.00452
#>  8 Emma                weston      389 0.00242  1.79 0.00433
#>  9 Pride & Prejudice   bennet      294 0.00241  1.79 0.00431
#> 10 Persuasion          wentworth   191 0.00228  1.79 0.00409
#> # ℹ 40,368 more rows
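
To make the three added columns concrete, the arithmetic can be reproduced by hand. The sketch below uses base R on a hypothetical toy data frame (`counts`, with made-up columns `doc`, `term`, and `n`). It assumes that tf is a term's share of its document's total count and that idf uses the natural logarithm, which is consistent with the output above: with six books, a word found in only one book gets log(6/1) ≈ 1.79, and 1.10 matches log(6/2), a word found in two books. It illustrates the calculation and is not the package's source code.

```r
# Hypothetical toy counts: one row per document-term pair
counts <- data.frame(
  doc  = c("a", "a", "b", "b", "c"),
  term = c("x", "y", "x", "z", "x"),
  n    = c(2, 2, 1, 3, 4),
  stringsAsFactors = FALSE
)

# tf: a term's count divided by the total count of its document
doc_totals <- tapply(counts$n, counts$doc, sum)
counts$tf <- counts$n / as.numeric(doc_totals[counts$doc])

# idf: natural log of (number of documents / documents containing the term)
n_docs <- length(doc_totals)
docs_per_term <- table(counts$term)
counts$idf <- as.numeric(log(n_docs / docs_per_term[counts$term]))

# tf-idf: the product of the two
counts$tf_idf <- counts$tf * counts$idf
counts
```

In this toy data the term `x` occurs in all three documents, so its idf and tf-idf are zero, while `y` and `z` each occur in a single document and receive an idf of log(3).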