Text Mining

USING TIDY DATA PRINCIPLES

Julia Silge

Hello!

@juliasilge

@juliasilge@fosstodon.org

youtube.com/juliasilge

juliasilge.com

Join our workspace on Posit Cloud

bit.ly/join-tidytext-tutorial




Alternatively, install the packages yourself to work locally:

install.packages(c("tidyverse", 
                   "tidytext",
                   "stopwords",
                   "gutenbergr",
                   "widyr",
                   "tidygraph",
                   "tidylo",
                   "ggraph"))

Text in the real world

  • Text data is increasingly important 📚

  • NLP training is thin on the ground 😱

TIDY DATA PRINCIPLES + TEXT MINING = 🎉

GitHub repo for workshop:

github.com/juliasilge/asepelt-2024

Plan for this workshop

  • EDA for text

  • Modeling for text

What do we mean by tidy text?

text <- c("Dice la tarde: '¡Tengo sed de sombra!'",
          "Dice la luna: '¡Yo, sed de luceros!'",
          "La fuente cristalina pide labios",
          "y suspira el viento.")

text
#> [1] "Dice la tarde: '¡Tengo sed de sombra!'"
#> [2] "Dice la luna: '¡Yo, sed de luceros!'"  
#> [3] "La fuente cristalina pide labios"      
#> [4] "y suspira el viento."

What do we mean by tidy text?

library(tidyverse)

text_df <- tibble(line = 1:4, text = text)

text_df
#> # A tibble: 4 × 2
#>    line text                                  
#>   <int> <chr>                                 
#> 1     1 Dice la tarde: '¡Tengo sed de sombra!'
#> 2     2 Dice la luna: '¡Yo, sed de luceros!'  
#> 3     3 La fuente cristalina pide labios      
#> 4     4 y suspira el viento.

What do we mean by tidy text?

library(tidytext)

text_df |>
    unnest_tokens(word, text)
#> # A tibble: 23 × 2
#>     line word  
#>    <int> <chr> 
#>  1     1 dice  
#>  2     1 la    
#>  3     1 tarde 
#>  4     1 tengo 
#>  5     1 sed   
#>  6     1 de    
#>  7     1 sombra
#>  8     2 dice  
#>  9     2 la    
#> 10     2 luna  
#> # ℹ 13 more rows

Gathering more data

You can access the full text of many public domain works from Project Gutenberg using the gutenbergr package.

library(gutenbergr)

# Set my_mirror to a Project Gutenberg mirror URL first,
# e.g. my_mirror <- "http://mirrors.xmission.com/gutenberg/"
full_text <- gutenberg_download(2000, mirror = my_mirror)

What book do you want to analyze today? 📖🥳📖

Time to tidy your text!

tidy_book <- full_text |>
    mutate(line = row_number()) |>
    unnest_tokens(word, text)         

glimpse(tidy_book)
#> Rows: 383,636
#> Columns: 3
#> $ gutenberg_id <int> 2000, 2000, 2000, 2000, 2000, 2000, 2000, 2000, 2000, 200…
#> $ line         <int> 1, 1, 1, 1, 1, 1, 1, 1, 5, 5, 5, 5, 5, 11, 11, 11, 11, 11…
#> $ word         <chr> "el", "ingenioso", "hidalgo", "don", "quijote", "de", "la…

What are the most common words?

What do you predict will happen if we run the following code? 🤔

tidy_book |>
    count(word, sort = TRUE)

What are the most common words?

What do you predict will happen if we run the following code? 🤔

tidy_book |>
    count(word, sort = TRUE)
#> # A tibble: 22,951 × 2
#>    word      n
#>    <chr> <int>
#>  1 que   20769
#>  2 de    18410
#>  3 y     18272
#>  4 la    10492
#>  5 a      9875
#>  6 en     8285
#>  7 el     8265
#>  8 no     6346
#>  9 los    4769
#> 10 se     4752
#> # ℹ 22,941 more rows

STOP WORDS 🛑

Stop words

get_stopwords()
#> # A tibble: 175 × 2
#>    word      lexicon 
#>    <chr>     <chr>   
#>  1 i         snowball
#>  2 me        snowball
#>  3 my        snowball
#>  4 myself    snowball
#>  5 we        snowball
#>  6 our       snowball
#>  7 ours      snowball
#>  8 ourselves snowball
#>  9 you       snowball
#> 10 your      snowball
#> # ℹ 165 more rows

Stop words

get_stopwords(language = "es")
#> # A tibble: 308 × 2
#>    word  lexicon 
#>    <chr> <chr>   
#>  1 de    snowball
#>  2 la    snowball
#>  3 que   snowball
#>  4 el    snowball
#>  5 en    snowball
#>  6 y     snowball
#>  7 a     snowball
#>  8 los   snowball
#>  9 del   snowball
#> 10 se    snowball
#> # ℹ 298 more rows

Stop words

get_stopwords(source = "smart")
#> # A tibble: 571 × 2
#>    word        lexicon
#>    <chr>       <chr>  
#>  1 a           smart  
#>  2 a's         smart  
#>  3 able        smart  
#>  4 about       smart  
#>  5 above       smart  
#>  6 according   smart  
#>  7 accordingly smart  
#>  8 across      smart  
#>  9 actually    smart  
#> 10 after       smart  
#> # ℹ 561 more rows

What are the most common words?

U N S C R A M B L E

anti_join(get_stopwords(language = "es")) |>

tidy_book |>

count(word, sort = TRUE) |>

geom_col()

slice_max(n, n = 20) |>

ggplot(aes(n, fct_reorder(word, n))) + 

What are the most common words?

tidy_book |>
    anti_join(get_stopwords(language = "es")) |>
    count(word, sort = TRUE) |>
    slice_max(n, n = 20) |>
    ggplot(aes(n, fct_reorder(word, n))) +  
    geom_col()

WHAT IS A DOCUMENT ABOUT? 🤔

What is a document about?

  • Term frequency
  • Inverse document frequency

\[idf(\text{term}) = \ln{\left(\frac{n_{\text{documents}}}{n_{\text{documents containing term}}}\right)}\]

Tip

tf-idf is about comparing documents within a collection.
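
A quick worked example: with a collection of three documents, like the one built next, the idf formula can only take three values, and common words that appear in every document get an idf (and therefore a tf-idf) of zero.

n_documents <- 3
log(n_documents / 3)   # term appearing in all three documents: idf = 0
log(n_documents / 2)   # term appearing in two documents: idf ≈ 0.405
log(n_documents / 1)   # term appearing in only one document: idf ≈ 1.10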

Understanding tf-idf

Make a collection (corpus) for yourself! 💅

full_collection <-
  gutenberg_download(
    c(2000, 49836, 56451),
    meta_fields = "title",
    mirror = my_mirror
  )

Understanding tf-idf

Make a collection (corpus) for yourself! 💅

full_collection
#> # A tibble: 55,657 × 3
#>    gutenberg_id text                                            title      
#>           <int> <chr>                                           <chr>      
#>  1         2000 "El ingenioso hidalgo don Quijote de la Mancha" Don Quijote
#>  2         2000 ""                                              Don Quijote
#>  3         2000 ""                                              Don Quijote
#>  4         2000 ""                                              Don Quijote
#>  5         2000 "por Miguel de Cervantes Saavedra"              Don Quijote
#>  6         2000 ""                                              Don Quijote
#>  7         2000 ""                                              Don Quijote
#>  8         2000 ""                                              Don Quijote
#>  9         2000 ""                                              Don Quijote
#> 10         2000 ""                                              Don Quijote
#> # ℹ 55,647 more rows

Counting word frequencies

book_words <- full_collection |>
    unnest_tokens(word, text) |>
    count(title, word, sort = TRUE)

book_words  
#> # A tibble: 44,626 × 3
#>    title       word      n
#>    <chr>       <chr> <int>
#>  1 Don Quijote que   20769
#>  2 Don Quijote de    18410
#>  3 Don Quijote y     18272
#>  4 Don Quijote la    10492
#>  5 Don Quijote a      9875
#>  6 Don Quijote en     8285
#>  7 Don Quijote el     8265
#>  8 Don Quijote no     6346
#>  9 Don Quijote los    4769
#> 10 Don Quijote se     4752
#> # ℹ 44,616 more rows

Tip

What do the columns of book_words tell us?

Calculating tf-idf

book_tf_idf <- book_words |>
    bind_tf_idf(word, title, n)  

Calculating tf-idf

book_tf_idf
#> # A tibble: 44,626 × 6
#>    title       word      n     tf   idf tf_idf
#>    <chr>       <chr> <int>  <dbl> <dbl>  <dbl>
#>  1 Don Quijote que   20769 0.0541     0      0
#>  2 Don Quijote de    18410 0.0480     0      0
#>  3 Don Quijote y     18272 0.0476     0      0
#>  4 Don Quijote la    10492 0.0273     0      0
#>  5 Don Quijote a      9875 0.0257     0      0
#>  6 Don Quijote en     8285 0.0216     0      0
#>  7 Don Quijote el     8265 0.0215     0      0
#>  8 Don Quijote no     6346 0.0165     0      0
#>  9 Don Quijote los    4769 0.0124     0      0
#> 10 Don Quijote se     4752 0.0124     0      0
#> # ℹ 44,616 more rows

That’s… super exciting??? 🥴

Calculating tf-idf

What do you predict will happen if we run the following code? 🤔

book_tf_idf |>
    arrange(-tf_idf)

Calculating tf-idf

What do you predict will happen if we run the following code? 🤔

book_tf_idf |>
    arrange(-tf_idf)
#> # A tibble: 44,626 × 6
#>    title                                  word         n       tf   idf   tf_idf
#>    <chr>                                  <chr>    <int>    <dbl> <dbl>    <dbl>
#>  1 "Niebla (Nivola)"                      eugenia    231 0.00403  1.10  0.00442 
#>  2 "Niebla (Nivola)"                      usted      376 0.00655  0.405 0.00266 
#>  3 "El Payador, Vol. I\nHijo de la Pampa" gaucho     124 0.00165  1.10  0.00182 
#>  4 "Niebla (Nivola)"                      liduvina    69 0.00120  1.10  0.00132 
#>  5 "El Payador, Vol. I\nHijo de la Pampa" poema       89 0.00119  1.10  0.00130 
#>  6 "Niebla (Nivola)"                      señorito    57 0.000993 1.10  0.00109 
#>  7 "Don Quijote"                          panza      352 0.000918 1.10  0.00101 
#>  8 "Niebla (Nivola)"                      víctor      51 0.000889 1.10  0.000976
#>  9 "Niebla (Nivola)"                      mauricio    45 0.000784 1.10  0.000861
#> 10 "Don Quijote"                          dulcinea   286 0.000745 1.10  0.000819
#> # ℹ 44,616 more rows

Calculating tf-idf

U N S C R A M B L E

group_by(title) |>

book_tf_idf |>

slice_max(tf_idf, n = 10) |>

ggplot(aes(tf_idf, fct_reorder(word, tf_idf), fill = title)) +

facet_wrap(vars(title), scales = "free")

geom_col(show.legend = FALSE) +

Calculating tf-idf

book_tf_idf |>
    group_by(title) |>
    slice_max(tf_idf, n = 10) |>
    ggplot(aes(tf_idf, fct_reorder(word, tf_idf), fill = title)) +
    geom_col(show.legend = FALSE) +
    facet_wrap(vars(title), scales = "free")

WHAT IS A DOCUMENT ABOUT? 🤔

What is a document about?

  • Term frequency
  • Inverse document frequency



Weighted log odds ⚖️

  • The log odds ratio compares how likely a word is in one group of text versus another (see the sketch below)
  • Weighting helps deal with the power law distribution of word frequencies
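
As a rough sketch of the idea (the tidylo package implements the weighted version from Monroe, Colaresi, and Quinn, 2008), the log odds ratio for word w in document i compared with the rest of the collection is

\[\delta_{w}^{(i)} = \log{\left(\frac{y_{w}^{(i)}}{n^{(i)} - y_{w}^{(i)}}\right)} - \log{\left(\frac{y_{w}^{(\text{rest})}}{n^{(\text{rest})} - y_{w}^{(\text{rest})}}\right)}\]

where y is the count of word w in that group of text and n is the group's total word count. The weighted version shrinks the counts toward a prior and scales the ratio by an estimate of its standard error, which is what bind_log_odds() reports as log_odds_weighted.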

Weighted log odds ⚖️

library(tidylo)
book_words |>
    bind_log_odds(title, word, n) |>
    arrange(-log_odds_weighted)
#> # A tibble: 44,626 × 4
#>    title                                  word         n log_odds_weighted
#>    <chr>                                  <chr>    <int>             <dbl>
#>  1 "Niebla (Nivola)"                      eugenia    231              20.7
#>  2 "Niebla (Nivola)"                      usted      376              20.2
#>  3 "Niebla (Nivola)"                      augusto    372              14.8
#>  4 "El Payador, Vol. I\nHijo de la Pampa" la        3895              14.7
#>  5 "Don Quijote"                          panza      352              14.3
#>  6 "El Payador, Vol. I\nHijo de la Pampa" gaucho     124              14.3
#>  7 "Don Quijote"                          dulcinea   286              12.9
#>  8 "Niebla (Nivola)"                      no        1541              12.5
#>  9 "El Payador, Vol. I\nHijo de la Pampa" poema       89              12.1
#> 10 "Don Quijote"                          escudero   249              12.1
#> # ℹ 44,616 more rows

Tip

Unlike tf-idf, weighted log odds can distinguish between words that are used in all the texts.
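
A minimal sketch of that contrast: a common word like "no" occurs in all three books, so bind_tf_idf() gives it an idf (and tf-idf) of zero in every book, while bind_log_odds() above still ranks it highly for Niebla.

book_tf_idf |>
    filter(word == "no")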

N-GRAMS… AND BEYOND! 🚀

N-grams… and beyond! 🚀

full_text <- gutenberg_download(2000, mirror = my_mirror)

tidy_ngram <- full_text |>
    unnest_tokens(bigram, text, token = "ngrams", n = 2) |> 
    filter(!is.na(bigram))

N-grams… and beyond! 🚀

tidy_ngram
#> # A tibble: 351,706 × 2
#>    gutenberg_id bigram           
#>           <int> <chr>            
#>  1         2000 el ingenioso     
#>  2         2000 ingenioso hidalgo
#>  3         2000 hidalgo don      
#>  4         2000 don quijote      
#>  5         2000 quijote de       
#>  6         2000 de la            
#>  7         2000 la mancha        
#>  8         2000 por miguel       
#>  9         2000 miguel de        
#> 10         2000 de cervantes     
#> # ℹ 351,696 more rows

N-grams… and beyond! 🚀

tidy_ngram |>
    count(bigram, sort = TRUE)
#> # A tibble: 140,666 × 2
#>    bigram          n
#>    <chr>       <int>
#>  1 de la        2092
#>  2 don quijote  2061
#>  3 lo que       1506
#>  4 que no       1238
#>  5 de los        941
#>  6 en la         924
#>  7 en el         920
#>  8 a la          887
#>  9 de su         880
#> 10 que se        870
#> # ℹ 140,656 more rows

Tip

Can we use an anti_join() now to remove the stop words?

N-grams… and beyond! 🚀

stop_words_es <- get_stopwords("es")

bigram_counts <- tidy_ngram |>
    separate(bigram, c("word1", "word2"), sep = " ") |>
    filter(!word1 %in% stop_words_es$word,
           !word2 %in% stop_words_es$word) |>
    count(word1, word2, sort = TRUE)

N-grams… and beyond! 🚀

bigram_counts
#> # A tibble: 41,129 × 3
#>    word1     word2        n
#>    <chr>     <chr>    <int>
#>  1 don       quijote   2061
#>  2 respondió sancho     303
#>  3 dijo      don        296
#>  4 sancho    panza      281
#>  5 respondió don        265
#>  6 dijo      sancho     236
#>  7 vuesa     merced     180
#>  8 señor     don        176
#>  9 don       fernando   120
#> 10 caballero andante    104
#> # ℹ 41,119 more rows

What can you do with n-grams?

  • tf-idf of n-grams (see the sketch after this list)
  • weighted log odds of n-grams
  • network analysis
  • negation
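
For instance, a minimal sketch of the first idea: compute tf-idf on bigrams instead of single words, reusing the three-book full_collection from earlier.

bigram_tf_idf <- full_collection |>
    unnest_tokens(bigram, text, token = "ngrams", n = 2) |>
    filter(!is.na(bigram)) |>
    count(title, bigram, sort = TRUE) |>
    bind_tf_idf(bigram, title, n) |>
    arrange(-tf_idf)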

Network analysis

library(widyr)
library(ggraph)
library(tidygraph)

bigram_graph <- bigram_counts |>
    filter(n > 20) |>
    as_tbl_graph() 

Network analysis

bigram_graph
#> # A tbl_graph: 86 nodes and 73 edges
#> #
#> # A directed simple graph with 21 components
#> #
#> # Node Data: 86 × 1 (active)
#>    name      
#>    <chr>     
#>  1 don       
#>  2 respondió 
#>  3 dijo      
#>  4 sancho    
#>  5 vuesa     
#>  6 señor     
#>  7 caballero 
#>  8 caballeros
#>  9 merced    
#> 10 señora    
#> # ℹ 76 more rows
#> #
#> # Edge Data: 73 × 3
#>    from    to     n
#>   <int> <int> <int>
#> 1     1    29  2061
#> 2     2     4   303
#> 3     3     1   296
#> # ℹ 70 more rows

Network analysis

bigram_graph |>
    ggraph(layout = "kk") +
    geom_edge_link(aes(edge_alpha = n)) + 
    geom_node_text(aes(label = name)) +  
    theme_graph() 

Network analysis

bigram_graph |>
    ggraph(layout = "kk") +
    geom_edge_link(aes(edge_alpha = n), 
                   show.legend = FALSE, 
                   arrow = arrow(length = unit(1.5, 'mm')), 
                   start_cap = circle(3, 'mm'),
                   end_cap = circle(3, 'mm')) +
    geom_node_text(aes(label = name)) + 
    theme_graph()

Thanks!

@juliasilge

@juliasilge@fosstodon.org

youtube.com/juliasilge

juliasilge.com