Topic modeling
Topic modeling is a method for unsupervised classification of documents, by modeling each document as a mixture of topics and each topic as a mixture of words. Latent Dirichlet allocation is a particularly popular method for fitting a topic model.
We can use tidy text principles, as described in the main vignette, to approach topic modeling using consistent and effective tools. In particular, we’ll be using tidying functions for LDA objects from the topicmodels package.
Suppose a vandal has broken into your study and torn apart four of your books:

- Great Expectations by Charles Dickens
- The War of the Worlds by H.G. Wells
- Twenty Thousand Leagues under the Sea by Jules Verne
- Pride and Prejudice by Jane Austen
This vandal has torn the books into individual chapters, and left them in one large pile. How can we restore these disorganized chapters to their original books?
titles <- c("Twenty Thousand Leagues under the Sea", "The War of the Worlds",
"Pride and Prejudice", "Great Expectations")
books <- gutenberg_works(title %in% titles) %>%
gutenberg_download(meta_fields = "title")
books
## # A tibble: 51,663 × 3
## gutenberg_id text title
## <int> <chr> <chr>
## 1 36 "The War of the Worlds" The War of the Worlds
## 2 36 "" The War of the Worlds
## 3 36 "by H. G. Wells [1898]" The War of the Worlds
## 4 36 "" The War of the Worlds
## 5 36 "" The War of the Worlds
## 6 36 " But who shall dwell in these worlds if they be" The War of the Worlds
## 7 36 " inhabited? . . . Are we or they Lords of the" The War of the Worlds
## 8 36 " World? . . . And how are all things made for man?--" The War of the Worlds
## 9 36 " KEPLER (quoted in The Anatomy of Melancholy)" The War of the Worlds
## 10 36 "" The War of the Worlds
## # ℹ 51,653 more rows
As pre-processing, we divide these into chapters, use tidytext's unnest_tokens() to separate them into words, then remove stop_words. We're treating every chapter as a separate "document", each with a name like Great Expectations_1 or Pride and Prejudice_11.
library(tidytext)
library(stringr)
library(tidyr)
# Number chapters by counting lines that start with "chapter"
by_chapter <- books %>%
  group_by(title) %>%
  mutate(chapter = cumsum(str_detect(text, regex("^chapter ", ignore_case = TRUE)))) %>%
  ungroup() %>%
  filter(chapter > 0)

# One row per word, with documents named like "Great Expectations_1"
by_chapter_word <- by_chapter %>%
  unite(title_chapter, title, chapter) %>%
  unnest_tokens(word, text)

# Count words per chapter, dropping stop words
word_counts <- by_chapter_word %>%
  anti_join(stop_words) %>%
  count(title_chapter, word, sort = TRUE)
word_counts
## # A tibble: 104,704 × 3
## title_chapter word n
## <chr> <chr> <int>
## 1 Great Expectations_57 joe 88
## 2 Great Expectations_7 joe 70
## 3 Great Expectations_17 biddy 63
## 4 Great Expectations_27 joe 58
## 5 Great Expectations_38 estella 58
## 6 Great Expectations_2 joe 56
## 7 Great Expectations_23 pocket 53
## 8 Great Expectations_15 joe 50
## 9 Great Expectations_18 joe 50
## 10 The War of the Worlds_16 brother 50
## # ℹ 104,694 more rows
Right now this data frame is in a tidy form, with one-term-per-document-per-row. However, the topicmodels package requires a DocumentTermMatrix (from the tm package). As described in this vignette, we can cast a one-token-per-row table into a DocumentTermMatrix with tidytext's cast_dtm():
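chapters_dtm <- word_counts %>%
  cast_dtm(title_chapter, word, n)

chapters_dtm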
## <<DocumentTermMatrix (documents: 193, terms: 18202)>>
## Non-/sparse entries: 104704/3408282
## Sparsity : 97%
## Maximal term length: 19
## Weighting : term frequency (tf)
Now we are ready to use the topicmodels package to create a four-topic LDA model.
library(topicmodels)
chapters_lda <- LDA(chapters_dtm, k = 4, control = list(seed = 1234))
chapters_lda
## A LDA_VEM topic model with 4 topics.
(In this case we know there are four topics because there are four books; in practice we may need to try a few different values of k.)
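As a rough sketch, one heuristic for choosing k is to fit models at several candidate values and compare their perplexity with topicmodels' perplexity(), where lower values are generally better:

# Sketch: compare perplexity across a few candidate values of k
# (this refits the model several times, so it can be slow)
ks <- c(2, 4, 8)
sapply(ks, function(k) {
  model <- LDA(chapters_dtm, k = k, control = list(seed = 1234))
  perplexity(model)
})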
Now tidytext gives us the option of returning to a tidy analysis, using the tidy and augment verbs borrowed from the broom package. In particular, we start with the tidy verb.
chapters_lda_td <- tidy(chapters_lda)
chapters_lda_td
## # A tibble: 72,808 × 3
## topic term beta
## <int> <chr> <dbl>
## 1 1 joe 1.41e- 16
## 2 2 joe 5.13e- 54
## 3 3 joe 1.40e- 2
## 4 4 joe 2.75e- 39
## 5 1 biddy 3.96e- 22
## 6 2 biddy 5.72e- 62
## 7 3 biddy 4.63e- 3
## 8 4 biddy 3.24e- 47
## 9 1 estella 4.19e- 18
## 10 2 estella 2.09e-136
## # ℹ 72,798 more rows
Notice that this has turned the model into a one-topic-per-term-per-row format. For each combination, the model estimates \(\beta\), the probability of that term being generated from that topic.
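Since \(\beta\) is a probability distribution over terms within each topic, a quick sanity check (a sketch) is that the betas sum to 1 within each topic:

# Sketch: within each topic, the beta values should sum to (approximately) 1
chapters_lda_td %>%
  group_by(topic) %>%
  summarize(total = sum(beta))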
We could use dplyr's slice_max() to find the top 5 terms within each topic:
top_terms <- chapters_lda_td %>%
  group_by(topic) %>%
  slice_max(beta, n = 5) %>%
  ungroup() %>%
  arrange(topic, -beta)
top_terms
## # A tibble: 20 × 3
## topic term beta
## <int> <chr> <dbl>
## 1 1 people 0.00629
## 2 1 martians 0.00590
## 3 1 time 0.00550
## 4 1 black 0.00501
## 5 1 night 0.00464
## 6 2 captain 0.0154
## 7 2 nautilus 0.0131
## 8 2 sea 0.00883
## 9 2 nemo 0.00876
## 10 2 ned 0.00808
## 11 3 joe 0.0140
## 12 3 miss 0.00757
## 13 3 time 0.00675
## 14 3 pip 0.00661
## 15 3 looked 0.00618
## 16 4 elizabeth 0.0155
## 17 4 darcy 0.00971
## 18 4 bennet 0.00765
## 19 4 miss 0.00764
## 20 4 jane 0.00714
This model lends itself to a visualization:
library(ggplot2)
theme_set(theme_bw())
top_terms %>%
  mutate(term = reorder_within(term, beta, topic)) %>%
  ggplot(aes(term, beta)) +
  geom_col() +
  scale_x_reordered() +
  facet_wrap(vars(topic), scales = "free_x")
These topics are pretty clearly associated with the four books! There's no question that the topic of "nemo", "sea", and "nautilus" belongs to Twenty Thousand Leagues Under the Sea, and that "jane", "darcy", and "elizabeth" belong to Pride and Prejudice. We see "pip" and "joe" from Great Expectations and "martians", "black", and "night" from The War of the Worlds.
Each chapter was a “document” in this analysis. Thus, we may want to know which topics are associated with each document. Can we put the chapters back together in the correct books?
chapters_lda_gamma <- tidy(chapters_lda, matrix = "gamma")
chapters_lda_gamma
## # A tibble: 772 × 3
## document topic gamma
## <chr> <int> <dbl>
## 1 Great Expectations_57 1 0.0000134
## 2 Great Expectations_7 1 0.0000146
## 3 Great Expectations_17 1 0.0000210
## 4 Great Expectations_27 1 0.0000190
## 5 Great Expectations_38 1 0.0000127
## 6 Great Expectations_2 1 0.0000171
## 7 Great Expectations_23 1 0.290
## 8 Great Expectations_15 1 0.0000143
## 9 Great Expectations_18 1 0.0000126
## 10 The War of the Worlds_16 1 1.00
## # ℹ 762 more rows
Setting matrix = "gamma" returns a tidied version with one-document-per-topic-per-row; gamma gives the estimated proportion of each document that was generated from each topic.
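As a quick sanity check (a sketch), each document's gamma values should sum to 1:

# Sketch: per-document topic proportions sum to (approximately) 1
chapters_lda_gamma %>%
  group_by(document) %>%
  summarize(total = sum(gamma))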
Now that we have these document classifications, we can see how well our unsupervised learning did at distinguishing the four books. First we re-separate the document name into title and chapter:
chapters_lda_gamma <- chapters_lda_gamma %>%
  separate(document, c("title", "chapter"), sep = "_", convert = TRUE)
chapters_lda_gamma
## # A tibble: 772 × 4
## title chapter topic gamma
## <chr> <int> <int> <dbl>
## 1 Great Expectations 57 1 0.0000134
## 2 Great Expectations 7 1 0.0000146
## 3 Great Expectations 17 1 0.0000210
## 4 Great Expectations 27 1 0.0000190
## 5 Great Expectations 38 1 0.0000127
## 6 Great Expectations 2 1 0.0000171
## 7 Great Expectations 23 1 0.290
## 8 Great Expectations 15 1 0.0000143
## 9 Great Expectations 18 1 0.0000126
## 10 The War of the Worlds 16 1 1.00
## # ℹ 762 more rows
Then we can examine how the per-chapter gamma values are distributed within each book:
ggplot(chapters_lda_gamma, aes(gamma, fill = factor(topic))) +
  geom_histogram() +
  facet_wrap(vars(title), nrow = 2)
We notice that almost all of the chapters from Pride and Prejudice, War of the Worlds, and Twenty Thousand Leagues Under the Sea were uniquely identified as a single topic each, while the chapters of Great Expectations were more mixed. Were any chapters most associated with the wrong book? We can find each chapter's "classification" by taking the topic with the highest gamma:
chapter_classifications <- chapters_lda_gamma %>%
  group_by(title, chapter) %>%
  slice_max(gamma, n = 1) %>%
  ungroup() %>%
  arrange(gamma)
chapter_classifications
## # A tibble: 193 × 4
## title chapter topic gamma
## <chr> <int> <int> <dbl>
## 1 Great Expectations 56 3 0.499
## 2 Great Expectations 23 3 0.553
## 3 Great Expectations 3 3 0.579
## 4 Great Expectations 21 3 0.588
## 5 Great Expectations 1 3 0.597
## 6 Great Expectations 46 3 0.606
## 7 Great Expectations 55 3 0.618
## 8 Great Expectations 5 3 0.647
## 9 Great Expectations 20 3 0.647
## 10 Great Expectations 32 3 0.686
## # ℹ 183 more rows
To determine which chapters were misidentified, we first find the consensus book for each topic, the title that appears most often among that topic's chapters; the result matches what we saw in our earlier visualization:
book_topics <- chapter_classifications %>%
  count(title, topic) %>%
  group_by(topic) %>%
  slice_max(n, n = 1) %>%
  ungroup() %>%
  transmute(consensus = title, topic)
book_topics
## # A tibble: 4 × 2
## consensus topic
## <chr> <int>
## 1 The War of the Worlds 1
## 2 Twenty Thousand Leagues under the Sea 2
## 3 Great Expectations 3
## 4 Pride and Prejudice 4
Then we see which chapters were misidentified:
chapter_classifications %>%
  inner_join(book_topics, by = "topic") %>%
  count(title, consensus)
## # A tibble: 5 × 3
## title consensus n
## <chr> <chr> <int>
## 1 Great Expectations Great Expectations 58
## 2 Great Expectations The War of the Worlds 1
## 3 Pride and Prejudice Pride and Prejudice 61
## 4 The War of the Worlds The War of the Worlds 27
## 5 Twenty Thousand Leagues under the Sea Twenty Thousand Leagues under the Sea 46
We see that only a few chapters from Great Expectations were misclassified. Not bad for unsupervised clustering!
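For example, we could list those misclassified chapters directly (a quick sketch):

# Sketch: chapters whose most likely topic belongs to a different book
chapter_classifications %>%
  inner_join(book_topics, by = "topic") %>%
  filter(title != consensus)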
By word assignments: augment
One important step in the topic modeling expectation-maximization algorithm is assigning each word in each document to a topic. Generally, the more words in a document are assigned to a topic, the more weight (gamma) will go on that document-topic classification.
We may want to take the original document-word pairs and find which words in each document were assigned to which topic. This is the job of the augment verb.
assignments <- augment(chapters_lda, data = chapters_dtm)
We can combine this with the consensus book titles to find which words were incorrectly classified.
assignments <- assignments %>%
  separate(document, c("title", "chapter"), sep = "_", convert = TRUE) %>%
  inner_join(book_topics, join_by(.topic == topic))
assignments
## # A tibble: 104,704 × 6
## title chapter term count .topic consensus
## <chr> <int> <chr> <dbl> <dbl> <chr>
## 1 Great Expectations 57 joe 88 3 Great Expectations
## 2 Great Expectations 7 joe 70 3 Great Expectations
## 3 Great Expectations 17 joe 5 3 Great Expectations
## 4 Great Expectations 27 joe 58 3 Great Expectations
## 5 Great Expectations 2 joe 56 3 Great Expectations
## 6 Great Expectations 23 joe 1 3 Great Expectations
## 7 Great Expectations 15 joe 50 3 Great Expectations
## 8 Great Expectations 18 joe 50 3 Great Expectations
## 9 Great Expectations 9 joe 44 3 Great Expectations
## 10 Great Expectations 13 joe 40 3 Great Expectations
## # ℹ 104,694 more rows
We can, for example, create a "confusion matrix" using dplyr's count() and tidyr's pivot_wider():
assignments %>%
  count(title, consensus, wt = count) %>%
  pivot_wider(names_from = consensus, values_from = n, values_fill = 0)
## # A tibble: 4 × 5
## title `Great Expectations` `Pride and Prejudice`
## <chr> <dbl> <dbl>
## 1 Great Expectations 50635 733
## 2 Pride and Prejudice 0 37242
## 3 The War of the Worlds 0 0
## 4 Twenty Thousand Leagues under the Sea 3 1
## `The War of the Worlds` `Twenty Thousand Leagues under the Sea`
## <dbl> <dbl>
## 1 4193 3
## 2 0 0
## 3 22559 9
## 4 0 39615
We notice that almost all the words for Pride and Prejudice, Twenty Thousand Leagues Under the Sea, and War of the Worlds were correctly assigned, while Great Expectations had a fair amount of misassignment.
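To quantify this, we could compute the overall fraction of words assigned to their consensus book (a quick sketch):

# Sketch: share of all word occurrences assigned to the correct book
assignments %>%
  summarize(accuracy = sum(count[title == consensus]) / sum(count))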
What were the most commonly mistaken words?
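First we gather the words whose assigned (consensus) book differs from the book they actually came from:

wrong_words <- assignments %>%
  filter(title != consensus)

wrong_words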
## # A tibble: 3,641 × 6
## title chapter term count .topic
## <chr> <int> <chr> <dbl> <dbl>
## 1 Great Expectations 20 brother 1 1
## 2 Great Expectations 37 brother 2 4
## 3 Great Expectations 22 brother 4 4
## 4 Twenty Thousand Leagues under the Sea 8 miss 1 3
## 5 Great Expectations 5 sergeant 37 1
## 6 Great Expectations 46 captain 1 1
## 7 Great Expectations 32 captain 1 1
## 8 The War of the Worlds 17 captain 5 2
## 9 Great Expectations 54 sea 2 1
## 10 Great Expectations 1 sea 2 1
## consensus
## <chr>
## 1 The War of the Worlds
## 2 Pride and Prejudice
## 3 Pride and Prejudice
## 4 Great Expectations
## 5 The War of the Worlds
## 6 The War of the Worlds
## 7 The War of the Worlds
## 8 Twenty Thousand Leagues under the Sea
## 9 The War of the Worlds
## 10 The War of the Worlds
## # ℹ 3,631 more rows
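Then we count the most common misassigned words:

wrong_words %>%
  count(title, consensus, term, wt = count, sort = TRUE)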
## # A tibble: 2,820 × 4
## title consensus term n
## <chr> <chr> <chr> <dbl>
## 1 Great Expectations The War of the Worlds boat 39
## 2 Great Expectations The War of the Worlds sergeant 37
## 3 Great Expectations The War of the Worlds river 34
## 4 Great Expectations The War of the Worlds jack 28
## 5 Great Expectations The War of the Worlds tide 28
## 6 Great Expectations The War of the Worlds water 25
## 7 Great Expectations The War of the Worlds black 19
## 8 Great Expectations The War of the Worlds soldiers 19
## 9 Great Expectations The War of the Worlds london 18
## 10 Great Expectations The War of the Worlds people 18
## # ℹ 2,810 more rows
Notice the word "flopson", which shows up further down this list of misassigned words; such words do not necessarily appear in the novels they were misassigned to. Indeed, we can confirm "flopson" appears only in Great Expectations:
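word_counts %>%
  filter(word == "flopson")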
## # A tibble: 3 × 3
## title_chapter word n
## <chr> <chr> <int>
## 1 Great Expectations_22 flopson 10
## 2 Great Expectations_23 flopson 7
## 3 Great Expectations_33 flopson 1
The algorithm is stochastic and iterative, and it can accidentally land on a topic that spans multiple books.
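If you are curious, you could refit the model with a different seed (a sketch) and compare the recovered topics:

# Sketch: a different random seed may recover a different set of topics
chapters_lda2 <- LDA(chapters_dtm, k = 4, control = list(seed = 42))
tidy(chapters_lda2) %>%
  group_by(topic) %>%
  slice_max(beta, n = 5)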