Tidy the results of a Latent Dirichlet Allocation or Correlated Topic Model.
# S3 method for LDA
tidy(x, matrix = c("beta", "gamma"), log = FALSE, ...)
# S3 method for CTM
tidy(x, matrix = c("beta", "gamma"), log = FALSE, ...)
# S3 method for LDA
augment(x, data, ...)
# S3 method for CTM
augment(x, data, ...)
# S3 method for LDA
glance(x, ...)
# S3 method for CTM
glance(x, ...)
x: An LDA or CTM (or LDA_VEM/CTM_VEM) object from the topicmodels package.
matrix: Whether to tidy the beta (per-term-per-topic, default) or gamma (per-document-per-topic) matrix.
log: Whether beta/gamma should be on a log scale; default FALSE.
...: Extra arguments, not used.
data: For augment, the data given to the LDA or CTM function, either as a DocumentTermMatrix or as a tidied table with "document" and "term" columns.
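As a rough illustration of the log argument, the sketch below fits a small model and compares the two scales. It assumes the tidytext and topicmodels packages are installed; the object names (ap, lda, beta_prob, beta_log) are illustrative only, and passing seed through the control list is just one way to make the fit reproducible.

```r
library(topicmodels)
library(tidytext)

data("AssociatedPress", package = "topicmodels")
ap <- AssociatedPress[1:100, ]

# Small model; the seed is set via the control list (an illustrative choice)
lda <- LDA(ap, k = 4, control = list(alpha = 0.1, seed = 2016))

# Same beta matrix on the probability scale and on the log scale
beta_prob <- tidy(lda, matrix = "beta")
beta_log <- tidy(lda, matrix = "beta", log = TRUE)

# Exponentiating the log-scale values should recover the probabilities
all.equal(exp(beta_log$beta), beta_prob$beta)
```

The log scale can be convenient when comparing very small per-term probabilities, which are hard to tell apart on the probability scale.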
tidy returns a tidied version of either the beta or gamma matrix.
If matrix == "beta" (default), returns a table with one row per topic and term, with columns:
topic: Topic, as an integer
term: Term
beta: Probability of a term generated from a topic according to the multinomial model
If matrix == "gamma", returns a table with one row per topic and document, with columns:
topic: Topic, as an integer
document: Document name or ID
gamma: Probability of topic given document
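The two tidied shapes can be sanity-checked with a short sketch like the one below. It assumes tidytext and topicmodels are installed, and the object names are illustrative. Because beta holds per-topic term probabilities and gamma holds per-document topic probabilities, the sums within each topic and within each document should both be close to 1.

```r
library(topicmodels)
library(tidytext)

data("AssociatedPress", package = "topicmodels")
ap <- AssociatedPress[1:100, ]
lda <- LDA(ap, k = 4, control = list(alpha = 0.1, seed = 2016))

td_beta <- tidy(lda, matrix = "beta")    # one row per topic and term
td_gamma <- tidy(lda, matrix = "gamma")  # one row per topic and document

# Sum of term probabilities within each topic
beta_sums <- tapply(td_beta$beta, td_beta$topic, sum)

# Sum of topic probabilities within each document
gamma_sums <- tapply(td_gamma$gamma, td_gamma$document, sum)

range(beta_sums)
range(gamma_sums)
```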
augment returns a table with one row per original document-term pair, such as is returned by tdm_tidiers:
document: Name of document (if present), or index
term: Term
.topic: Topic assignment
If the data argument is provided, any columns in the original data are included, combined based on the document and term columns.
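A minimal sketch of augment, assuming tidytext and topicmodels are installed: the DocumentTermMatrix used to fit the model is passed as data, and each document-term pair comes back with a topic assignment. The object names are illustrative.

```r
library(topicmodels)
library(tidytext)

data("AssociatedPress", package = "topicmodels")
ap <- AssociatedPress[1:100, ]
lda <- LDA(ap, k = 4, control = list(alpha = 0.1, seed = 2016))

# Pass the same DocumentTermMatrix that was used to fit the model
aug <- augment(lda, data = ap)
head(aug)

# How many document-term pairs were assigned to each topic
table(aug$.topic)
```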
glance always returns a one-row table, with columns:
iter: Number of iterations used
terms: Number of terms in the model
alpha: If an LDA_VEM, the parameter of the Dirichlet distribution for topics over documents
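A brief sketch of glance on a model fitted with the default VEM method, assuming tidytext and topicmodels are installed. The object names are illustrative; the alpha column is expected here only because LDA() defaults to an LDA_VEM fit.

```r
library(topicmodels)
library(tidytext)

data("AssociatedPress", package = "topicmodels")
ap <- AssociatedPress[1:100, ]

# LDA() uses the VEM method by default, so the result is an LDA_VEM object
lda <- LDA(ap, k = 4, control = list(alpha = 0.1, seed = 2016))

# One-row summary of the fitted model
gl <- glance(lda)
gl
```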
if (requireNamespace("topicmodels", quietly = TRUE)) {
  set.seed(2016)
  library(dplyr)
  library(tidytext)
  library(topicmodels)

  data("AssociatedPress", package = "topicmodels")
  ap <- AssociatedPress[1:100, ]
  lda <- LDA(ap, control = list(alpha = 0.1), k = 4)

  # get term distribution within each topic
  td_lda <- tidy(lda)
  td_lda

  library(ggplot2)

  # visualize the top terms within each topic
  td_lda_filtered <- td_lda %>%
    filter(beta > .004) %>%
    mutate(term = reorder(term, beta))

  ggplot(td_lda_filtered, aes(term, beta)) +
    geom_bar(stat = "identity") +
    facet_wrap(~ topic, scales = "free") +
    theme(axis.text.x = element_text(angle = 90, size = 15))

  # get classification of each document
  td_lda_docs <- tidy(lda, matrix = "gamma")
  td_lda_docs

  # keep the most probable topic for each document
  doc_classes <- td_lda_docs %>%
    group_by(document) %>%
    top_n(1) %>%
    ungroup()
  doc_classes

  # which were we most uncertain about?
  doc_classes %>%
    arrange(gamma)
}
#> Selecting by gamma
#> # A tibble: 100 × 3
#>    document topic gamma
#>       <int> <int> <dbl>
#>  1       56     4 0.506
#>  2       71     4 0.571
#>  3        5     1 0.603
#>  4       67     2 0.655
#>  5       72     4 0.718
#>  6       28     4 0.787
#>  7       18     3 0.906
#>  8       93     2 0.907
#>  9       33     3 0.935
#> 10       76     2 0.989
#> # ℹ 90 more rows