Tidiers for LDA and CTM objects from the topicmodels package

Tidy the results of a Latent Dirichlet Allocation or Correlated Topic Model.

# S3 method for class 'LDA'
tidy(x, matrix = c("beta", "gamma"), log = FALSE, ...)

# S3 method for class 'CTM'
tidy(x, matrix = c("beta", "gamma"), log = FALSE, ...)

# S3 method for class 'LDA'
augment(x, data, ...)

# S3 method for class 'CTM'
augment(x, data, ...)

# S3 method for class 'LDA'
glance(x, ...)

# S3 method for class 'CTM'
glance(x, ...)

Arguments

x: An LDA or CTM (or LDA_VEM/CTM_VEM) object from the topicmodels package
matrix: Whether to tidy the beta (per-term-per-topic, default) or gamma (per-document-per-topic) matrix
log: Whether beta/gamma should be on a log scale, default FALSE
...: Extra arguments, not used
data: For augment, the data given to the LDA or CTM function, either as a DocumentTermMatrix or as a tidied table with "document" and "term" columns
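
As an illustration of the matrix and log arguments, the following sketch tidies the same beta matrix on both scales and checks that they agree. It assumes the topicmodels and tidytext packages are installed and reuses the AssociatedPress subset from the Examples section; the names beta_prob and beta_log are chosen here for illustration only.

```r
if (requireNamespace("topicmodels", quietly = TRUE)) {
  library(tidytext)
  library(topicmodels)

  set.seed(2016)
  data("AssociatedPress", package = "topicmodels")
  ap <- AssociatedPress[1:100, ]
  lda <- LDA(ap, k = 4, control = list(alpha = 0.1))

  # per-term-per-topic probabilities (the default matrix and scale)
  beta_prob <- tidy(lda, matrix = "beta")

  # the same table with beta on a log scale
  beta_log <- tidy(lda, matrix = "beta", log = TRUE)

  # exponentiating the log-scale column should recover the probabilities
  print(all.equal(exp(beta_log$beta), beta_prob$beta))
}
```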

Value

tidy returns a tidied version of either the beta or gamma matrix.

If matrix == "beta" (default), returns a table with one row per topic and term, with columns

topic: Topic, as an integer
term: Term
beta: Probability of the term being generated from the topic, according to the multinomial model

If matrix == "gamma", returns a table with one row per topic and document, with columns

topic: Topic, as an integer
document: Document name or ID
gamma: Probability of topic given document
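
Because beta is a distribution over terms within a topic and gamma is a distribution over topics within a document, each should sum to roughly 1 within its group. The sketch below checks this on a small model; it assumes topicmodels, tidytext, and dplyr are installed, and beta_totals and gamma_totals are illustrative names.

```r
if (requireNamespace("topicmodels", quietly = TRUE)) {
  library(dplyr)
  library(tidytext)
  library(topicmodels)

  set.seed(2016)
  data("AssociatedPress", package = "topicmodels")
  ap <- AssociatedPress[1:100, ]
  lda <- LDA(ap, k = 4, control = list(alpha = 0.1))

  # beta: total probability over all terms within each topic
  beta_totals <- tidy(lda, matrix = "beta") |>
    group_by(topic) |>
    summarise(total = sum(beta))

  # gamma: total probability over all topics within each document
  gamma_totals <- tidy(lda, matrix = "gamma") |>
    group_by(document) |>
    summarise(total = sum(gamma))

  print(beta_totals)
  print(gamma_totals)
}
```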

augment returns a table with one row per original document-term pair, in the same format as that returned by tdm_tidiers, with columns

document: Name of document (if present), or index
term: Term
.topic: Topic assignment

If the data argument is provided, any columns in the original data are included, combined based on the document and term columns.
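
A minimal sketch of augment, assuming topicmodels, tidytext, and dplyr are installed and that the model was fit with the default VEM method. It passes the original DocumentTermMatrix as data and then counts how many document-term pairs were assigned to each topic; the name assignments is illustrative.

```r
if (requireNamespace("topicmodels", quietly = TRUE)) {
  library(dplyr)
  library(tidytext)
  library(topicmodels)

  set.seed(2016)
  data("AssociatedPress", package = "topicmodels")
  ap <- AssociatedPress[1:100, ]
  lda <- LDA(ap, k = 4, control = list(alpha = 0.1))

  # one row per document-term pair, with the assigned topic in .topic
  assignments <- augment(lda, data = ap)
  print(assignments)

  # number of document-term pairs assigned to each topic
  print(count(assignments, .topic))
}
```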

glance always returns a one-row table, with columns

iter: Number of iterations used
terms: Number of terms in the model
alpha: If an LDA_VEM, the parameter of the Dirichlet distribution for topics over documents
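
A short sketch of glance on a VEM-fitted model, under the same assumptions as the other examples (topicmodels and tidytext installed, AssociatedPress subset as input). The name g is illustrative.

```r
if (requireNamespace("topicmodels", quietly = TRUE)) {
  library(tidytext)
  library(topicmodels)

  set.seed(2016)
  data("AssociatedPress", package = "topicmodels")
  ap <- AssociatedPress[1:100, ]
  lda <- LDA(ap, k = 4, control = list(alpha = 0.1))

  # one-row summary of the fitted model
  g <- glance(lda)
  print(g)
}
```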

Examples


if (requireNamespace("topicmodels", quietly = TRUE)) {
  set.seed(2016)
  library(dplyr)
  library(tidytext)
  library(topicmodels)

  data("AssociatedPress", package = "topicmodels")
  ap <- AssociatedPress[1:100, ]
  lda <- LDA(ap, control = list(alpha = 0.1), k = 4)

  # get term distribution within each topic
  td_lda <- tidy(lda)
  td_lda

  library(ggplot2)

  # visualize the top terms within each topic
  td_lda_filtered <- td_lda |>
    filter(beta > .004) |>
    mutate(term = reorder(term, beta))

  ggplot(td_lda_filtered, aes(term, beta)) +
    geom_col() +
    facet_wrap(~ topic, scales = "free") +
    theme(axis.text.x = element_text(angle = 90, size = 15))

  # get classification of each document
  td_lda_docs <- tidy(lda, matrix = "gamma")
  td_lda_docs

  doc_classes <- td_lda_docs |>
    group_by(document) |>
    top_n(1) |>
    ungroup()

  doc_classes

  # which were we most uncertain about?
  doc_classes |>
    arrange(gamma)
}
#> Selecting by gamma
#> # A tibble: 100 × 3
#>    document topic gamma
#>       <int> <int> <dbl>
#>  1       42     1 0.683
#>  2       54     3 0.736
#>  3       69     3 0.785
#>  4        4     1 0.836
#>  5       76     4 0.888
#>  6       63     4 0.888
#>  7       87     2 0.931
#>  8        9     3 0.992
#>  9       23     2 0.992
#> 10       22     4 0.994
#> # ℹ 90 more rows