Tidy the results of a Latent Dirichlet Allocation or Correlated Topic Model.

# S3 method for LDA
tidy(x, matrix = c("beta", "gamma"), log = FALSE, ...)

# S3 method for CTM
tidy(x, matrix = c("beta", "gamma"), log = FALSE, ...)

# S3 method for LDA
augment(x, data, ...)

# S3 method for CTM
augment(x, data, ...)

# S3 method for LDA
glance(x, ...)

# S3 method for CTM
glance(x, ...)

Arguments

x

An LDA or CTM (or LDA_VEM/CTM_VEM) object from the topicmodels package

matrix

Whether to tidy the beta (per-term-per-topic, default) or gamma (per-document-per-topic) matrix

log

Whether beta/gamma should be on a log scale, default FALSE

...

Extra arguments, not used

data

For augment, the data given to the LDA or CTM function, either as a DocumentTermMatrix or as a tidied table with "document" and "term" columns
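The matrix and log arguments can be illustrated with a small sketch. It assumes the tidytext and topicmodels packages are installed; the model name lda_small, the 100-document subset, and k = 4 are arbitrary choices for illustration. It checks that the log-scale beta column is the logarithm of the default one.

```r
if (requireNamespace("topicmodels", quietly = TRUE)) {
  library(tidytext)
  library(topicmodels)

  data("AssociatedPress", package = "topicmodels")
  set.seed(2016)

  # small, arbitrary model used only for illustration
  lda_small <- LDA(AssociatedPress[1:100, ], control = list(alpha = 0.1), k = 4)

  beta_prob <- tidy(lda_small, matrix = "beta")              # default: probabilities
  beta_log  <- tidy(lda_small, matrix = "beta", log = TRUE)  # same rows, log scale

  # exponentiating the log-scale column should recover the default column
  print(all.equal(exp(beta_log$beta), beta_prob$beta))

  # the gamma matrix accepts the same log argument
  gamma_log <- tidy(lda_small, matrix = "gamma", log = TRUE)
  print(head(gamma_log))
}
```

The log scale is mainly useful when comparing very small per-term probabilities, where differences on the probability scale are hard to see.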

Value

tidy returns a tidied version of either the beta or gamma matrix.

If matrix == "beta" (default), returns a table with one row per topic and term, with columns

topic

Topic, as an integer

term

Term

beta

Probability of the term being generated from the topic, according to the multinomial model

If matrix == "gamma", returns a table with one row per topic and document, with columns

topic

Topic, as an integer

document

Document name or ID

gamma

Probability of topic given document
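Because beta and gamma are probabilities, beta should sum to approximately 1 within each topic and gamma should sum to approximately 1 within each document. The following sketch checks this on a small model; it assumes dplyr, tidytext, and topicmodels are installed, and the names lda_small, beta_totals, and gamma_totals are hypothetical.

```r
if (requireNamespace("topicmodels", quietly = TRUE)) {
  library(dplyr)
  library(tidytext)
  library(topicmodels)

  data("AssociatedPress", package = "topicmodels")
  set.seed(2016)

  # small, arbitrary model used only for illustration
  lda_small <- LDA(AssociatedPress[1:100, ], control = list(alpha = 0.1), k = 4)

  # each topic is a probability distribution over terms
  beta_totals <- tidy(lda_small, matrix = "beta") %>%
    group_by(topic) %>%
    summarize(total = sum(beta))

  # each document is a probability distribution over topics
  gamma_totals <- tidy(lda_small, matrix = "gamma") %>%
    group_by(document) %>%
    summarize(total = sum(gamma))

  print(beta_totals)
  print(gamma_totals)
}
```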

augment returns a table with one row per original document-term pair, in the same format as returned by tdm_tidiers:

document

Name of document (if present), or index

term

Term

.topic

Topic assignment

If the data argument is provided, any columns in the original data are included, joined on the document and term columns.
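The Examples section does not exercise augment, so here is a minimal sketch of both call forms. It assumes dplyr, tidytext, and topicmodels are installed; lda_small, assignments, and assignments_full are hypothetical names, and the count column is assumed to come from tidying the DocumentTermMatrix.

```r
if (requireNamespace("topicmodels", quietly = TRUE)) {
  library(dplyr)
  library(tidytext)
  library(topicmodels)

  data("AssociatedPress", package = "topicmodels")
  set.seed(2016)

  ap <- AssociatedPress[1:100, ]
  # small, arbitrary model used only for illustration
  lda_small <- LDA(ap, control = list(alpha = 0.1), k = 4)

  # without data: one row per document-term pair with its assigned topic
  assignments <- augment(lda_small)
  print(assignments)

  # with data: the columns of the tidied input are kept alongside .topic
  assignments_full <- augment(lda_small, data = ap)
  print(assignments_full)

  # number of distinct document-term pairs assigned to each topic
  print(count(assignments, .topic))
}
```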

glance always returns a one-row table, with columns

iter

Number of iterations used

terms

Number of terms in the model

alpha

For an LDA_VEM model only, the parameter of the Dirichlet distribution for topics over documents
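A short sketch of glance on a fitted model follows. It assumes tidytext and topicmodels are installed, and that LDA() with its default settings returns an LDA_VEM object, so the alpha column should be present; lda_small and model_summary are hypothetical names.

```r
if (requireNamespace("topicmodels", quietly = TRUE)) {
  library(tidytext)
  library(topicmodels)

  data("AssociatedPress", package = "topicmodels")
  set.seed(2016)

  # small, arbitrary model used only for illustration
  lda_small <- LDA(AssociatedPress[1:100, ], control = list(alpha = 0.1), k = 4)

  # one-row summary of the fitted model
  model_summary <- glance(lda_small)
  print(model_summary)
}
```

A one-row summary like this is convenient when several models (for example, with different k) are fitted and their summaries are stacked with dplyr::bind_rows().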

Examples


if (requireNamespace("topicmodels", quietly = TRUE)) {
  set.seed(2016)
  library(dplyr)
  library(tidytext)
  library(topicmodels)

  data("AssociatedPress", package = "topicmodels")
  ap <- AssociatedPress[1:100, ]
  lda <- LDA(ap, control = list(alpha = 0.1), k = 4)

  # get term distribution within each topic
  td_lda <- tidy(lda)
  td_lda

  library(ggplot2)

  # visualize the top terms within each topic
  td_lda_filtered <- td_lda %>%
    filter(beta > .004) %>%
    mutate(term = reorder(term, beta))

  ggplot(td_lda_filtered, aes(term, beta)) +
    geom_col() +
    facet_wrap(~ topic, scales = "free") +
    theme(axis.text.x = element_text(angle = 90, size = 15))

  # get classification of each document
  td_lda_docs <- tidy(lda, matrix = "gamma")
  td_lda_docs

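  # keep the most probable topic for each document; top_n(1) ranks by the
  # last column (gamma), which is why "Selecting by gamma" is printed
  doc_classes <- td_lda_docs %>%
    group_by(document) %>%
    top_n(1) %>%
    ungroup()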

  doc_classes

  # which were we most uncertain about?
  doc_classes %>%
    arrange(gamma)
}
#> Selecting by gamma
#> # A tibble: 100 × 3
#>    document topic gamma
#>       <int> <int> <dbl>
#>  1       56     4 0.506
#>  2       71     4 0.571
#>  3        5     1 0.603
#>  4       67     2 0.655
#>  5       72     4 0.718
#>  6       28     4 0.787
#>  7       18     3 0.906
#>  8       93     2 0.907
#>  9       33     3 0.935
#> 10       76     2 0.989
#> # ℹ 90 more rows
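As an alternative to the fixed beta > .004 cutoff used in the plot above, the top terms can be selected per topic. This is a sketch of one possible workflow, not part of the tidiers themselves; it assumes dplyr (1.0.0 or later, for slice_max), tidytext, and topicmodels are installed, and n = 5 is an arbitrary choice.

```r
if (requireNamespace("topicmodels", quietly = TRUE)) {
  library(dplyr)
  library(tidytext)
  library(topicmodels)

  data("AssociatedPress", package = "topicmodels")
  set.seed(2016)

  # small, arbitrary model used only for illustration
  lda_small <- LDA(AssociatedPress[1:100, ], control = list(alpha = 0.1), k = 4)

  # five highest-probability terms in each topic
  top_terms <- tidy(lda_small) %>%
    group_by(topic) %>%
    slice_max(beta, n = 5, with_ties = FALSE) %>%
    ungroup() %>%
    arrange(topic, desc(beta))

  print(top_terms)
}
```

Selecting a fixed number of terms per topic keeps the facets of a plot the same size, whereas a probability cutoff can leave some topics with many more terms than others.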