Tidy a Corpus object from the tm package. Returns a data frame with one row per document, with a text column containing the document's text and one column for each local (per-document) metadata tag. For corpus objects from the quanteda package, see tidy.corpus().

# S3 method for Corpus
tidy(x, collapse = "\n", ...)

Arguments

x

A Corpus object, such as a VCorpus or PCorpus

collapse

A string used to collapse the text within each document (if a document has multiple lines). Give NULL to not collapse strings, in which case the text column will end up as a list column if there are multi-line documents.

...

Extra arguments, not used
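
The effect of collapse can be pictured with base R alone. The following is a minimal sketch, not the package's implementation: docs is a hypothetical stand-in for a corpus in which each document is a character vector of lines, and paste() plays the role of the collapsing step.

```r
# Hypothetical stand-in for a corpus: each element is one document,
# stored as a character vector with one entry per line.
docs <- list(
  doc1 = c("first line", "second line"),
  doc2 = "only line"
)

# Analogue of collapse = "\n": join each document's lines into a single
# string, giving an ordinary character vector (one value per document).
collapsed <- vapply(docs, paste, character(1), collapse = "\n")

# Analogue of collapse = "": lines joined with no separator.
squashed <- vapply(docs, paste, character(1), collapse = "")

# Analogue of collapse = NULL: the lines stay separate, so the result
# remains a list with one character vector per document.
uncollapsed <- docs

is.character(collapsed)  # TRUE
is.list(uncollapsed)     # TRUE
lengths(uncollapsed)     # doc1 has 2 lines, doc2 has 1
```

With a string separator every document reduces to one value, so the text fits in a plain character column. Without collapsing, a multi-line document holds several values, which is why the text then has to be stored as a list column.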

Examples


library(dplyr)   # displaying tbl_dfs

if (requireNamespace("tm", quietly = TRUE)) {
  library(tm)
  # tm package examples
  txt <- system.file("texts", "txt", package = "tm")
  ovid <- VCorpus(DirSource(txt, encoding = "UTF-8"),
                  readerControl = list(language = "lat"))

  ovid
  tidy(ovid)

  # choose different options for collapsing text within each
  # document
  tidy(ovid, collapse = "")$text
  tidy(ovid, collapse = NULL)$text

  # another example from Reuters articles
  reut21578 <- system.file("texts", "crude", package = "tm")
  reuters <- VCorpus(DirSource(reut21578),
                     readerControl = list(reader = readReut21578XMLasPlain))
  reuters

  tidy(reuters)
}
#> Loading required package: NLP
#> 
#> Attaching package: ‘NLP’
#> The following object is masked from ‘package:ggplot2’:
#> 
#>     annotate
#> # A tibble: 20 × 17
#>    author        datetimestamp       descr…¹ heading id    langu…² origin topics
#>    <chr>         <dttm>              <chr>   <chr>   <chr> <chr>   <chr>  <chr> 
#>  1 NA            1987-02-26 17:00:56 ""      DIAMON… 127   en      Reute… YES   
#>  2 BY TED D'AFF… 1987-02-26 17:34:11 ""      OPEC M… 144   en      Reute… YES   
#>  3 NA            1987-02-26 18:18:00 ""      TEXACO… 191   en      Reute… YES   
#>  4 NA            1987-02-26 18:21:01 ""      MARATH… 194   en      Reute… YES   
#>  5 NA            1987-02-26 19:00:57 ""      HOUSTO… 211   en      Reute… YES   
#>  6 NA            1987-03-01 03:25:46 ""      KUWAIT… 236   en      Reute… YES   
#>  7 By Jeremy Cl… 1987-03-01 03:39:14 ""      INDONE… 237   en      Reute… YES   
#>  8 NA            1987-03-01 05:27:27 ""      SAUDI … 242   en      Reute… YES   
#>  9 NA            1987-03-01 08:22:30 ""      QATAR … 246   en      Reute… YES   
#> 10 NA            1987-03-01 18:31:44 ""      SAUDI … 248   en      Reute… YES   
#> 11 NA            1987-03-02 01:05:49 ""      SAUDI … 273   en      Reute… YES   
#> 12 NA            1987-03-02 07:39:23 ""      GULF A… 349   en      Reute… YES   
#> 13 NA            1987-03-02 07:43:22 ""      SAUDI … 352   en      Reute… YES   
#> 14 NA            1987-03-02 07:43:41 ""      KUWAIT… 353   en      Reute… YES   
#> 15 NA            1987-03-02 08:25:42 ""      PHILAD… 368   en      Reute… YES   
#> 16 NA            1987-03-02 11:20:05 ""      STUDY … 489   en      Reute… YES   
#> 17 NA            1987-03-02 11:28:26 ""      STUDY … 502   en      Reute… YES   
#> 18 NA            1987-03-02 12:13:46 ""      UNOCAL… 543   en      Reute… YES   
#> 19 By BERNICE N… 1987-03-02 14:38:34 ""      NYMEX … 704   en      Reute… YES   
#> 20 NA            1987-03-02 14:49:06 ""      ARGENT… 708   en      Reute… YES   
#> # … with 9 more variables: lewissplit <chr>, cgisplit <chr>, oldid <chr>,
#> #   topics_cat <named list>, places <named list>, people <chr>, orgs <chr>,
#> #   exchanges <chr>, text <chr>, and abbreviated variable names ¹​description,
#> #   ²​language