Tidy a Corpus object from the tm package. Returns a data frame with one row per document, with a text column containing the document's text and one column for each local (per-document) metadata tag. For corpus objects from the quanteda package, see tidy.corpus().
# S3 method for Corpus
tidy(x, collapse = "\n", ...)
x: A Corpus object, such as a VCorpus or PCorpus.
collapse: A string used to collapse the text within each document (if a document has multiple lines). Give NULL to not collapse strings, in which case the text column will end up as a list column if there are multi-line documents.
...: Extra arguments, not used.
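The effect of the collapse argument can be illustrated with base R alone. The sketch below is not the package's implementation; it assumes the collapsing behaves like paste() applied to a document's lines, and the names doc_lines and docs are hypothetical stand-ins rather than anything taken from tm.

```r
# Hypothetical two-line document (placeholder text, not taken from tm)
doc_lines <- c("first line", "second line")

# Analogous to collapse = "\n": lines joined into one string with newlines
joined_newline <- paste(doc_lines, collapse = "\n")

# Analogous to collapse = "": lines joined with no separator
joined_empty <- paste(doc_lines, collapse = "")

# Analogous to collapse = NULL: the lines stay a character vector of length 2
kept <- paste(doc_lines, collapse = NULL)

# Uncollapsed multi-line documents cannot fit in a character column,
# so they have to be stored as a list column (assumed base R analogue)
docs <- data.frame(id = c("doc1", "doc2"))
docs$text <- list(doc_lines, "a single line")

length(joined_newline)  # one string per document
length(kept)            # one element per line
is.list(docs$text)      # the text column is a list column
```

With a string for collapse, each document contributes exactly one string, which keeps text an ordinary character column; with NULL, the per-line structure is preserved at the cost of a list column.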
library(dplyr) # displaying tbl_dfs
if (requireNamespace("tm", quietly = TRUE)) {
  library(tm)
  # tm package examples
  txt <- system.file("texts", "txt", package = "tm")
  ovid <- VCorpus(DirSource(txt, encoding = "UTF-8"),
                  readerControl = list(language = "lat"))
  ovid
  tidy(ovid)
  # choose different options for collapsing text within each
  # document
  tidy(ovid, collapse = "")$text
  tidy(ovid, collapse = NULL)$text
  # another example from Reuters articles
  reut21578 <- system.file("texts", "crude", package = "tm")
  reuters <- VCorpus(DirSource(reut21578),
                     readerControl = list(reader = readReut21578XMLasPlain))
  reuters
  tidy(reuters)
}
#> Loading required package: NLP
#>
#> Attaching package: ‘NLP’
#> The following object is masked from ‘package:ggplot2’:
#>
#> annotate
#> # A tibble: 20 × 17
#> author datetimestamp description heading id language origin topics
#> <chr> <dttm> <chr> <chr> <chr> <chr> <chr> <chr>
#> 1 NA 1987-02-26 17:00:56 "" DIAMON… 127 en Reute… YES
#> 2 BY TED … 1987-02-26 17:34:11 "" OPEC M… 144 en Reute… YES
#> 3 NA 1987-02-26 18:18:00 "" TEXACO… 191 en Reute… YES
#> 4 NA 1987-02-26 18:21:01 "" MARATH… 194 en Reute… YES
#> 5 NA 1987-02-26 19:00:57 "" HOUSTO… 211 en Reute… YES
#> 6 NA 1987-03-01 03:25:46 "" KUWAIT… 236 en Reute… YES
#> 7 By Jere… 1987-03-01 03:39:14 "" INDONE… 237 en Reute… YES
#> 8 NA 1987-03-01 05:27:27 "" SAUDI … 242 en Reute… YES
#> 9 NA 1987-03-01 08:22:30 "" QATAR … 246 en Reute… YES
#> 10 NA 1987-03-01 18:31:44 "" SAUDI … 248 en Reute… YES
#> 11 NA 1987-03-02 01:05:49 "" SAUDI … 273 en Reute… YES
#> 12 NA 1987-03-02 07:39:23 "" GULF A… 349 en Reute… YES
#> 13 NA 1987-03-02 07:43:22 "" SAUDI … 352 en Reute… YES
#> 14 NA 1987-03-02 07:43:41 "" KUWAIT… 353 en Reute… YES
#> 15 NA 1987-03-02 08:25:42 "" PHILAD… 368 en Reute… YES
#> 16 NA 1987-03-02 11:20:05 "" STUDY … 489 en Reute… YES
#> 17 NA 1987-03-02 11:28:26 "" STUDY … 502 en Reute… YES
#> 18 NA 1987-03-02 12:13:46 "" UNOCAL… 543 en Reute… YES
#> 19 By BERN… 1987-03-02 14:38:34 "" NYMEX … 704 en Reute… YES
#> 20 NA 1987-03-02 14:49:06 "" ARGENT… 708 en Reute… YES
#> # ℹ 9 more variables: lewissplit <chr>, cgisplit <chr>, oldid <chr>,
#> # topics_cat <named list>, places <named list>, people <chr>, orgs <chr>,
#> # exchanges <chr>, text <chr>