Tidy a Corpus object from the tm package. Returns a data frame with one row per document, with a text column containing the document's text and one column for each local (per-document) metadata tag. For corpus objects from the quanteda package, see tidy.corpus.

# S3 method for Corpus
tidy(x, collapse = "\n", ...)

Arguments

x

A Corpus object, such as a VCorpus or PCorpus

collapse

A string used to collapse the text within each document (if a document has multiple lines). Give NULL to not collapse strings, in which case the text column will be a list column if there are multi-line documents.

...

Extra arguments, not used
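
The effect of the collapse argument can be illustrated in base R, without tm installed. This is only a sketch of the behavior described above, not the package's implementation: the helper collapse_text() and the sample data doc_lines are hypothetical names introduced for illustration, and each document is assumed to be a character vector with one element per line.

```r
# Hypothetical stand-in for a corpus: each document is a character
# vector with one element per line of text.
doc_lines <- list(
  doc1 = c("first line", "second line"),
  doc2 = "only line"
)

# Illustrative helper (not part of tidytext or tm) that mimics what the
# collapse argument is documented to do.
collapse_text <- function(docs, collapse = "\n") {
  if (is.null(collapse)) {
    # No collapsing: keep one character vector per document,
    # which corresponds to a list column in the tidied output.
    return(docs)
  }
  # Join each document's lines into a single string.
  vapply(docs, paste, character(1), collapse = collapse)
}

collapsed <- collapse_text(doc_lines)                     # one string per document
uncollapsed <- collapse_text(doc_lines, collapse = NULL)  # list of line vectors
```

With a string for collapse, each document becomes a single string, so the text column can be an ordinary character column. With NULL, multi-line documents keep their lines separate, which is why the text column becomes a list column.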

Examples

library(dplyr)    # displaying tbl_dfs
library(tidytext) # provides the tidy() method for Corpus objects

if (requireNamespace("tm", quietly = TRUE)) {
  library(tm)

  # tm package examples
  txt <- system.file("texts", "txt", package = "tm")
  ovid <- VCorpus(DirSource(txt, encoding = "UTF-8"),
                  readerControl = list(language = "lat"))

  ovid
  tidy(ovid)

  # choose different options for collapsing text within each
  # document
  tidy(ovid, collapse = "")$text
  tidy(ovid, collapse = NULL)$text

  # another example from Reuters articles
  reut21578 <- system.file("texts", "crude", package = "tm")
  reuters <- VCorpus(DirSource(reut21578),
                     readerControl = list(reader = readReut21578XMLasPlain))
  reuters

  tidy(reuters)
}
#> Loading required package: NLP
#>
#> Attaching package: ‘NLP’
#> The following object is masked from ‘package:ggplot2’:
#>
#>     annotate
#> # A tibble: 20 x 17
#>    author datetimestamp       description heading id    language origin topics
#>    <chr>  <dttm>              <chr>       <chr>   <chr> <chr>    <chr>  <chr>
#>  1 NA     1987-02-26 17:00:56 ""          DIAMON… 127   en       Reute… YES
#>  2 BY TE… 1987-02-26 17:34:11 ""          OPEC M… 144   en       Reute… YES
#>  3 NA     1987-02-26 18:18:00 ""          TEXACO… 191   en       Reute… YES
#>  4 NA     1987-02-26 18:21:01 ""          MARATH… 194   en       Reute… YES
#>  5 NA     1987-02-26 19:00:57 ""          HOUSTO… 211   en       Reute… YES
#>  6 NA     1987-03-01 03:25:46 ""          KUWAIT… 236   en       Reute… YES
#>  7 By Je… 1987-03-01 03:39:14 ""          INDONE… 237   en       Reute… YES
#>  8 NA     1987-03-01 05:27:27 ""          SAUDI … 242   en       Reute… YES
#>  9 NA     1987-03-01 08:22:30 ""          QATAR … 246   en       Reute… YES
#> 10 NA     1987-03-01 18:31:44 ""          SAUDI … 248   en       Reute… YES
#> 11 NA     1987-03-02 01:05:49 ""          SAUDI … 273   en       Reute… YES
#> 12 NA     1987-03-02 07:39:23 ""          GULF A… 349   en       Reute… YES
#> 13 NA     1987-03-02 07:43:22 ""          SAUDI … 352   en       Reute… YES
#> 14 NA     1987-03-02 07:43:41 ""          KUWAIT… 353   en       Reute… YES
#> 15 NA     1987-03-02 08:25:42 ""          PHILAD… 368   en       Reute… YES
#> 16 NA     1987-03-02 11:20:05 ""          STUDY … 489   en       Reute… YES
#> 17 NA     1987-03-02 11:28:26 ""          STUDY … 502   en       Reute… YES
#> 18 NA     1987-03-02 12:13:46 ""          UNOCAL… 543   en       Reute… YES
#> 19 By BE… 1987-03-02 14:38:34 ""          NYMEX … 704   en       Reute… YES
#> 20 NA     1987-03-02 14:49:06 ""          ARGENT… 708   en       Reute… YES
#> # … with 9 more variables: lewissplit <chr>, cgisplit <chr>, oldid <chr>,
#> #   topics_cat <named list>, places <named list>, people <chr>, orgs <chr>,
#> #   exchanges <chr>, text <chr>