* The `collapse` argument to the `unnest_*()` functions now takes either `NULL` (do not collapse text across rows for tokenizing) or a character vector of variables (use said variables to collapse text across rows for tokenizing). This fixes a long-standing bug and provides more consistent behavior, but does change results in many situations (such as n-gram tokenization).
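A minimal sketch of the two options, assuming a data frame with a grouping column `doc` and a text column `txt` (both names hypothetical):

```r
library(dplyr)
library(tidytext)

df <- tibble(doc = c(1, 1), txt = c("tidy text", "mining"))

# collapse = NULL: tokenize each row on its own
df %>% unnest_tokens(bigram, txt, token = "ngrams", n = 2, collapse = NULL)

# collapse = "doc": combine rows within each doc before tokenizing,
# so n-grams can span the original row boundaries
df %>% unnest_tokens(bigram, txt, token = "ngrams", n = 2, collapse = "doc")
```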
* `reorder_within()` now handles multiple variables, thanks to @tmastny (#170)
* Added a `to_lower` argument to other tokenizing functions, for more consistent behavior (#175)
* Added a `glance()` method for stm's estimated regressions, thanks to @vincentarelbundock (#176)
* Fixed the `augment()` function for stm topic models.
* Switched to `tibble()` where appropriate, thanks to @luisdza (#136).
* `unnest_tokens()` can now unnest a data frame with a list column (which formerly threw the error `unnest_tokens expects all columns of input to be atomic vectors (not lists)`). The unnested result repeats the objects within each list. (This is still not possible when `collapse = TRUE`, in which case tokens can span multiple lines.)
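A short sketch of the new list-column behavior, with hypothetical column names `txt` and `meta`:

```r
library(dplyr)
library(tidytext)

df <- tibble(txt  = c("hello world", "tidy text"),
             meta = list(1:2, letters[1:3]))  # a list column

# Previously this raised an error; now each row's list object
# is repeated alongside every token produced from that row
unnest_tokens(df, word, txt)
```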
* Added `get_tidy_stopwords()` to obtain stopword lexicons in multiple languages in a tidy format.
* Added a dataset `nma_words` of negators, modals, and adverbs that affect sentiment analysis (#55).
* `get_sentiments()` now works regardless of whether tidytext has been loaded or not (#50).
* `unnest_tokens()` now supports data.table objects (#37).
* Fixed `unnest_tokens()` to work properly for all tokenizing options.
* Updated `glance.corpus()`, tests, and the vignette for changes to the quanteda API.
* Removed the deprecated `pair_count()` function, which is now in the in-development widyr package.
* `unnest_tokens()` preserves custom attributes of data frames and data.tables.
* Fixed `cast_dtm()` and other sparse casters to ignore groups in the input (#19).
* Changed `unnest_tokens()` so that it no longer uses tidyr's `unnest()`, but rather a custom version that removes some overhead. In some experiments, this sped up `unnest_tokens()` on large inputs by about 40%. This also moves tidyr from Imports to Suggests for now.
* `unnest_tokens()` now checks that there are no list columns in the input, and raises an error if any are present (since those cannot be unnested).
* Added a `format` argument to `unnest_tokens()` so that it can process html, xml, latex, or man pages using the hunspell package, though only when `token = "words"`.
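A minimal sketch of the `format` argument (the hunspell package must be installed; the column names are hypothetical):

```r
library(dplyr)
library(tidytext)

latex_df <- tibble(txt = "\\textbf{Tidy} text \\emph{mining}")

# format = "latex" strips the markup before tokenizing;
# non-plain formats only work with token = "words"
unnest_tokens(latex_df, word, txt, format = "latex")
```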
* Added a `get_sentiments()` function that takes the name of a lexicon ("nrc", "bing", or "afinn") and returns just that sentiment data frame (#25).
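For example, a quick sketch of pulling a single lexicon and filtering it (the "bing" lexicon has `word` and `sentiment` columns):

```r
library(tidytext)

# Returns a tidy data frame for just the requested lexicon
bing <- get_sentiments("bing")

# e.g. keep only the positive words
dplyr::filter(bing, sentiment == "positive")
```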
* Updated `cast_sparse()` to work with dplyr 0.5.0.
* Deprecated the `pair_count()` function, which has been moved to `pairwise_count()` in the widyr package. It will be removed entirely in a future version.
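A sketch of the replacement call, assuming widyr is installed and a one-token-per-row data frame with hypothetical columns `doc` and `word`:

```r
library(dplyr)
library(widyr)

words <- tibble(doc  = c(1, 1, 2, 2),
                word = c("tidy", "text", "tidy", "mining"))

# Count how often pairs of words co-occur within the same doc,
# replacing the deprecated tidytext::pair_count()
pairwise_count(words, word, doc)
```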