• Updated the tidy method for a quanteda dfm because of the upcoming release of Matrix (#218)
  • scale_x_reordered() and scale_y_reordered() now take a function, labels, as their main input (#200)
  • Fixed how to_lower is passed to the underlying tokenization function for character shingles (#208)
  • Added support for tidying STM models that use content, thanks to @jonathanvoelkle (#209)
  • Update testing for rlang change + testthat 3e
  • Check for installation of stopwords more gracefully
  • Update tidiers and casters for new version of quanteda
  • Use vdiffr conditionally
  • Bug fix/breaking change for the collapse argument of the unnest_*() functions. This argument now takes either NULL (do not collapse text across rows before tokenizing) or a character vector of variables (collapse text across rows, grouped by those variables, before tokenizing). This fixes a long-standing bug and provides more consistent behavior, but it does change results in many situations (such as n-gram tokenization).
  • Move one vignette to pkgdown site, because of dependency removal
  • Move all CI from Travis to GH actions
  • reorder_within() now handles multiple variables, thanks to @tmastny (#170)
  • Move stopwords to Suggests so tidytext can be installed on older versions of R
  • Pass to_lower argument to other tokenizing functions, for more consistent behavior (#175)
  • Add glance() method for stm’s estimated regressions, thanks to @vincentarelbundock (#176)
  • Update tidying test for new tibble release (inner names for columns)
  • Deprecate SE versions of main functions (have long been replaced by tidy eval semantics)
  • Improve error handling throughout
  • Wrapper tokenization functions for n-grams, characters, sentences, tweets, and more, thanks to @ColinFay (#137).
  • Simplify get_sentiments() thanks to @jennybc (#151).
  • Fix flaky tests for corpus tidiers.
  • Access NRC lexicon via textdata package
  • Fix bug in augment() function for stm topic model.
  • Warn when tf-idf is negative, thanks to @EmilHvitfeldt (#112).
  • Switch from importing broom to importing generics, for lighter dependencies (#133).
  • Add functions for reordering factors (such as for ggplot2 bar plots) thanks to @tmastny (#110).
  • Update to tibble() where appropriate, thanks to @luisdza (#136).
  • Clarify documentation about impact of lowercase conversion on URLs (#139).
  • Change how sentiment lexicons are accessed from package (remove NRC lexicon entirely, access AFINN and Loughran lexicons via textdata package so they are no longer included in this package).
  • Improvements to documentation (#117)
  • Fix for NSE thanks to @lepennec (#122).
  • Tidier for estimated regressions from stm package thanks to @jefferickson (#115).
  • Tidier for correlated topic model from topicmodels package (#123).
  • Updates to documentation (#109) thanks to Emil Hvitfeldt.
  • Add new tokenizers for tweets, Penn Treebank to unnest_tokens().
  • Better error message (#111) and code styling.
  • Declare dependency for tests.
  • Updates to documentation (#102), README, and vignettes.
  • Add tokenizing by character shingles thanks to Kanishka Misra (#105).
  • Fix tests for skip grams thanks to Lincoln Mullen (#106).
  • Updated more docs/tests so package can build on R-oldrel. (Still trying!)
  • unnest_tokens can now unnest a data frame with a list column (which formerly threw the error "unnest_tokens expects all columns of input to be atomic vectors (not lists)"). The unnested result repeats the objects within each list. (It’s still not possible when collapse = TRUE, in which case tokens can span multiple lines.)
  • Add get_tidy_stopwords() to obtain stopword lexicons in multiple languages in a tidy format.
  • Add a dataset nma_words of negators, modals, and adverbs that affect sentiment analysis (#55).
  • Updated various vignettes/docs/tests so package can build on R-oldrel.
  • Change how NA values are handled in unnest_tokens so they no longer cause other columns to become NA (#82).
  • Update tidiers and casters to align with quanteda v1.0 (#87).
  • Handle input/output object classes (such as data.table) consistently (#88).
  • Fix tidier for quanteda dictionary for correct class (#71).
  • Add a pkgdown site.
  • Convert NSE from underscored functions to tidyeval (unnest_tokens, bind_tf_idf, all sparse casters) (#67, #74).
  • Added tidiers for topic models from the stm package (#51).
  • get_sentiments now works regardless of whether tidytext has been loaded or not (#50).
  • unnest_tokens now supports data.table objects (#37).
  • Fixed to_lower parameter in unnest_tokens to work properly for all tokenizing options.
  • Updated tidy.corpus, glance.corpus, tests, and vignette for changes to quanteda API
  • Removed the deprecated pair_count function, which is now in the in-development widyr package
  • Added tidiers for LDA models from the mallet package
  • Added the Loughran and McDonald dictionary of sentiment words specific to financial reports
  • unnest_tokens preserves custom attributes of data frames and data.tables
  • Updated DESCRIPTION to require purrr >= 0.1.1.
  • Fixed cast_sparse, cast_dtm, and other sparse casters to ignore groups in the input (#19)
  • Changed unnest_tokens so that it no longer uses tidyr’s unnest, but rather a custom version that removes some overhead. In some experiments, this sped up unnest_tokens on large inputs by about 40%. This also moves tidyr from Imports to Suggests for now.
  • unnest_tokens now checks that there are no list columns in the input, and raises an error if any are present (since those cannot be unnested).
  • Added a format argument to unnest_tokens so that it can process html, xml, latex or man pages using the hunspell package, though only when token = "words".
  • Added a get_sentiments function that takes the name of a lexicon (“nrc”, “bing”, or “sentiment”) and returns just that sentiment data frame (#25)
  • Added documentation for n-grams, skip n-grams, and regex
  • Added codecov and appveyor
  • Added tidiers for LDA objects from topicmodels and a vignette on topic modeling
  • Added function to calculate tf-idf of a tidy text dataset and a tf-idf vignette
  • Fixed a bug when tidying by line/sentence/paragraph/regex and there are multiple non-text columns
  • Fixed a bug when unnesting using n-grams and skip n-grams (entire text was not being collapsed)
  • Added ability to pass a (custom tokenizing) function to token. Also added a collapse argument that makes explicit the choice of whether to combine lines before tokenizing.
  • Changed tidy.dictionary to return a tbl_df rather than a data.frame
  • Updated cast_sparse to work with dplyr 0.5.0
  • Deprecated the pair_count function, which has been moved to pairwise_count in the widyr package. This will be removed entirely in a future version.
  • Initial release for text mining using tidy tools
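
The breaking change to the collapse argument described above is easiest to see on a small example. The following is a minimal sketch, assuming a tidytext version that includes that change and that dplyr is installed; the toy data and the names lines, doc, by_row, and by_doc are illustrative only and do not come from the package.

```r
library(dplyr)
library(tidytext)

# Toy data (hypothetical): document "a" is split across two rows.
lines <- tibble(
  doc  = c("a", "a", "b"),
  text = c("one two", "three four", "five six")
)

# collapse = NULL: each row is tokenized on its own,
# so no bigram can span two rows.
by_row <- lines %>%
  unnest_tokens(bigram, text, token = "ngrams", n = 2, collapse = NULL)

# collapse = "doc": rows that share a doc value are combined before
# tokenizing, so bigrams can cross the original row boundary
# within a document, but not between documents.
by_doc <- lines %>%
  unnest_tokens(bigram, text, token = "ngrams", n = 2, collapse = "doc")

print(by_row)
print(by_doc)
```

Comparing the two results shows why n-gram output can differ from earlier releases: with a grouping variable supplied, tokens that straddle a row boundary inside one document appear, while they are absent when collapse is NULL.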