For the last section of the tutorial, we’ll use the text package and some bleeding-edge tools available from the open-source model hub Hugging Face.
Fair warning: all the packages and models we’re using here are still experimental, because this whole area is very new. The code listed here works as of our tests last night, but there is no guarantee it will still work tomorrow.
library(tidyverse)  # String handling
library(text)       # Does all the work (but in Python)
library(reticulate) # Lets R run the Python behind the text package
To run through this, you’ll need to install the text package, which can sometimes be a challenge. In theory, you can run the following code, copied here from the text package’s installation guide:
# Install text required python packages in a conda environment (with defaults).
text::textrpp_install()
# Show available conda environments.
reticulate::conda_list()
# Initialize the installed conda environment.
# save_profile = TRUE saves the settings so that you don't have to run textrpp_initialize() after restarting R.
text::textrpp_initialize(save_profile = TRUE)
# Test the installation:
text::textEmbed("hello")
No guarantee is made that this will work, but I’ve found that it succeeds most of the time.
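In my experience, the most common failure is that no conda installation can be found on the machine. A minimal fallback sketch, assuming reticulate’s bundled miniconda works on your platform:

# Install a minimal conda via reticulate, then retry the text setup.
reticulate::install_miniconda()
text::textrpp_install()
text::textrpp_initialize(save_profile = TRUE)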
At its default settings, the text package will produce a decent summary of a passage.
psu_text <- "The Pennsylvania State University is a multi-campus, land-grant, public research university that educates students from around the world and supports individuals and communities through integrated programs of teaching, research, and service.
The Pennsylvania State University's discovery-oriented, collaborative, and interdisciplinary research and scholarship promote human and economic development, global understanding, and advancement in professional practice through the expansion of knowledge and its applications in the natural and applied sciences, social and behavioral sciences, engineering, technology, arts and humanities, and myriad professions."
textSum(psu_text)
## x completed: Duration: 1.457280 secs
## # A tibble: 1 × 1
## sum_x
## <chr>
## 1 the Pennsylvania State University is a multi-campus, land-grant, public resea…
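The default summary is quite short and gets clipped mid-sentence. The textSum() documentation lists min_length and max_length arguments that control this; the argument names below are taken from the current docs, so check ?textSum if they error, since the API is still settling.

# Ask for a somewhat longer summary (lengths are in tokens, roughly words).
textSum(psu_text, min_length = 30, max_length = 80)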
This simplicity is due to the ready availability of pre-trained, highly useful, and surprisingly small models that have been made public.
By default, the text package will automatically:

1. Download a Hugging Face model (by default, t5-small from Google, which is tuned for small computers and has only about 60 million parameters).
2. Tokenize the input text.
3. Pass it sequentially through the model with a given choice of output.
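The model in step 1 can be swapped out by name via the model argument. A quick sketch; the model IDs below are real Hugging Face names, but whether a given model works with a given function depends on its architecture, so treat this as illustrative:

textSum(psu_text, model = "t5-small")                         # the default summarizer, made explicit
embeddings <- textEmbed("hello", model = "bert-base-uncased") # a BERT model for embeddings instead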
The models are transformer models, which are very good at representing text strings, especially short ones. The resulting embedding can be used in much the same way as a word2vec or doc2vec embedding.
You can pull the embeddings out yourself and use PCA or clustering approaches on them, just like you could with word2vec output, or you can ask for another transformation. Common examples include text generation (see textGeneration()) and text summarization (see textSum()).
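To make the “just like word2vec output” point concrete, here is a minimal sketch of pulling the embeddings and running PCA. I’m assuming the current return structure of textEmbed(), where text-level embeddings sit in the $texts element as a tibble with one Dim column per dimension; this structure has shifted between versions, so run str() on the result if the indexing fails.

sentences <- c("Penn State is a public research university.",
               "Feature selection reduces the number of measures.",
               "The cat sat on the mat.")
emb <- textEmbed(sentences)
# Assumed structure: emb$texts[[1]] is a tibble with one row per input
# and one column per embedding dimension (Dim1, Dim2, ...).
emb_matrix <- as.matrix(emb$texts[[1]])
pca <- prcomp(emb_matrix) # plain PCA, exactly as you would with word2vec vectors
round(pca$x[, 1:2], 2)    # the first two principal components, one row per sentence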
There is no compact catalog of models; finding one means searching Hugging Face and reading the model cards of the different deep learners hosted there.
We’ll do a simplified example, just to show how easy the whole process is, by summarizing a scientific article: Brick et al. (2018), on the topic of feature selection. The file is included, but it is literally just a plain-text dump of the document.
readLines("FeatureSelectionPaper.txt", n = 1) # peek at the first line (the title)
## [1] "Feature Selection Methods for Optimal Design of Studies for Developmental Inquiry"
To run the model, we can simply grab the file and pass it to the textSum() function. It takes some time to run, and throws a warning because we’re overloading the rather simple (!) deep learner we’re using here: the paper is much longer than the model’s input window, so the input gets truncated.
fulltext <- read_file("FeatureSelectionPaper.txt") # the whole paper as one string
y <- textSum(fulltext)
## x completed: Duration: 39.254395 secs
The resulting summary is a pretty good definition of feature selection:
y
## # A tibble: 1 × 1
## sum_x
## <chr>
## 1 our goal is to select a small subset of measures that retains predictive powe…
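Because of the truncation, the model only actually saw the beginning of the paper. One workaround, sketched here as my own approach rather than an official text-package feature, is to split the document into chunks, summarize each chunk, and then summarize the combined summaries:

starts <- seq(1, nchar(fulltext), by = 2000)
chunks <- str_sub(fulltext, start = starts, end = starts + 1999) # crude fixed-width chunks
# Summarize each chunk, then summarize the concatenated summaries.
# (The column name sum_x matches the textSum() output shown above.)
chunk_sums <- map_chr(chunks, function(ch) textSum(ch)$sum_x)
textSum(str_c(chunk_sums, collapse = " "))

A real pipeline would split on sentence or paragraph boundaries rather than raw character counts, but this shows the shape of the approach.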
For more complete summaries, you will want to use bigger models, like t5-base (roughly 220 million parameters) or, at the top end, t5-11b (roughly 11 billion). These will often require larger computational resources (but are shockingly fast for all that).