This set of documents is a very broad overview of a few different types of text mining as they can be performed in the R language. We’ll use tidyverse-style syntax and the tidytext and text packages to do much of the work, but pull in other packages as needed.

This Text Mining Tutorial has three parts. Part 1 aims to walk through a simple example of data preprocessing and descriptive statistics. For this, we use a social media dataset. As a starting point, we read the data, compute some basic descriptives, and walk through the classical sequence of text data preprocessing. That includes tokenization to split out words, n-gram extraction to handle compound words, lemmatization and stemming, and then some quick visualizations.

Part 2 of the tutorial gets into the nitty-gritty of text mining: data representations via term-document (or document-term) and document-feature matrices, and applying tools like sentiment analysis, lexicon models, and topic modeling via latent Dirichlet allocation (LDA).

Finally, Part 3 walks through the application of deep learning tools, including deep word embeddings and document summarization.

Data Description

The first data set used in this tutorial is publicly available and retrieved from Kaggle.com.

The Social Media Sentiments Analysis Dataset captures a vibrant tapestry of emotions, trends, and interactions across various social media platforms. This dataset provides a snapshot of user-generated content, encompassing text, timestamps, hashtags, countries, likes, and retweets. Each entry unveils unique stories—moments of surprise, excitement, admiration, thrill, contentment, and more—shared by individuals worldwide.

Text: user-generated content showcasing sentiments
Sentiment: the categorized emotion (categorization choices unclear -tb)
Timestamp: date and time information
User: the unique ID of the user
Platform: the social media platform where the content originated
Hashtags: identifies trending topics and themes

Reference: Parmar, K. (2024). Social Media Sentiments Analysis Dataset, Version 3. Retrieved May 5, 2024 from https://www.kaggle.com/datasets/kashishparmar02/social-media-sentiments-analysis-dataset/data

PART 1: Text Data Handling and Descriptives

1. Loading Libraries

First, we need to load in the packages we’ll be using. I’ll assume that you have them all installed; if not, use install.packages() to install them.

library(tm)           # classic text mining infrastructure (used internally by wordcloud)
library(Matrix)       # sparse matrix support
library(tidyverse)
library(ggthemes)     # for colorblind-friendly color palettes
library(GGally)       # for ggpairs()
library(tidytext)     # handy for text mining 
library(textstem)     # lemmatization and stemming 
library(wordcloud)    # Wordcloud
library(RColorBrewer) # For more colors!
library(wordcloud2)   # Wordcloud
library(quanteda)     # DTMs and DFMs

2. Read Data

Next, we’ll read in the data set. It’s always important to be careful when reading in text data, since sometimes there will be issues with quotes or escapes, and the read-in will get messy. Here, it’s a clean data set, so we don’t have to worry too much about that.

socmedia_data <- read.csv("sentimentdataset.csv", header = TRUE)
head(socmedia_data)  # To make sure it read in right.
##   ID                                                 Text   Sentiment
## 1  0  Enjoying a beautiful day at the park!                Positive  
## 2  1  Traffic was terrible this morning.                   Negative  
## 3  2  Just finished an amazing workout! 💪                 Positive  
## 4  3  Excited about the upcoming weekend getaway!          Positive  
## 5  4  Trying out a new recipe for dinner tonight.          Neutral   
## 6  5  Feeling grateful for the little things in life.      Positive  
##             Timestamp           User    Platform
## 1 2023-01-15 12:30:00  User123         Twitter  
## 2 2023-01-15 08:45:00  CommuterX       Twitter  
## 3 2023-01-15 15:45:00  FitnessFan     Instagram 
## 4 2023-01-15 18:20:00  AdventureX      Facebook 
## 5 2023-01-15 19:55:00  ChefCook       Instagram 
## 6 2023-01-16 09:10:00  GratitudeNow    Twitter  
##                                     Hashtags Retweets Likes      Country Year
## 1  #Nature #Park                                   15    30    USA       2023
## 2  #Traffic #Morning                                5    10    Canada    2023
## 3  #Fitness #Workout                               20    40  USA         2023
## 4  #Travel #Adventure                               8    15    UK        2023
## 5  #Cooking #Food                                  12    25   Australia  2023
## 6    #Gratitude #PositiveVibes                     25    50    India     2023
##   Month Day Hour
## 1     1  15   12
## 2     1  15    8
## 3     1  15   15
## 4     1  15   18
## 5     1  15   19
## 6     1  16    9

The glimpse() function in the dplyr package is one way to quickly glance at the columns of the data set. The describe() function in the psych package also works well here.

glimpse(socmedia_data)
## Rows: 732
## Columns: 14
## $ ID        <int> 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17…
## $ Text      <chr> " Enjoying a beautiful day at the park!              ", " Tr…
## $ Sentiment <chr> " Positive  ", " Negative  ", " Positive  ", " Positive  ", …
## $ Timestamp <chr> "2023-01-15 12:30:00", "2023-01-15 08:45:00", "2023-01-15 15…
## $ User      <chr> " User123      ", " CommuterX    ", " FitnessFan   ", " Adve…
## $ Platform  <chr> " Twitter  ", " Twitter  ", " Instagram ", " Facebook ", " I…
## $ Hashtags  <chr> " #Nature #Park                            ", " #Traffic #Mo…
## $ Retweets  <int> 15, 5, 20, 8, 12, 25, 10, 15, 30, 18, 22, 7, 12, 28, 15, 20,…
## $ Likes     <int> 30, 10, 40, 15, 25, 50, 20, 30, 60, 35, 45, 15, 25, 55, 30, …
## $ Country   <chr> " USA      ", " Canada   ", " USA        ", " UK       ", " …
## $ Year      <int> 2023, 2023, 2023, 2023, 2023, 2023, 2023, 2023, 2023, 2023, …
## $ Month     <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ Day       <int> 15, 15, 15, 15, 15, 16, 16, 16, 17, 17, 17, 18, 18, 18, 19, …
## $ Hour      <int> 12, 8, 15, 18, 19, 9, 14, 19, 8, 12, 15, 10, 14, 18, 9, 13, …

At this point, we’re only looking for errors and issues. One thing that stands out here is how many spaces there are after country names and around the actual text. As of now, that’s not an issue, but we’ll keep it in mind for later.

3. Frequency of a few Columns with Visualization

The count() function (also from dplyr) will return a count of each of the unique values of a variable. Here, we’ll take a look at the platform frequency so that we know what we’re dealing with.

platform_count<- socmedia_data %>% count(Platform)

head(platform_count)
##      Platform   n
## 1   Facebook  231
## 2  Instagram  258
## 3    Twitter  128
## 4   Twitter   115

We can see there are two different rows for Twitter. This is probably related to that spacing problem we saw before. We can double-check by looking closer:

platform_count$Platform
## [1] " Facebook "  " Instagram " " Twitter "   " Twitter  "

Yes. One of the Twitter entries has two trailing spaces, the other has only one. This might result from different data collection tools, or some sort of version difference (like the switch to X, possibly) during data collection. But the cause doesn't matter: we can trim the spaces with str_trim() from the stringr package, which collapses both down to a single Twitter value in our data set.
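
Here's str_trim() on a toy string; its relative str_squish() also collapses interior runs of spaces, which can matter for multi-word fields like country names:

str_trim(" Twitter  ")           # "Twitter"
str_squish("  New   Zealand  ")  # "New Zealand"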

We saw in our glimpse() that User, Country, Platform, Hashtags, and Sentiment all suffer from the same problem, so we'll trim them all at once:

socmedia_data <- socmedia_data %>%
                  mutate(across(c(Platform, Country, User, Sentiment, Hashtags),
                                str_trim))

It’s worth checking that that worked:

glimpse(socmedia_data)
## Rows: 732
## Columns: 14
## $ ID        <int> 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17…
## $ Text      <chr> " Enjoying a beautiful day at the park!              ", " Tr…
## $ Sentiment <chr> "Positive", "Negative", "Positive", "Positive", "Neutral", "…
## $ Timestamp <chr> "2023-01-15 12:30:00", "2023-01-15 08:45:00", "2023-01-15 15…
## $ User      <chr> "User123", "CommuterX", "FitnessFan", "AdventureX", "ChefCoo…
## $ Platform  <chr> "Twitter", "Twitter", "Instagram", "Facebook", "Instagram", …
## $ Hashtags  <chr> "#Nature #Park", "#Traffic #Morning", "#Fitness #Workout", "…
## $ Retweets  <int> 15, 5, 20, 8, 12, 25, 10, 15, 30, 18, 22, 7, 12, 28, 15, 20,…
## $ Likes     <int> 30, 10, 40, 15, 25, 50, 20, 30, 60, 35, 45, 15, 25, 55, 30, …
## $ Country   <chr> "USA", "Canada", "USA", "UK", "Australia", "India", "Canada"…
## $ Year      <int> 2023, 2023, 2023, 2023, 2023, 2023, 2023, 2023, 2023, 2023, …
## $ Month     <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ Day       <int> 15, 15, 15, 15, 15, 16, 16, 16, 17, 17, 17, 18, 18, 18, 19, …
## $ Hour      <int> 12, 8, 15, 18, 19, 9, 14, 19, 8, 12, 15, 10, 14, 18, 9, 13, …
# Counting each platform occurrence using count() function again 
platform_count_unspaced <- socmedia_data %>%
  count(Platform)

# Print the result. We could still use head, but now we just have 3 rows. 
print(platform_count_unspaced)
##    Platform   n
## 1  Facebook 231
## 2 Instagram 258
## 3   Twitter 243

Much better. We can plot this quickly just as an excuse to show ggplot to anyone unfamiliar with it.

#Visualize the frequency of social media platforms in the data
smp <- ggplot(platform_count_unspaced, aes(x=Platform, y=n)) + 
  geom_col(aes(fill=Platform), color="black") +
  scale_fill_colorblind()
smp

We can similarly plot countries, just to make sure the space removal worked.

country_count<- socmedia_data %>% 
                    count(Country) %>%
                    top_n(10, wt=n)
ggplot(country_count, aes(x=Country, y=n)) + 
  geom_col(aes(fill=Country), color="black") +
  theme(axis.text.x = element_text(angle = 45))

And sentiments, similarly.

sentiment_count<- socmedia_data %>% 
                    count(Sentiment) %>%
                    top_n(20, wt=n)
ggplot(sentiment_count, aes(x=Sentiment, y=n)) + 
  geom_col(aes(fill=Sentiment), color="black") +
  theme(axis.text.x = element_text(angle = 45, hjust=1))

We can see some very clearly positive sentiments (like Joy, Hopeful, Gratitude, and Excitement) and clearly negative ones (like Sad and Embarrassed). Neutral, Curiosity, and Contentment fall somewhere in between.

It is not clear that these labels come from any sort of structured emotion space, but for our purposes that doesn’t matter.

4. Tokenization

Next, we’ll take a look at the actual text we’re interested in.

head(socmedia_data$Text)
## [1] " Enjoying a beautiful day at the park!              "
## [2] " Traffic was terrible this morning.                 "
## [3] " Just finished an amazing workout! 💪               "
## [4] " Excited about the upcoming weekend getaway!        "
## [5] " Trying out a new recipe for dinner tonight.        "
## [6] " Feeling grateful for the little things in life.    "

Before we dig in on the details of this, we need to break these posts down into smaller pieces, a process called Tokenization.

Tokenization in text mining involves breaking down a large body of text into smaller pieces called “tokens”. Arguably, tokens could be phrases, sentences, or letters, but most often we are looking to split the text into words. The unnest_tokens() function from the tidytext package does this automatically using some sensible defaults.
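
To see what those defaults do (lowercasing and punctuation stripping), here is a one-line check on a single made-up post:

tibble(text = "Enjoying a BEAUTIFUL day at the park!") %>%
  unnest_tokens(word, text)
# yields one row per word: enjoying, a, beautiful, day, at, the, park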

Note that we want to keep track of which post each word came from, so we'll hang onto the ID variable as an identifier.

#Extract from Text column and store it in the column `word`
token.socmedia <- socmedia_data %>%
  unnest_tokens(word, Text) 

head(token.socmedia$word)
## [1] "enjoying"  "a"         "beautiful" "day"       "at"        "the"

This expands our data set to 9710 records, one per word token from the posts in the social media dataset.

# Let's see what are the most common words in the Text 
token.socmedia %>%
  count(word, sort = TRUE) %>%
  head()
##   word   n
## 1  the 830
## 2    a 656
## 3   of 623
## 4   in 300
## 5   to 134
## 6  and 111

We can see that common function words such as 'the', 'a', 'of', and 'in' occur most frequently. That makes sense, but isn't very interesting. These words are called "stop words", and the process of removing them is called "stopping".

5. Stopping

To analyze the distinctive words used in the social media posts (Text), we will want to remove these uninformative words. This can be done with anti_join() against tidytext's stop_words list, but here we'll simply use filter(). (The anti_join() version is sketched after the output below.)

stopped.socmedia <- token.socmedia %>%
  filter(!word %in% stop_words$word)

head(stopped.socmedia$word)
## [1] "enjoying"  "beautiful" "day"       "park"      "traffic"   "terrible"

Notice that “a”, “in”, and “the” are now missing. And now we can re-generate our list of common words. Again, we could print the count, but plots are often easier to read.

# Final Frequency of the 'words' arranged in descending order
stopped.frequencies<- stopped.socmedia %>%
  count(word, sort=TRUE) %>%
  mutate(word = reorder(word, n))

ggplot(stopped.frequencies %>% top_n(20, wt=n), 
                aes(x=word, y=n, fill=word)) +
  geom_col() +
  theme(axis.text.x = element_text(angle = 45, hjust=1))

Well, this data set was developed to be focused on sentiment, so it makes sense that many of these are emotion words.

6. Tokenizing by n-grams

Tokenization is the first step in text analysis, and tokens set the stage for deeper processing. To highlight one limitation of word-level tokenization, though, consider the following passage, adapted from The Pennsylvania State University website about its mission and values.

# The custom text
psu_text <- "The Pennsylvania State University is a multi-campus, land-grant, public research university that educates students from around the world and supports individuals and communities through integrated programs of teaching, research, and service.
The Pennsylvania State University's discovery-oriented, collaborative, and interdisciplinary research and scholarship promote human and economic development, global understanding, and advancement in professional practice through the expansion of knowledge and its applications in the natural and applied sciences, social and behavioral sciences, engineering, technology, arts and humanities, and myriad professions."

# Make it a tibble for analysis.
psu_tibble <- tibble(text = psu_text)

When we tokenize this paragraph, we can start to see one challenge of tokenization.

psu_tibble %>% 
      unnest_tokens(word, text) %>% 
      count(word, sort=TRUE) %>%
      top_n(20, wt=n)
## # A tibble: 59 × 2
##    word             n
##    <chr>        <int>
##  1 and             12
##  2 the              5
##  3 research         3
##  4 in               2
##  5 of               2
##  6 pennsylvania     2
##  7 sciences         2
##  8 state            2
##  9 through          2
## 10 university       2
## # ℹ 49 more rows

It’s notable that Pennsylvania is a common word. But it’s not ever used in the text to refer to the Commonwealth of Pennsylvania. Instead, it’s part of a larger compound: “The Pennsylvania State University.” This is an example of an n-gram.

N-grams are contiguous sequences of n items from a sample of text or speech, where n can be any number. N-grams capture context in text data that single words (1-grams) miss, and they can improve understanding of the text.

A 2-gram is usually referred to as a bigram.

# Unnest the text to extract bigrams
example_bigrams <- psu_tibble %>%
  unnest_tokens(bigram, text, token = "ngrams", n = 2) %>%
  filter(!is.na(bigram))

print(head(example_bigrams))
## # A tibble: 6 × 1
##   bigram            
##   <chr>             
## 1 the pennsylvania  
## 2 pennsylvania state
## 3 state university  
## 4 university is     
## 5 is a              
## 6 a multi

This kind of cascading overlap usually indicates that you need to go up another level of n-gram. People will sometimes refer to 3-grams as trigrams.

# Unnest the text to extract trigrams
example_trigrams <- psu_tibble %>%
  unnest_tokens(trigram, text, token = "ngrams", n = 3) %>%
  filter(!is.na(trigram))

print(head(example_trigrams))
## # A tibble: 6 × 1
##   trigram                      
##   <chr>                        
## 1 the pennsylvania state       
## 2 pennsylvania state university
## 3 state university is          
## 4 university is a              
## 5 is a multi                   
## 6 a multi campus

I’ve never heard anybody use the word quatrigram; they are just 4-grams at this point.

# Unnest the text to extract quadragrams (four)

example_4grams <- psu_tibble %>%
  unnest_tokens(quadgram, text, token = "ngrams", n = 4) %>%
  filter(!is.na(quadgram))

print(head(example_4grams))
## # A tibble: 6 × 1
##   quadgram                         
##   <chr>                            
## 1 the pennsylvania state university
## 2 pennsylvania state university is 
## 3 state university is a            
## 4 university is a multi            
## 5 is a multi campus                
## 6 a multi campus land

We can see that 4-grams make the most sense here, at least for capturing the full name of the University. In practice, we will mostly look for bigrams or trigrams that are meaningful, common, and notably different from their component words. So the bigram "multi campus", here, might be valuable.
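
The same extraction applies directly to our posts; a sketch (output not shown):

# Most common bigrams in the social media posts themselves
socmedia_data %>%
  unnest_tokens(bigram, Text, token = "ngrams", n = 2) %>%
  filter(!is.na(bigram)) %>%
  count(bigram, sort = TRUE) %>%
  head()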

It is worth noting that stopwords can be a part of n-grams, as well. It can be helpful to filter those:

example_bigrams <- example_bigrams %>% 
  separate(bigram, c("first.word", "second.word"), sep=" ") %>%
  filter(!first.word %in% stop_words$word) %>%
  filter(!second.word %in% stop_words$word) %>%
  unite(bigram, first.word, second.word, sep=" ")
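
A quick count then surfaces the remaining content-bearing pairs (output not shown):

example_bigrams %>%
  count(bigram, sort = TRUE) %>%
  head()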

We’ll return to n-grams when we talk about ‘descriptives’.

7. Lemmatization and Stemming

Lemmatization is a pre-processing step in Natural Language Processing (NLP) that reduces a word to its base or dictionary form, known as the 'lemma'. So caring is reduced to care. A related process, stemming, is simpler: it chops off endings by rule, so a crude stemmer might reduce caring to car. Lemmatization is generally more accurate than stemming, but requires a lookup table of lemmas.
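
Both operations are available in the textstem package. A quick side-by-side on a few words; the outputs in the comments are what the default Porter stemmer and lemma dictionary typically return:

demo_words <- c("caring", "amazing", "studies", "feet")
stem_words(demo_words)       # stems:  e.g. "care"  "amaz"  "studi" "feet"
lemmatize_words(demo_words)  # lemmas: e.g. "care"  "amaze" "study" "foot"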

The textstem package will happily lemmatize our posts:

# Apply lemmatization to the text column
socmedia_data <- socmedia_data %>% mutate(lemma_text = lemmatize_strings(Text))
# Note: use lemmatize_words() if you're doing this post-tokenization.

The difference can be seen fairly easily in just the first few tweets:

socmedia_data[1:3, c("Text", "lemma_text")]
##                                                   Text
## 1  Enjoying a beautiful day at the park!              
## 2  Traffic was terrible this morning.                 
## 3  Just finished an amazing workout! 💪               
##                           lemma_text
## 1 enjoy a beautiful day at the park!
## 2  Traffic be terrible this morning.
## 3    Just finish a amaze workout! 💪

You can also see why this kind of processing is helpful to word counting: it reduces “amazing”, “amazed”, and “amaze” all to the same word.

It can also be helpful to transform words into hypernyms using a tool like wordnet. Hypernyms are category words; for example, “bird” is a hypernym of “robin”.

8. Documents and Corpora

When we want to look at differences between documents, we need a good representation for that. There are two very common representations: the term-document matrix, and the document-feature matrix. A term-document matrix has a row for each term and a column for each document, and a value in each location showing the frequency of that word in that document. A document-feature matrix is the transpose, with one row per document and one column per (important) term. It can be really useful here to stop and lemmatize before computing these, since extra words make the matrices much bigger.

Tidytext makes these very easy to create. We will need to lemmatize our stopped text first.

socmedia_dfm <- stopped.socmedia %>% 
                mutate(word = lemmatize_words(word)) %>% # lemmatize
                group_by(User, word) %>%
                summarize(n=n()) %>%
                ungroup() %>% 
                cast_dfm(User, word, n)
## `summarise()` has grouped output by 'User'. You can override using the
## `.groups` argument.
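
It's worth a quick sanity check on what we built; since cast_dfm() returns a quanteda object, quanteda's helpers apply (output not shown):

dim(socmedia_dfm)              # documents (users) by terms
topfeatures(socmedia_dfm, 10)  # most frequent lemmas overall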

9. Descriptive plots

One of the most common approaches to descriptive plotting of a text data set is the wordcloud. It's not the most informative display, but it is eye-catching and can give a good immediate impression of what's happening in a body of text. We can build one from our lemmatized text data using the wordcloud() function from the wordcloud package.

# Let's plot the lemmatized data using wordcloud()
wordcloud(socmedia_data$lemma_text,
          min.freq = 100,
          max.words = 300,
          random.order = FALSE,
          random.color = FALSE,
          rot.per=0.35,
          colors = brewer.pal(8, "Dark2"))
## Warning in tm_map.SimpleCorpus(corpus, tm::removePunctuation): transformation
## drops documents
## Warning in tm_map.SimpleCorpus(corpus, function(x) tm::removeWords(x,
## tm::stopwords())): transformation drops documents
## Warning in wordcloud(socmedia_data$lemma_text, min.freq = 100, max.words = 300,
## : memory could not be fit on page. It will not be plotted.
## (... roughly 200 further warnings of the same form omitted; each names a word
## that could not be fit on the page and so was not plotted ...)

The long run of warnings is mostly cosmetic: each one names a word that could not be fit on the page at its requested size. Raising min.freq, lowering max.words, or shrinking the scale argument quiets them.

The wordcloud2 package adds the ability to add a mask or shape to the wordcloud, which can be interesting.

# Another word cloud, this time from the stopped word frequencies (stopped.frequencies) and wordcloud2()
star<- wordcloud2(stopped.frequencies, size=0.1, minSize=1, color='random-dark', shape='star')

star

Word clouds are engaging and give a quick, basic insight into the text data, but not much more.

A more directly informative description of a set of texts is the tf-idf plot. Tf-idf stands for term frequency × inverse document frequency. The term frequency is how often the term appears in a given document; the inverse document frequency measures how rare the term is across all documents. Their product indicates how unexpectedly common the term is in this particular document: a high tf-idf means the term is characteristic of this document, while a low tf-idf means it is either rare here or common everywhere.
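
To make the arithmetic concrete, here is the definition worked through with made-up counts (tidytext's bind_tf_idf() uses the natural logarithm):

# A term appearing 3 times in a 36-word document, in 2 of 732 documents:
tf  <- 3 / 36        # term frequency
idf <- log(732 / 2)  # inverse document frequency (natural log)
tf * idf             # tf-idf, roughly 0.49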

We can compute these from tidytext, treating each user as a document.

socmedia_user <- stopped.socmedia %>% 
                mutate(word = lemmatize_words(word)) %>%
                count(User, word, sort=TRUE) %>%
                bind_tf_idf(word, document = User, n = n)
head(socmedia_user)
##              User       word n         tf      idf    tf_idf
## 1 CarnivalDreamer atmosphere 3 0.08333333 5.814131 0.4845109
## 2 CarnivalDreamer      candy 3 0.08333333 6.507278 0.5422731
## 3 CarnivalDreamer   carnival 3 0.08333333 5.408665 0.4507221
## 4 CarnivalDreamer   carousel 3 0.08333333 6.507278 0.5422731
## 5 CarnivalDreamer     cotton 3 0.08333333 6.507278 0.5422731
## 6 CarnivalDreamer      dream 3 0.08333333 3.462755 0.2885629

And draw plots of some individual users:

socmedia_user %>% filter(User == "CarnivalDreamer") %>%
    ggplot(aes(reorder(word, tf_idf), tf_idf)) + 
    geom_col() + 
    theme(axis.text.x = element_text(angle = 45, hjust=1))

And, with code adapted from the tidytext book, we can compare different users in their use of words.

# Filter out top users
big_posters <- socmedia_user %>% 
                group_by(User) %>% 
                summarize(n=n()) %>% 
                ungroup() %>%
                top_n(4, wt=n)
socmedia_user %>%
  filter(User %in% big_posters$User) %>%
  group_by(User) %>%
  slice_max(tf_idf, n = 15) %>%
  ungroup() %>%
  ggplot(aes(tf_idf, reorder(word, tf_idf), fill = User)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~User, ncol = 2, scales = "free") +
  labs(x = "tf-idf", y = NULL)

Similarly, we can compare, for example, hashtags instead of Users.

socmedia_hash <- stopped.socmedia %>% 
                mutate(word = lemmatize_words(word)) %>%
                count(Hashtags, word, sort=TRUE) %>%
                bind_tf_idf(word, document = Hashtags, n = n)
head(socmedia_hash)
##                         Hashtags          word n         tf      idf    tf_idf
## 1 #Compassionate #TearsOfEmpathy compassionate 3 0.08333333 5.846439 0.4872032
## 2 #Compassionate #TearsOfEmpathy    connection 3 0.08333333 4.342361 0.3618634
## 3 #Compassionate #TearsOfEmpathy       empathy 3 0.08333333 5.153292 0.4294410
## 4 #Compassionate #TearsOfEmpathy          fall 3 0.08333333 4.747826 0.3956522
## 5 #Compassionate #TearsOfEmpathy        garden 3 0.08333333 3.831536 0.3192946
## 6 #Compassionate #TearsOfEmpathy        gently 3 0.08333333 6.539586 0.5449655

These results can be visualized, but we’ll also use them in the next part to do clustering. They’re also great inputs for a classifier, or to distinguish groups from one another.
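
If you want to pursue that, the tf-idf weights cast straight into a sparse document-feature matrix, which most clustering and classification tools will accept. A sketch (the object name is ours):

# Users-by-words matrix with tf-idf weights instead of raw counts
user_tfidf <- socmedia_user %>%
  cast_dfm(User, word, tf_idf)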