Once again, we need to load in all the libraries we’ll be using.
library(tidyverse)
library(tidytext) # Tidy text mining
library(textstem) # Lemmatization and stemming
library(quanteda) # DTMs and DFMs
library(quanteda.textstats) # For distances
library(SentimentAnalysis) # For Sentiment Analysis
library(stm) # For LDA
library(tm) # DTM/TDM
library(broom) # For tidy()ing
If you haven’t run the part 1 document, we’ll run the pre-processing steps really quickly here:
socmedia_data <- read.csv("sentimentdataset.csv", header = TRUE) # Read the data
socmedia_data <- socmedia_data %>% # handle spaces in the labels
mutate(across(c(Platform, Country, User, Sentiment, Hashtags),
str_trim))
# Tokenize and remove stopwords
socmedia_stoptoken <- socmedia_data %>%
unnest_tokens(word, Text) %>%
filter(!word %in% stop_words$word)
# Additionally, make a DTM with posts as documents
socmedia_dtm <- socmedia_stoptoken %>%
mutate(word = lemmatize_words(word)) %>% # lemmatize
count(ID, word, sort=FALSE) %>%
arrange(ID) %>%
cast_dtm(ID, word, n)
Sentiment analysis is performed in a bag-of-words framework by comparing each term to a sentiment dictionary and summing the sentiment scores of the matched terms.
Most often, this is done with "positive", "negative", and "neutral" as the possible sentiment categories, rather than the many tags we have in the social media data set. We can do this quickly using analyzeSentiment() from the SentimentAnalysis package. Note that sentiment analysis requires a dictionary; the package comes with several built in.
# Analyze sentiment using analyzeSentiment()
# Quite simple and neat!
socmedia_sentiment <- analyzeSentiment(tolower(socmedia_data$Text))
socmedia_data$SentimentQDAP <- socmedia_sentiment$SentimentQDAP
This gives us a set of different estimates, each driven by a different sentiment lexicon. The letters after "Sentiment", "Positivity", or "Negativity" in each column name indicate which dictionary was used. There is no single best choice; see the help pages to learn about each one. Here, we'll use QDAP.
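If you want to see every lexicon-specific column the function produced before settling on one, you can inspect the returned data frame directly; a quick sketch:
# List all of the score columns analyzeSentiment() returned
names(socmedia_sentiment)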
# Check out the results from a given dictionary just as an object.
head(socmedia_sentiment$SentimentQDAP)
## [1] 0.5000000 -0.3333333 0.5000000 0.2500000 0.0000000 0.0000000
We can simplify the sentiment output even further if we just want to separate positive from negative posts.
# View sentiment direction (i.e. positive, neutral and negative)
sentiment_direction <- convertToDirection(socmedia_sentiment$SentimentQDAP)
head(sentiment_direction)
## [1] positive negative positive positive neutral neutral
## Levels: negative neutral positive
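To get a quick sense of how the whole data set splits across these three directions, a one-line sketch:
# Count how many posts fall into each direction
table(sentiment_direction)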
The first few rows of our social media data set come labeled as positive, negative, or neutral (probably because they were test cases for other models). We can see that these labels line up pretty well with our predictions from the sentiment analyzer.
head(data.frame(QDAP=socmedia_sentiment$SentimentQDAP,
sentiment_direction,
socmedia_data$Sentiment))
## QDAP sentiment_direction socmedia_data.Sentiment
## 1 0.5000000 positive Positive
## 2 -0.3333333 negative Negative
## 3 0.5000000 positive Positive
## 4 0.2500000 positive Positive
## 5 0.0000000 neutral Neutral
## 6 0.0000000 neutral Positive
Not a perfect match, certainly. Different dictionaries cover different words, so they produce slightly different results.
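As a rough check, we can quantify the agreement on posts whose label is one of the three basic categories. This is only a sketch: it assumes the spellings "Positive", "Negative", and "Neutral" are used consistently in the Sentiment column beyond the first few rows.
# Share of posts where the QDAP direction matches the provided label
# (restricted to posts carrying one of the three basic labels)
socmedia_data %>%
  mutate(direction = as.character(sentiment_direction)) %>%
  filter(Sentiment %in% c("Positive", "Negative", "Neutral")) %>%
  summarise(agreement = mean(direction == tolower(Sentiment)))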
We can plot or fit sentiment against other features of the data set. For example, across time:
ggplot(socmedia_data, aes(x=Year, y=SentimentQDAP)) + geom_smooth(method="loess")
## `geom_smooth()` using formula = 'y ~ x'
Sentiment here is still generally positive, but the trend is clearly shifting over time. Note that this kind of analysis could also be applied user-by-user, hashtag-by-hashtag, or platform-by-platform.
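A minimal sketch of the platform version, assuming the Platform column holds a small set of distinct labels:
# Distribution of QDAP sentiment by platform
ggplot(socmedia_data, aes(x = Platform, y = SentimentQDAP)) +
  geom_boxplot()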
Latent Dirichlet Allocation (LDA) is an iterative technique that identifies topics within a set of documents by modeling the counts of the words they contain, treated as a bag of discrete tokens. The underlying idea is that each document focuses on a limited number of topics, and each topic, in turn, draws on a limited set of words. Before applying LDA, we need the text in a document-term matrix (DTM) or, equivalently for our purposes, a document-feature matrix (DFM).
We will again use our social media data set, this time to perform LDA. As a starting point, we build a document-feature matrix:
socmedia_dfm <- socmedia_stoptoken %>%
count(ID, word) %>% # count each word used in each identified review
cast_dfm(ID, word, n) # convert to a document-feature matrix
The stm package will let us run an LDA from there.
socmedia_lda <- stm(
socmedia_dfm,
K = 3, # K is the number of topics we want the model to produce
verbose=FALSE,
seed=422 # Enforces replicability
)
socmedia_lda
## A topic model with 3 topics, 732 documents and a 2328 word dictionary.
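For a fast, model-native look at the top words in each topic, the stm package provides labelTopics(); a quick sketch (the number of words shown is just a choice here):
# Print a few high-probability words per topic
labelTopics(socmedia_lda, n = 5)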
We might want to see how many topics are really supported by our data, though. A search over different values of K can give us a good guess; the searchK() function can do that, if you are interested.
k_options <- searchK(socmedia_dfm, K=3:7, N=100, verbose=FALSE)
plot(k_options)
Here, it looks like 3 may not be the best choice: 6 has higher semantic coherence, a lower held-out likelihood, and lower residuals. See the ?searchK help page for more details. We'll continue with 3 for this tutorial, though.
Structural topic models (STM) are a more general form that also allow you to include covariates in your topic model. See Roberts et al. (2016) for more.
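As a rough illustration of that idea (not something we rely on below), here is a sketch of adding a prevalence covariate. It assumes Platform is a sensible covariate and builds a covariate table whose rows line up with the documents in the dfm; treat the exact setup as an assumption rather than a recipe.
# Sketch: let topic prevalence vary by platform
# Build one row of covariates per dfm document, in document order
# (assumes the dfm document names are the numeric post IDs)
covariates <- tibble(ID = as.integer(docnames(socmedia_dfm))) %>%
  left_join(distinct(socmedia_data, ID, Platform), by = "ID")

socmedia_stm <- stm(
  socmedia_dfm,
  K = 3,
  prevalence = ~ Platform, # topic prevalence can shift with the covariate
  data = covariates,
  verbose = FALSE,
  seed = 422
)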
We can tidy() the result to bring pieces back into a tidytext-friendly format. Here, we extract the per-topic-per-word probabilities, called β ("beta") weights, from the model.
# Extract the per-topic-per-word (beta) probabilities
lda_topics <- tidy(socmedia_lda, matrix = "beta")
This lets us rearrange things and run plots in typical tidy style.
lda_topics %>%
  arrange(topic, desc(beta)) %>% # arrange in descending order within topic
  head()
## # A tibble: 6 × 3
## topic term beta
## <int> <chr> <dbl>
## 1 1 dreams 0.00750
## 2 1 night 0.00720
## 3 1 life 0.00660
## 4 1 beauty 0.00600
## 5 1 journey 0.00529
## 6 1 emotions 0.00510
We can see that "dreams", "night", and "life" are the most probable terms for topic 1. We can get similar lists for the other topics and compare the differences.
top_terms <- lda_topics %>%
group_by(topic) %>%
slice_max(beta, n = 15) %>%
ungroup() %>%
arrange(topic, -beta)
top_terms %>%
mutate(term = reorder_within(term, beta, topic)) %>%
ggplot(aes(beta, term, fill = factor(topic))) +
geom_col(show.legend = FALSE) +
facet_wrap(~ topic, scales = "free") +
scale_y_reordered() +
theme(axis.text.x = element_text(angle = 45, hjust=1))
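The same tidy() approach can also pull out the per-document-per-topic proportions (the γ, or "gamma", matrix) if you want to see which topics dominate each post. A quick sketch:
# Extract per-document topic proportions from the same model
lda_gamma <- tidy(socmedia_lda, matrix = "gamma")
head(lda_gamma)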