PART 2

Load necessary libraries again

Once again, we need to load in all the libraries we’ll be using.

library(tidyverse)
library(tidytext)             # Tidy text mining
library(textstem)             # Lemmatization and stemming 
library(quanteda)             # DTMs and DFMs
library(quanteda.textstats)   # For distances
library(SentimentAnalysis)    # For Sentiment Analysis
library(stm)                  # For LDA
library(tm)                   # DTM/TDM
library(broom)                # For tidy()ing

Setup:

If you haven’t run the part 1 document, we’ll run the pre-processing steps really quickly here:

socmedia_data <- read.csv("sentimentdataset.csv", header = TRUE)  # Read the data
socmedia_data <- socmedia_data %>%  # handle spaces in the labels
                  mutate(across(c(Platform, Country, User, Sentiment, Hashtags),
                                str_trim))

# Tokenize and remove stopwords
socmedia_stoptoken <- socmedia_data %>%
  unnest_tokens(word, Text) %>%
  filter(!word %in% stop_words$word)

# Additionally, make a DTM with posts as documents
socmedia_dtm <- socmedia_stoptoken %>% 
                mutate(word = lemmatize_words(word)) %>% # lemmatize
                count(ID, word, sort=FALSE) %>% 
                arrange(ID) %>%
                cast_dtm(ID, word, n)

Sentiment Analysis

In a bag-of-words framework, sentiment analysis compares each term in a document to a sentiment dictionary and sums the sentiment values of the terms that match.

Most often, this is done with “positive”, “negative”, and “neutral” as the possible sentiment categories, rather than the many tags that we have in the social media data set. We can do this quickly using analyzeSentiment() from the SentimentAnalysis package.

Note that sentiment requires a dictionary. The package comes with several.
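
To make the dictionary idea concrete, here is a minimal base R sketch. The word lists are hypothetical (not a real lexicon), and the scoring rule (positive matches minus negative matches, divided by the word count) is an assumption chosen for illustration; the package's own preprocessing and rules may differ.

```r
# Toy dictionary-based scoring in base R (illustrative only).
# These word lists are hypothetical, not a real sentiment lexicon.
positive_words <- c("love", "great", "happy")
negative_words <- c("hate", "awful", "sad")

score_sentiment <- function(text) {
  # Lowercase, then split on anything that is not a letter or apostrophe
  words <- strsplit(tolower(text), "[^a-z']+")[[1]]
  words <- words[words != ""]
  if (length(words) == 0) return(0)  # empty text gets a neutral score
  n_pos <- sum(words %in% positive_words)
  n_neg <- sum(words %in% negative_words)
  # Net sentiment as a share of all words in the text
  (n_pos - n_neg) / length(words)
}

toy_posts <- c("I love this great day",
               "What an awful, sad result",
               "The train arrives at noon")
toy_scores <- vapply(toy_posts, score_sentiment, numeric(1), USE.NAMES = FALSE)
toy_scores
```

Swapping in a different pair of word lists changes the scores, which is why the choice of dictionary matters.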

# Analyze sentiment using analyzeSentiment()
# Quite simple and neat! 
socmedia_sentiment <- analyzeSentiment(tolower(socmedia_data$Text))
socmedia_data$SentimentQDAP <- socmedia_sentiment$SentimentQDAP

This gives us a set of different estimates, each driven by a different sentiment lexicon. The letters after “Sentiment” or “Positivity” or “Negativity” indicate which dictionary was used. There is no clear best. See the help pages to learn about each one. Here, we’ll use QDAP.

# Check out the results from a given dictionary just as an object.
head(socmedia_sentiment$SentimentQDAP)
## [1]  0.5000000 -0.3333333  0.5000000  0.2500000  0.0000000  0.0000000

We can simplify the sentiment model even further if we just want to separate positive from negative posts.

# View sentiment direction (i.e. positive, neutral and negative)
sentiment_direction<- convertToDirection(socmedia_sentiment$SentimentQDAP)
head(sentiment_direction)
## [1] positive negative positive positive neutral  neutral 
## Levels: negative neutral positive
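
In the output above, the direction follows the sign of the score. A minimal base R sketch of that rule, using the six scores printed earlier; this is an assumption about the behaviour inferred from the output, not the package's implementation.

```r
# Scores copied from the head() output shown earlier
qdap_scores <- c(0.5, -0.3333333, 0.5, 0.25, 0, 0)

# Map the sign of each score (-1, 0, 1) to a labelled factor
manual_direction <- factor(sign(qdap_scores),
                           levels = c(-1, 0, 1),
                           labels = c("negative", "neutral", "positive"))
manual_direction
```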

The first few rows of our social media data set have apparently already been labelled as positive, negative, or neutral (probably because they are tests for other models). These labels line up pretty well with our predictions from the sentiment analyzer.

head(data.frame(QDAP=socmedia_sentiment$SentimentQDAP, 
                sentiment_direction,
                socmedia_data$Sentiment))
##         QDAP sentiment_direction socmedia_data.Sentiment
## 1  0.5000000            positive                Positive
## 2 -0.3333333            negative                Negative
## 3  0.5000000            positive                Positive
## 4  0.2500000            positive                Positive
## 5  0.0000000             neutral                 Neutral
## 6  0.0000000             neutral                Positive

Not a perfect match, certainly: row 6 is scored neutral by QDAP but labelled Positive in the data. Different dictionaries (and different labelling methods) rely on different words, so slightly different results are to be expected.
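
One way to put a number on "pretty well" is to cross-tabulate the two columns. The sketch below hard-codes the six rows printed above so that it runs on its own. On the full data set the Sentiment column contains many more tags, so the table would have many more columns.

```r
# Values copied from the six rows printed above
predicted <- c("positive", "negative", "positive", "positive", "neutral", "neutral")
labelled  <- c("Positive", "Negative", "Positive", "Positive", "Neutral", "Positive")

# Cross-tabulate predictions against the data set's own labels
agreement_table <- table(predicted = predicted, labelled = tolower(labelled))
agreement_table

# Share of posts where prediction and label agree
agreement_rate <- mean(predicted == tolower(labelled))
agreement_rate
```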

Visualization

We can plot or fit sentiment against other features of the data set. For example, across time:

ggplot(socmedia_data, aes(x=Year, y=SentimentQDAP)) + geom_smooth(method="loess")
## `geom_smooth()` using formula = 'y ~ x'

Sentiment here is still generally positive. But that sure is changing. Note that this kind of analysis could also be applied user-by-user or hashtag-by-hashtag.
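
As a rough sketch of a grouped version, the code below averages the score within each level of a grouping column. The data frame is a made-up stand-in for socmedia_data, and Platform is used as the grouping column; the same pattern would apply to User.

```r
# Hypothetical stand-in for socmedia_data; the values are made up
toy_data <- data.frame(
  Platform      = c("Twitter", "Instagram", "Twitter", "Facebook", "Instagram"),
  SentimentQDAP = c(0.5, 0.25, -0.25, 0, 0.75)
)

# Mean sentiment score within each platform
platform_sentiment <- aggregate(SentimentQDAP ~ Platform, data = toy_data, FUN = mean)
platform_sentiment
```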

Topic Modelling via Latent Dirichlet Allocation (LDA)

Before applying LDA, it is necessary to prepare a Document Term Matrix (DTM).

Latent Dirichlet Allocation (LDA) is an iterative technique that identifies topics within a set of documents by analyzing the frequency of words, which are represented in discrete form. The underlying concept of LDA is that documents typically focus on a limited number of topics, and similarly, these topics are generally comprised of a limited set of words.
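
It can help to see the generative story that LDA assumes: each document draws its own mix of topics, and each word is drawn from one of those topics. The base R sketch below simulates one toy document that way. The vocabulary, topic names, probabilities, and the alpha value are all hypothetical; fitting a model works in the opposite direction, recovering such distributions from observed counts.

```r
set.seed(1)  # arbitrary seed so the sketch is repeatable

vocab <- c("beach", "sun", "exam", "study", "goal", "match")

# Hypothetical topics: each row is a probability distribution over the vocabulary
topic_word <- rbind(
  holiday = c(0.45, 0.45, 0.02, 0.02, 0.03, 0.03),
  school  = c(0.02, 0.03, 0.45, 0.45, 0.02, 0.03),
  sport   = c(0.03, 0.02, 0.03, 0.02, 0.45, 0.45)
)
colnames(topic_word) <- vocab

# Draw once from a Dirichlet by normalising independent gamma draws
rdirichlet_one <- function(alpha) {
  g <- rgamma(length(alpha), shape = alpha)
  g / sum(g)
}

# A small alpha tends to concentrate a document on few topics
doc_topics <- rdirichlet_one(rep(0.3, nrow(topic_word)))

# Generate a 10-word document: pick a topic per word, then a word from that topic
word_topic <- sample(nrow(topic_word), size = 10, replace = TRUE, prob = doc_topics)
toy_doc <- vapply(word_topic,
                  function(k) sample(vocab, size = 1, prob = topic_word[k, ]),
                  character(1))
toy_doc
```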

We will again use our social media data set to perform LDA. We can use a document-feature matrix as the starting point for the stm() function.

socmedia_dfm <- socmedia_stoptoken %>%
    count(ID, word) %>%  # count each word used in each post
    cast_dfm(ID, word, n)  # convert to a document-feature matrix
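
For intuition, a document-feature matrix is just a table of counts with one row per document and one column per word. A minimal base R sketch on a made-up token table (the IDs and words are hypothetical):

```r
# Hypothetical token table: one row per (post, word) occurrence
toy_tokens <- data.frame(
  ID   = c(1, 1, 1, 2, 2, 3),
  word = c("sunny", "beach", "sunny", "exam", "stress", "beach")
)

# Cross-tabulate: rows are posts, columns are words, cells are counts
toy_counts <- table(ID = toy_tokens$ID, word = toy_tokens$word)
toy_counts
```

Real matrices of this kind are mostly zeros, which is why packages such as quanteda store them in a sparse format.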

The stm package will let us run an LDA from there.

socmedia_lda <- stm(
  socmedia_dfm,
  K = 3, # K is the number of topics we want the model to produce
  verbose=FALSE,
  seed=422   # Enforces replicability
)

socmedia_lda
## A topic model with 3 topics, 732 documents and a 2328 word dictionary.

We might want to see how many topics are really supported by our data, though. A search through different K can give us a good guess; the searchK() function can do that, if you are interested.

k_options <- searchK(socmedia_dfm, K=3:7, N=100, verbose=FALSE)
plot(k_options)

Here, it looks like 3 may not be the best choice: 6 has more semantic coherence and lower residuals. Its held-out likelihood is also lower, however, and for that measure higher is better, so the diagnostics do not all agree. See ?searchK for more details. We’ll continue with 3 for this tutorial, though.

Structural topic models (STMs) are a more general form that also lets you set covariates on your topic model. See Roberts et al. (2016) for more.

Examining the topics themselves

We can tidy() the result to bring pieces back into a tidytext-friendly format. Here, we extract the per-topic-per-word probabilities, called β (“beta”) weights, from the model.
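
To see what the beta weights represent, the sketch below builds a small hypothetical beta matrix by hand (2 topics, 4 terms, made-up values) and reshapes it into the same long layout: one row per topic-term pair. It does not use the fitted model.

```r
# Hypothetical beta matrix: 2 topics (rows) by 4 terms (columns)
toy_beta <- rbind(
  c(0.50, 0.30, 0.15, 0.05),
  c(0.10, 0.10, 0.30, 0.50)
)
colnames(toy_beta) <- c("dreams", "night", "work", "coffee")

# Each topic is a probability distribution over terms, so each row sums to 1
rowSums(toy_beta)

# Long format: one row per topic-term pair (as.vector() reads column by column)
toy_long <- data.frame(
  topic = rep(seq_len(nrow(toy_beta)), times = ncol(toy_beta)),
  term  = rep(colnames(toy_beta), each = nrow(toy_beta)),
  beta  = as.vector(toy_beta)
)

# Most probable terms first within each topic
toy_long[order(toy_long$topic, -toy_long$beta), ]
```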

#Extract the topics 
lda_topics <- tidy(socmedia_lda, matrix = "beta") 

This lets us rearrange things and run plots in typical tidy style.

lda_topics %>%
  arrange(topic, desc(beta)) %>%
  #arrange in descending order within topic
  head()
## # A tibble: 6 × 3
##   topic term        beta
##   <int> <chr>      <dbl>
## 1     1 dreams   0.00750
## 2     1 night    0.00720
## 3     1 life     0.00660
## 4     1 beauty   0.00600
## 5     1 journey  0.00529
## 6     1 emotions 0.00510

We can see that “dreams”, “night” and “life” are the most probable terms for topic 1. We can build a similar list for every topic and plot them side by side to look at the differences.

top_terms <- lda_topics %>%
  group_by(topic) %>%
  slice_max(beta, n = 15) %>% 
  ungroup() %>%
  arrange(topic, -beta)

top_terms %>%
  mutate(term = reorder_within(term, beta, topic)) %>%
  ggplot(aes(beta, term, fill = factor(topic))) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~ topic, scales = "free") +
  scale_y_reordered() +
  theme(axis.text.x = element_text(angle = 45, hjust=1))