Word2Vec is a word embedding technique designed to build a mathematical representation of words in a vector space. It’s a good intermediate representation if you’re interested in semantic clustering or modeling differences between words or documents, and you don’t care much about what the distances mean.
Word2Vec provides two variants: the Continuous Bag of Words (CBOW) and Skip-gram models. The CBOW model predicts a target word from its surrounding context words, making it efficient and particularly effective at representing more frequent words. Skip-gram goes the other way, predicting the context from the word (Agarwal, 2022). We’ll use CBOW embeddings for this example, with a skip-gram version sketched for comparison below.
# Load the libraries
library(tidyverse) # For string handling
library(doc2vec)   # For document vectors
library(word2vec)  # For word vectors
library(tm)        # For general text-mining utilities
We will continue using the same dataset (socmedia_data, with its Text column containing social media posts). If you haven’t read it in, here’s a way to do it:
socmedia_data <- read.csv("sentimentdataset.csv", header = TRUE) # Read the data
socmedia_data <- socmedia_data %>% # handle spaces in the labels
  mutate(across(c(Platform, Country, User, Sentiment, Hashtags),
                str_trim))
To generate the word2vec embedding, we really just need our set of text items; it’s easy to generate the model from there. We’ll also need to decide how deep an embedding we need. The tradition for this type of model is 300 dimensions, mostly for historical reasons; we’ll use 200 here, since it’s faster for the demo. iter specifies the number of training iterations, and is tricky to choose: more iterations mean a longer run but a better model. We’ll pick 100 here. If you’re using more than about 50 dimensions or have a small data set, it is often easier to grab a premade model such as the ones here.
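For instance, the word2vec package can read a downloaded pretrained model directly with read.word2vec; the file name below is just a placeholder for whatever model you download, so this sketch is not run here.
# Sketch: load a pretrained model instead of training one.
# "pretrained_model.bin" is a placeholder path, not a file that ships with this tutorial.
pretrained_model <- read.word2vec(file = "pretrained_model.bin")
For this demo, though, we’ll train our own CBOW model on the posts: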
cbow_model <- word2vec(x = tolower(socmedia_data$Text), type = "cbow", dim = 200, iter = 100)
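For comparison, the skip-gram variant mentioned earlier uses the same call with type = "skip-gram"; a quick sketch (not run here, and its neighbours will differ from the CBOW model’s):
# Skip-gram variant: same interface, but predicts the context words from the target word.
skipgram_model <- word2vec(x = tolower(socmedia_data$Text), type = "skip-gram", dim = 200, iter = 100)
predict(skipgram_model, "moment", type = "nearest", top_n = 5)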
The embeddings themselves are just vectors of numbers.
cbow_embedding <- predict(cbow_model, newdata = "moment", type = "embedding")
print(cbow_embedding[1:20])
## [1] 2.1466138 -0.9071730 -2.9476144 -0.2869101 0.7323790 -0.5696082
## [7] 0.4752676 0.4625454 0.4398554 -1.5073594 0.8115471 -0.4028049
## [13] -1.3447545 0.4367367 0.2425620 0.2854330 0.9129105 -0.3392825
## [19] 1.8070821 0.1293770
But once we have a model, we can look at specific words from the data set and either extract their embeddings or look at the words nearest to them.
cbow_lookslike <- predict(cbow_model, "moment", type = "nearest", top_n = 5)
print(cbow_lookslike)
## $moment
## term1 term2 similarity rank
## 1 moment standing 0.7390520 1
## 2 moment after 0.6605819 2
## 3 moment enjoying 0.6510534 3
## 4 moment nature 0.6414363 4
## 5 moment empathy 0.6403733 5
“Moment” has similarity with “standing”, because of waiting, and with “after”, because of time; “enjoying” and “nature” show up because there’s a lot of mindfulness in this data. The link to “empathy” is less clear.
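These similarity scores are cosine similarities between the word vectors; here is a sketch computing one by hand, which should land close to the value in the table above.
# Cosine similarity between two word vectors, computed by hand as a sanity check.
pair <- predict(cbow_model, newdata = c("moment", "standing"), type = "embedding")
sum(pair["moment", ] * pair["standing", ]) /
  (sqrt(sum(pair["moment", ]^2)) * sqrt(sum(pair["standing", ]^2)))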
We can get a rough sentence embedding by averaging the word vectors:
sentence_embedding <- predict(cbow_model, c("this", "is", "a", "moment"), type = "embedding")
sentence_centroid <- colMeans(sentence_embedding)
predict(cbow_model, sentence_centroid, type = "nearest", top_n=10)
## term similarity rank
## 1 is 0.5935047 1
## 2 this 0.5924123 2
## 3 a 0.5667257 3
## 4 vibes 0.5564931 4
## 5 enjoying 0.5343214 5
## 6 story 0.5206150 6
## 7 timeless 0.5179918 7
## 8 moment 0.4979317 8
## 9 good 0.4970452 9
## 10 level 0.4889371 10
These give us an idea of the words closest to the meaning of the sentence.
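The same averaging trick extends to whole posts. As a rough sketch (splitting on whitespace, with any words the model hasn’t seen simply dropped), we can locate the first post in the embedding space:
# Rough embedding for an entire post: split into words, embed, and average.
post_words <- strsplit(trimws(tolower(socmedia_data$Text[1])), "\\s+")[[1]]
post_vectors <- predict(cbow_model, newdata = post_words, type = "embedding")
post_centroid <- colMeans(post_vectors, na.rm = TRUE) # NA rows are out-of-vocabulary words
predict(cbow_model, post_centroid, type = "nearest", top_n = 5)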
In the same way that we embed words, we can include whole documents in the model and get approximate locations for them in the same space. Here, we might embed the social media posts themselves, so that each one has a “topic” location.
doc2vec requires a data frame with doc_id and text columns, so we do some renaming first; again we keep the number of dimensions and iterations modest for tutorial purposes.
socmedia_docs <- socmedia_data %>% mutate(doc_id = ID, text = tolower(Text))
socmedia_doc2Vec <- paragraph2vec(socmedia_docs, type = "PV-DM", dim = 200, iter = 50)
socmedia_doc2Vec
## $model
## <pointer: 0x159eafd10>
##
## $data
## $data$file
## [1] "/var/folders/k9/tgjl29r50ng97gq_s8815jpr0000gp/T//RtmpeciBzv/textspace_161636a98b558.txt"
##
## $data$n
## [1] 6442
##
## $data$n_vocabulary
## [1] 274
##
## $data$n_docs
## [1] 732
##
##
## $control
## $control$min_count
## [1] 5
##
## $control$dim
## [1] 200
##
## $control$window
## [1] 5
##
## $control$iter
## [1] 50
##
## $control$lr
## [1] 0.05
##
## $control$skipgram
## [1] FALSE
##
## $control$hs
## [1] 0
##
## $control$negative
## [1] 5
##
## $control$sample
## [1] 0.001
##
##
## attr(,"class")
## [1] "paragraph2vec_trained"
Following the same process, we can get the closest words or posts to an example:
predict(socmedia_doc2Vec, newdata = "moment", type = "nearest", top_n=5, which="word2word")
## [[1]]
## term1 term2 similarity rank
## 1 moment standing 0.6422859 1
## 2 moment loved 0.6232980 2
## 3 moment up 0.5721361 3
## 4 moment cup 0.5621931 4
## 5 moment not 0.5616020 5
Notice that the nearest words are now different. We can also find the documents closest to a word:
predict(socmedia_doc2Vec, newdata = "for", type = "nearest", top_n=10, which="word2doc")
## [[1]]
## term1 term2 similarity rank
## 1 for 143 0.5621778 1
## 2 for 540 0.4493423 2
## 3 for 721 0.4390456 3
## 4 for 59 0.4300495 4
## 5 for 75 0.4293440 5
## 6 for 156 0.4143451 6
## 7 for 710 0.4125738 7
## 8 for 675 0.4091693 8
## 9 for 34 0.4068425 9
## 10 for 591 0.4007642 10
print(socmedia_docs$Text[141])
## [1] " Gratitude for the supportive community around me. "
print(socmedia_docs$Text[536])
## [1] "At the Oscars, the actor graciously accepts an award, radiating joy and gratitude for the recognition of their outstanding performance. "