Word2Vec is a word embedding technique designed to build a mathematical representation of words in a vector space. It’s a good intermediate representation if you’re interested in semantic clustering or modeling differences between words or documents, and you don’t care much about what the distances mean.
Word2Vec provides two variants: the Continuous Bag of Words (CBOW) and Skip-gram models. The CBOW model predicts a target word from its surrounding context words, making it efficient and particularly effective at representing more frequent words. Skip-gram goes the other way, predicting the context from the word (Agarwal, 2022). We’ll use CBOW embeddings for this example, with a skip-gram version sketched for comparison below.
# Load the libraries
library(tidyverse) # For string handling
library(doc2vec)   # For document vectors
library(word2vec)  # For word vectors
library(tm)        # For general text-mining utilities
We will continue using the same dataset (socmedia_data, with its Text column containing social media posts). If you haven’t read it in, here’s a way to do it:
socmedia_data <- read.csv("sentimentdataset.csv", header = TRUE) # Read the data
socmedia_data <- socmedia_data %>% # handle spaces in the labels
  mutate(across(c(Platform, Country, User, Sentiment, Hashtags),
                str_trim))
To generate the word2vec embedding, we really just need our set of text items; it’s easy to generate the model from there. We’ll also need to decide how deep an embedding we need. The tradition for this type of model is 300 dimensions, mostly for historical reasons; we’ll use 200 here, since it’s faster for the demo. iter specifies the number of training iterations, and is tricky to choose: more iterations mean a longer run but a better model. We’ll pick 100 here. If you’re using more than about 50 dimensions or have a small data set, it is often easier to grab a premade model such as the ones here.
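For instance, the word2vec package can read a downloaded pretrained model directly with read.word2vec; the file name below is just a placeholder for whatever model you download, so this sketch is not run here.
# Sketch: load a pretrained model instead of training one.
# "pretrained_model.bin" is a placeholder path, not a file that ships with this tutorial.
pretrained_model <- read.word2vec(file = "pretrained_model.bin")
For this demo, though, we’ll train our own CBOW model on the posts: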
cbow_model <- word2vec(x = tolower(socmedia_data$Text), type = "cbow", dim = 200, iter = 100)
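For comparison, the skip-gram variant mentioned earlier uses the same call with type = "skip-gram"; a quick sketch (not run here, and its neighbours will differ from the CBOW model’s):
# Skip-gram variant: same interface, but predicts the context words from the target word.
skipgram_model <- word2vec(x = tolower(socmedia_data$Text), type = "skip-gram", dim = 200, iter = 100)
predict(skipgram_model, "moment", type = "nearest", top_n = 5)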
The embeddings themselves are just vectors of numbers.
cbow_embedding <- predict(cbow_model, newdata = "moment", type = "embedding")
print(cbow_embedding[1:20])
## [1] 2.1466138 -0.9071730 -2.9476144 -0.2869101 0.7323790 -0.5696082
## [7] 0.4752676 0.4625454 0.4398554 -1.5073594 0.8115471 -0.4028049
## [13] -1.3447545 0.4367367 0.2425620 0.2854330 0.9129105 -0.3392825
## [19] 1.8070821 0.1293770
But once we have a model, we can look at specific words from the data set and either extract their embeddings or look at the words nearest to them.
cbow_lookslike <- predict(cbow_model, "moment", type = "nearest", top_n = 5)
print(cbow_lookslike)
## $moment
## term1 term2 similarity rank
## 1 moment standing 0.7390520 1
## 2 moment after 0.6605819 2
## 3 moment enjoying 0.6510534 3
## 4 moment nature 0.6414363 4
## 5 moment empathy 0.6403733 5
“Moment” has similarity with “standing”, because of waiting, and with “after”, because of time; “enjoying” and “nature” show up because there’s a lot of mindfulness in this data. The link to “empathy” is less clear.
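These similarity scores are cosine similarities between the word vectors; here is a sketch computing one by hand, which should land close to the value in the table above.
# Cosine similarity between two word vectors, computed by hand as a sanity check.
pair <- predict(cbow_model, newdata = c("moment", "standing"), type = "embedding")
sum(pair["moment", ] * pair["standing", ]) /
  (sqrt(sum(pair["moment", ]^2)) * sqrt(sum(pair["standing", ]^2)))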
We can get a rough sentence embedding by averaging the word vectors:
sentence_embedding <- predict(cbow_model, c("this", "is", "a", "moment"), type = "embedding")
sentence_centroid <- colMeans(sentence_embedding)
predict(cbow_model, sentence_centroid, type = "nearest", top_n=10)
## term similarity rank
## 1 is 0.5935047 1
## 2 this 0.5924123 2
## 3 a 0.5667257 3
## 4 vibes 0.5564931 4
## 5 enjoying 0.5343214 5
## 6 story 0.5206150 6
## 7 timeless 0.5179918 7
## 8 moment 0.4979317 8
## 9 good 0.4970452 9
## 10 level 0.4889371 10
These give us an idea of the words closest to the meaning of the sentence.
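The same averaging trick extends to whole posts. As a rough sketch (splitting on whitespace, with any words the model hasn’t seen simply dropped), we can locate the first post in the embedding space:
# Rough embedding for an entire post: split into words, embed, and average.
post_words <- strsplit(trimws(tolower(socmedia_data$Text[1])), "\\s+")[[1]]
post_vectors <- predict(cbow_model, newdata = post_words, type = "embedding")
post_centroid <- colMeans(post_vectors, na.rm = TRUE) # NA rows are out-of-vocabulary words
predict(cbow_model, post_centroid, type = "nearest", top_n = 5)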
In the same way that we embed words, we can include whole documents in the model and get approximate locations for them in the same space. Here, we might embed the social media posts themselves, so that each one has a “topic” location.
doc2vec requires a data frame with doc_id and text columns, so we do some renaming first; again we keep the number of dimensions and iterations modest for tutorial purposes.
socmedia_docs <- socmedia_data %>% mutate(doc_id = ID, text = tolower(Text))
socmedia_doc2Vec <- paragraph2vec(socmedia_docs, type = "PV-DM", dim = 200, iter = 50)
socmedia_doc2Vec
## $model
## <pointer: 0x159eafd10>
##
## $data
## $data$file
## [1] "/var/folders/k9/tgjl29r50ng97gq_s8815jpr0000gp/T//RtmpeciBzv/textspace_161636a98b558.txt"
##
## $data$n
## [1] 6442
##
## $data$n_vocabulary
## [1] 274
##
## $data$n_docs
## [1] 732
##
##
## $control
## $control$min_count
## [1] 5
##
## $control$dim
## [1] 200
##
## $control$window
## [1] 5
##
## $control$iter
## [1] 50
##
## $control$lr
## [1] 0.05
##
## $control$skipgram
## [1] FALSE
##
## $control$hs
## [1] 0
##
## $control$negative
## [1] 5
##
## $control$sample
## [1] 0.001
##
##
## attr(,"class")
## [1] "paragraph2vec_trained"
Following the same process, we can get the closest words or posts to an example:
predict(socmedia_doc2Vec, newdata = "moment", type = "nearest", top_n=5, which="word2word")
## [[1]]
## term1 term2 similarity rank
## 1 moment standing 0.6422859 1
## 2 moment loved 0.6232980 2
## 3 moment up 0.5721361 3
## 4 moment cup 0.5621931 4
## 5 moment not 0.5616020 5
Notice that the nearest words are now different. We can also find the documents closest to a word:
predict(socmedia_doc2Vec, newdata = "for", type = "nearest", top_n=10, which="word2doc")
## [[1]]
## term1 term2 similarity rank
## 1 for 143 0.5621778 1
## 2 for 540 0.4493423 2
## 3 for 721 0.4390456 3
## 4 for 59 0.4300495 4
## 5 for 75 0.4293440 5
## 6 for 156 0.4143451 6
## 7 for 710 0.4125738 7
## 8 for 675 0.4091693 8
## 9 for 34 0.4068425 9
## 10 for 591 0.4007642 10
print(socmedia_docs$Text[141])
## [1] " Gratitude for the supportive community around me. "
print(socmedia_docs$Text[536])
## [1] "At the Oscars, the actor graciously accepts an award, radiating joy and gratitude for the recognition of their outstanding performance. "