Vector Representation of Text – Word Embeddings with word2vec

Computers cannot work with raw text directly. We need to convert text into numerical vectors before any kind of text analysis, such as text clustering or classification. The classical, well-known model is bag of words (BOW). With this model we have one dimension for each unique word in the vocabulary. We represent a document as a vector of 0s and 1s: a dimension is 1 if the corresponding vocabulary word occurs in the document and 0 otherwise.
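
To make the BOW idea concrete, here is a minimal sketch in plain Python. It uses two of the sentences from this post as documents; the names docs, vocabulary and bow_vector are my own and chosen only for illustration.

```python
# Minimal sketch of a binary bag-of-words encoding (illustrative names only).
docs = [
    ['this', 'is', 'another', 'book'],
    ['one', 'more', 'book'],
]

# One dimension per unique word; sorting makes the column order reproducible.
vocabulary = sorted({word for doc in docs for word in doc})


def bow_vector(doc, vocabulary):
    """Return a 0/1 vector: 1 if the vocabulary word occurs in the document."""
    present = set(doc)
    return [1 if word in present else 0 for word in vocabulary]


bow = [bow_vector(doc, vocabulary) for doc in docs]

print(vocabulary)
for row in bow:
    print(row)
```

Each vector has as many dimensions as there are unique words, so the vectors grow with the vocabulary.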

Recently, word embedding models have gained popularity in machine learning because they preserve semantic information. Word embeddings also give lower dimensionality than the BOW model. Several such models, for example GloVe and word2vec, are used in machine learning text analysis.

Many examples on the web show how to work with word embeddings at the word level, but in most cases we are working at the document level (sentence, paragraph or document). To understand how embeddings can be used for text analytics, I decided to take word2vec and create a small practical example.

In this post you will learn how to use the word2vec word embedding method to convert a sentence into a numerical vector. The same technique can be used for text with more than one sentence. We will create a Python script that converts sentences into numerical vectors.

Input

As input for this script we will use sentences that are hard-coded in the script. The sentences are already tokenized. Below you can find the input sentences. Note that sentences 6 and 7 are on a different topic from the other sentences.

1 [['this', 'is', 'the', 'good', 'machine', 'learning', 'book'],
2 ['this', 'is',  'another', 'book'],
3 ['one', 'more', 'book'],
4 ['this', 'is', 'the', 'new', 'post'],
5 ['this', 'is', 'about', 'machine', 'learning', 'post'], 
6 ['orange', 'juice', 'is', 'the', 'liquid', 'extract', 'of', 'the', 'fruit'],
7 ['orange', 'juice', 'comes', 'in', 'several', 'different', 'varieties'],
8 ['and', 'this', 'is', 'the', 'last', 'post']]

With word2vec you have two options:
1. Train your own word2vec model on your own text
2. Use the pretrained embeddings from Google

From word to sentence

Each word in a word embedding model is represented by a vector. But let’s say we are working with tweets from Twitter and need to know how similar or dissimilar the tweets are. For that we need a vector representation of the whole text of a tweet. To achieve this we can average the word embeddings of all the words in the sentence (or tweet, or paragraph). The idea comes from paper [1], in which the authors averaged word embeddings to get a paragraph vector.
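
The averaging step can be sketched in a few lines of plain Python. The 3-dimensional vectors below are invented for illustration and do not come from any trained model; the helper name average_vector is also my own.

```python
# Toy 3-dimensional embeddings; the values are invented for illustration only.
toy_embeddings = {
    'orange': [1.0, 0.0, 2.0],
    'juice': [3.0, 2.0, 0.0],
}


def average_vector(tokens, embeddings, dim):
    """Average the vectors of the tokens that have an embedding."""
    total = [0.0] * dim
    count = 0
    for token in tokens:
        vec = embeddings.get(token)
        if vec is None:
            # Skip words that are not in the embedding vocabulary.
            continue
        total = [t + v for t, v in zip(total, vec)]
        count += 1
    if count == 0:
        # No known words: return the zero vector instead of dividing by zero.
        return total
    return [t / count for t in total]


sentence_vec = average_vector(['orange', 'juice', 'unknownword'], toy_embeddings, dim=3)
print(sentence_vec)  # [2.0, 1.0, 1.0]
```

Words without an embedding are simply skipped here, which is one possible design choice; the listings at the end of this post handle missing words the same way.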

Source code for conversion


Listing A and Listing B below show how we can average word embeddings and get numerical vectors.
Listing A has the Python source code for using our own word embeddings.
Listing B has the Python source code for using word embeddings from Google.
The script in Listing B loads the embeddings from a local file that was downloaded from Google beforehand. You can find more details on downloading word embeddings from Google in the post Using Pretrained Word Embeddings in Machine Learning.

When averaging the embeddings from Google in Listing B, I used only the first 50 dimensions. This is the minimal number that was used in one of the papers. The recommendation is to use between 100 and 400 dimensions.

Analysis of Results

How do we know that our results are good? Here we will do a quick check as follows. We will calculate a similarity measure between the vectors and compare it with our expectation. If sentences belong to different contexts we expect the similarity between them to be low, and if sentences are close in meaning we expect it to be high. Because the context of sentences 6 and 7 is different from the other sentences, we would expect to see this difference in the results.

To measure similarity the script uses the cosine measure. With the cosine measure, the most similar pair is the one with the highest cosine value. The results are below.
Note that a 0 value means the cosine was not calculated because there is no need to: the value is either on the diagonal or was already calculated for the mirrored pair (for example, doc21 = doc12).
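
As a rough illustration of the cosine measure and of why only the upper triangle is filled, here is a small sketch in plain Python. The vectors are toy values and the function names are my own, not part of the listings.

```python
import math


def cosine(u, v):
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def upper_triangle_similarities(vectors):
    """Fill only the cells above the diagonal; all other cells stay 0."""
    n = len(vectors)
    results = [[0] * n for _ in range(n)]
    for i in range(n - 1):
        for j in range(i + 1, n):
            results[i][j] = cosine(vectors[i], vectors[j])
    return results


# Toy vectors: the first two point in the same direction, the third is orthogonal.
toy_vectors = [[1.0, 0.0], [2.0, 0.0], [0.0, 3.0]]
sims = upper_triangle_similarities(toy_vectors)
for row in sims:
    print(row)
```

Because the cosine is symmetric, the cell for the pair (2, 1) would repeat the cell for (1, 2), so skipping it loses no information.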

Results from Listing A (using own word embeddings):
 1   2    3    4    5    6    7    8
1[0, 0.5, 0.1, 0.5, 0.6, 0.4, 0.2, 0.4],
2[0, 0,   0.2, 0.6, 0.5, 0.2, 0.1, 0.5],
3[0, 0,   0,   0.0, 0.0, 0.0, 0.1, 0.0],
4[0, 0,   0,   0,   0.6, 0.5, 0.3, 0.7],
5[0, 0,   0,   0,   0,   0.2, 0.2, 0.6],
6[0, 0,   0,   0,   0,   0,   0.4, 0.4], 
7[0, 0,   0,   0,   0,   0,   0,   0.3], 
8[0, 0,   0,   0,   0,   0,   0,   0]


Results from Listing B (using pretrained dataset):
  1  2     3     4     5     6     7     8
1[0, 0.77, 0.33, 0.57, 0.78, 0.35, 0.37, 0.55],
2[0, 0,    0.60, 0.62, 0.51, 0.31, 0.29, 0.59],
3[0, 0,    0,    0.16, 0.12, 0.18, 0.25, 0.11], 
4[0, 0,    0,    0,    0.62, 0.41, 0.37, 0.89],
5[0, 0,    0,    0,    0,    0.35, 0.27, 0.61], 
6[0, 0,    0,    0,    0,    0,    0.81, 0.37], 
7[0, 0,    0,    0,    0,    0,    0,    0.32],
8[0, 0,    0,    0,    0,    0,    0,    0]]

Looking at the results we can see that our expectations are confirmed, especially where pretrained word embeddings were used. Sentences 6 and 7 have low similarity with the other sentences but a high similarity of 0.81 with each other.
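
This reading of the matrix can also be checked with a few lines of code. The sketch below copies the Listing B values reported above and looks up the closest neighbour of a sentence; the helper names similarity and most_similar are my own.

```python
# Upper-triangular cosine values copied from the Listing B results above.
results_b = [
    [0, 0.77, 0.33, 0.57, 0.78, 0.35, 0.37, 0.55],
    [0, 0,    0.60, 0.62, 0.51, 0.31, 0.29, 0.59],
    [0, 0,    0,    0.16, 0.12, 0.18, 0.25, 0.11],
    [0, 0,    0,    0,    0.62, 0.41, 0.37, 0.89],
    [0, 0,    0,    0,    0,    0.35, 0.27, 0.61],
    [0, 0,    0,    0,    0,    0,    0.81, 0.37],
    [0, 0,    0,    0,    0,    0,    0,    0.32],
    [0, 0,    0,    0,    0,    0,    0,    0],
]


def similarity(results, i, j):
    """Look up the value for a pair regardless of order (0-based indices)."""
    return results[min(i, j)][max(i, j)]


def most_similar(results, i):
    """Return (sentence number, cosine) of the closest other sentence, 1-based."""
    others = [j for j in range(len(results)) if j != i]
    best = max(others, key=lambda j: similarity(results, i, j))
    return best + 1, similarity(results, i, best)


for number in (6, 7):
    print(number, most_similar(results_b, number - 1))
```

For sentence 6 the closest neighbour is sentence 7, and for sentence 7 it is sentence 6, both with 0.81.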

Conclusion

In this post we looked at how to represent a document (sentence, paragraph) as a vector of numbers using the word2vec word embedding model. We looked at two possible ways: using our own embeddings and using embeddings from Google. We got results for our small example and were able to evaluate them.

Now we can feed the vector representation of text into machine learning text analysis algorithms.

Here are a few posts that show how to feed word2vec word embeddings into text clustering algorithms such as k-means from the NLTK and sklearn libraries, and how to plot the data with t-SNE:
K Means Clustering Example with Word2Vec in Data Mining or Machine Learning
Text Clustering with Word Embedding in Machine Learning

Below are a few links on other word embedding models that are also widely used:
GloVe: How to Convert Word to Vector with GloVe and Python
fastText: FastText Word Embeddings

I hope you enjoyed this post about representing text as a vector using word2vec. If you have any tips or anything else to add, please leave a comment in the reply box.

Listing A. Here is the Python source code for using our own word embeddings.

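import numpy as np
from numpy import dot
from numpy.linalg import norm
from gensim.models import Word2Vec

sentences = [['this', 'is', 'the', 'good', 'machine', 'learning', 'book'],
             ['this', 'is', 'another', 'book'],
             ['one', 'more', 'book'],
             ['this', 'is', 'the', 'new', 'post'],
             ['this', 'is', 'about', 'machine', 'learning', 'post'],
             ['orange', 'juice', 'is', 'the', 'liquid', 'extract', 'of', 'the', 'fruit'],
             ['orange', 'juice', 'comes', 'in', 'several', 'different', 'varieties'],
             ['and', 'this', 'is', 'the', 'last', 'post']]

# Train a small word2vec model on the sentences above.
# Assumes the gensim 4.x API (vector_size, model.wv); older gensim releases
# used size=100, model.vocab and model[w] instead.
model = Word2Vec(sentences, min_count=1, vector_size=100)
vocab = model.wv.key_to_index.keys()
wordsInVocab = len(vocab)
print(model.wv.similarity('post', 'book'))


def sent_vectorizer(sent, model):
    """Sum the word vectors of a sentence and scale the sum to unit length.

    The unit-length sum points in the same direction as the average, so the
    cosine between two sentences is the same as with plain averaging.
    """
    sent_vec = np.zeros(100)
    for w in sent:
        try:
            sent_vec = np.add(sent_vec, model.wv[w])
        except KeyError:
            # The word is not in the vocabulary: skip it.
            pass
    length = norm(sent_vec)
    if length == 0:
        # No known words: return the zero vector instead of dividing by zero.
        return sent_vec
    return sent_vec / length


V = []
for sentence in sentences:
    V.append(sent_vectorizer(sentence, model))

# Only the cells above the diagonal are filled; the rest stay 0.
results = [[0 for i in range(len(V))] for j in range(len(V))]

for i in range(len(V) - 1):
    for j in range(i + 1, len(V)):
        denom = norm(V[i]) * norm(V[j])
        results[i][j] = dot(V[i], V[j]) / denom if denom else 0.0

print(results)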

Listing B. Here is the Python source code for using word embeddings from Google.

import numpy as np
from numpy.linalg import norm
from gensim.models import KeyedVectors

# Load the pretrained Google News vectors from a local file.
# Assumes a current gensim release (4.x); older releases exposed this call as
# gensim.models.Word2Vec.load_word2vec_format and used model.vocab.
model = KeyedVectors.load_word2vec_format('C:\\Users\\Downloads\\GoogleNews-vectors-negative300.bin', binary=True)

sentences = [['this', 'is', 'the', 'good', 'machine', 'learning', 'book'],
             ['this', 'is', 'another', 'book'],
             ['one', 'more', 'book'],
             ['this', 'is', 'the', 'new', 'post'],
             ['this', 'is', 'about', 'machine', 'learning', 'post'],
             ['orange', 'juice', 'is', 'the', 'liquid', 'extract', 'of', 'the', 'fruit'],
             ['orange', 'juice', 'comes', 'in', 'several', 'different', 'varieties'],
             ['and', 'this', 'is', 'the', 'last', 'post']]

vocab = model.key_to_index.keys()
wordsInVocab = len(vocab)

# Number of leading dimensions (out of 300) kept from each word vector.
DIM = 50


def sent_vectorizer(sent, model):
    """Sum the first DIM dimensions of each word vector and scale to unit length."""
    sent_vec = np.zeros(DIM)
    for w in sent:
        try:
            vc = model[w]
            vc = vc[0:DIM]
            sent_vec = np.add(sent_vec, vc)
        except KeyError:
            # The word is not in the pretrained vocabulary: skip it.
            pass
    length = norm(sent_vec)
    if length == 0:
        # No known words: return the zero vector instead of dividing by zero.
        return sent_vec
    return sent_vec / length


V = []
for sentence in sentences:
    V.append(sent_vectorizer(sentence, model))

# Only the cells above the diagonal are filled; the rest stay 0.
results = [[0 for i in range(len(V))] for j in range(len(V))]

for i in range(len(V) - 1):
    for j in range(i + 1, len(V)):
        # Cosine similarity: dot product divided by the product of the norms.
        denom = norm(V[i]) * norm(V[j])
        results[i][j] = np.dot(V[i], V[j]) / denom if denom else 0.0

print(results)

References
1. Document Embedding with Paragraph Vectors
