Text Clustering with Word Embedding in Machine Learning


Text clustering is widely used in many applications such as recommender systems, sentiment analysis, topic selection, user segmentation. Word embeddings (for example word2vec) allow to exploit ordering
of the words and semantics information from the text corpus. In this blog you can find several posts dedicated different word embedding models:

GloVe –
How to Convert Word to Vector with GloVe and Python
fastText –
FastText Word Embeddings
word2vec –
Vector Representation of Text – Word Embeddings with word2vec
word2vec application –
Text Analytics Techniques with Embeddings
Using Pretrained Word Embeddinigs in Machine Learning
K Means Clustering Example with Word2Vec in Data Mining or Machine Learning

In contrast to last post from the above list, in this post we will discover how to do text clustering with word embeddings at sentence (phrase) level. The sentence could be a few words, phrase or paragraph like tweet. For examples we have 1000 of tweets and want to group in several clusters. So each cluster would contain one or more tweets.

Data

Our data will be the set of sentences (phrases) containing 2 topics as below:
Note: I highlighted in bold 3 sentences on weather topic, all other sentences have totally different topic.
sentences = [[‘this’, ‘is’, ‘the’, ‘one’,’good’, ‘machine’, ‘learning’, ‘book’],
[‘this’, ‘is’, ‘another’, ‘book’],
[‘one’, ‘more’, ‘book’],
[‘weather’, ‘rain’, ‘snow’],
[‘yesterday’, ‘weather’, ‘snow’],
[‘forecast’, ‘tomorrow’, ‘rain’, ‘snow’],

[‘this’, ‘is’, ‘the’, ‘new’, ‘post’],
[‘this’, ‘is’, ‘about’, ‘more’, ‘machine’, ‘learning’, ‘post’],
[‘and’, ‘this’, ‘is’, ‘the’, ‘one’, ‘last’, ‘post’, ‘book’]]

Word Embedding Method

For embeddings we will use gensim word2vec model. There is also doc2vec model – but we will use it at next post.
With the need to do text clustering at sentence level there will be one extra step for moving from word level to sentence level. For each sentence from the set of sentences, word embedding of each word is summed and in the end divided by number of words in the sentence. So we are getting average of all word embeddings for each sentence and use them as we would use embeddings at word level – feeding to machine learning clustering algorithm such k-means.

Here is the example of the function that doing this:

def sent_vectorizer(sent, model):
    sent_vec =[]
    numw = 0
    for w in sent:
        try:
            if numw == 0:
                sent_vec = model[w]
            else:
                sent_vec = np.add(sent_vec, model[w])
            numw+=1
        except:
            pass
    
    return np.asarray(sent_vec) / numw

Now we will use text clustering Kmeans algorithm with word2vec model for embeddings. For kmeans algorithm we will use 2 separate implementations with different libraries NLTK for KMeansClusterer and sklearn for cluster. This was described in previous posts (see the list above).

The code for this article can be found in the end of this post. We use 2 for number of clusters in both k means text clustering algorithms.
Additionally we will plot data using tSNE.

Output

Below are results

[1, 1, 1, 0, 0, 0, 1, 1, 1]

Cluster id and sentence:
1:[‘this’, ‘is’, ‘the’, ‘one’, ‘good’, ‘machine’, ‘learning’, ‘book’]
1:[‘this’, ‘is’, ‘another’, ‘book’]
1:[‘one’, ‘more’, ‘book’]
0:[‘weather’, ‘rain’, ‘snow’]
0:[‘yesterday’, ‘weather’, ‘snow’]
0:[‘forecast’, ‘tomorrow’, ‘rain’, ‘snow’]

1:[‘this’, ‘is’, ‘the’, ‘new’, ‘post’]
1:[‘this’, ‘is’, ‘about’, ‘more’, ‘machine’, ‘learning’, ‘post’]
1:[‘and’, ‘this’, ‘is’, ‘the’, ‘one’, ‘last’, ‘post’, ‘book’]

Score (Opposite of the value of X on the K-means objective which is Sum of distances of samples to their closest cluster center):
-0.0008175040203510163
Silhouette_score:
0.3498247

Cluster id and sentence:
1 [‘this’, ‘is’, ‘the’, ‘one’, ‘good’, ‘machine’, ‘learning’, ‘book’]
1 [‘this’, ‘is’, ‘another’, ‘book’]
1 [‘one’, ‘more’, ‘book’]
0 [‘weather’, ‘rain’, ‘snow’]
0 [‘yesterday’, ‘weather’, ‘snow’]
0 [‘forecast’, ‘tomorrow’, ‘rain’, ‘snow’]

1 [‘this’, ‘is’, ‘the’, ‘new’, ‘post’]
1 [‘this’, ‘is’, ‘about’, ‘more’, ‘machine’, ‘learning’, ‘post’]
1 [‘and’, ‘this’, ‘is’, ‘the’, ‘one’, ‘last’, ‘post’, ‘book’]

Results of text clustering
Results of text clustering

We see that the data were clustered according to our expectation – different sentences by topic appeared to different clusters. Thus we learned how to do clustering algorithms in data mining or machine learning with word embeddings at sentence level. Here we used kmeans clustering and word2vec embedding model. We created additional function to go from word embeddings to sentence embeddings level. In the next post we will use doc2vec and will not need this function.

Below is full source code python script.

from gensim.models import Word2Vec
 
from nltk.cluster import KMeansClusterer
import nltk
import numpy as np 
 
from sklearn import cluster
from sklearn import metrics
 
# training data
 
sentences = [['this', 'is', 'the', 'one','good', 'machine', 'learning', 'book'],
            ['this', 'is',  'another', 'book'],
            ['one', 'more', 'book'],
            ['weather', 'rain', 'snow'],
            ['yesterday', 'weather', 'snow'],
            ['forecast', 'tomorrow', 'rain', 'snow'],
            ['this', 'is', 'the', 'new', 'post'],
            ['this', 'is', 'about', 'more', 'machine', 'learning', 'post'],  
            ['and', 'this', 'is', 'the', 'one', 'last', 'post', 'book']]
 
 

model = Word2Vec(sentences, min_count=1)

 
def sent_vectorizer(sent, model):
    sent_vec =[]
    numw = 0
    for w in sent:
        try:
            if numw == 0:
                sent_vec = model[w]
            else:
                sent_vec = np.add(sent_vec, model[w])
            numw+=1
        except:
            pass
    
    return np.asarray(sent_vec) / numw
 
 
X=[]
for sentence in sentences:
    X.append(sent_vectorizer(sentence, model))   

print ("========================")
print (X)


 

# note with some version you would need use this (without wv) 
#  model[model.vocab] 
print (model[model.wv.vocab])


 

print (model.similarity('post', 'book'))
print (model.most_similar(positive=['machine'], negative=[], topn=2))
 
 

 
 
NUM_CLUSTERS=2
kclusterer = KMeansClusterer(NUM_CLUSTERS, distance=nltk.cluster.util.cosine_distance, repeats=25)
assigned_clusters = kclusterer.cluster(X, assign_clusters=True)
print (assigned_clusters)
 
 
 
for index, sentence in enumerate(sentences):    
    print (str(assigned_clusters[index]) + ":" + str(sentence))

    
    
    
kmeans = cluster.KMeans(n_clusters=NUM_CLUSTERS)
kmeans.fit(X)
 
labels = kmeans.labels_
centroids = kmeans.cluster_centers_
 
print ("Cluster id labels for inputted data")
print (labels)
print ("Centroids data")
print (centroids)
 
print ("Score (Opposite of the value of X on the K-means objective which is Sum of distances of samples to their closest cluster center):")
print (kmeans.score(X))
 
silhouette_score = metrics.silhouette_score(X, labels, metric='euclidean')
 
print ("Silhouette_score: ")
print (silhouette_score)


import matplotlib.pyplot as plt

from sklearn.manifold import TSNE

model = TSNE(n_components=2, random_state=0)
np.set_printoptions(suppress=True)

Y=model.fit_transform(X)


plt.scatter(Y[:, 0], Y[:, 1], c=assigned_clusters, s=290,alpha=.5)


for j in range(len(sentences)):    
   plt.annotate(assigned_clusters[j],xy=(Y[j][0], Y[j][1]),xytext=(0,0),textcoords='offset points')
   print ("%s %s" % (assigned_clusters[j],  sentences[j]))


plt.show()

Document Similarity, Tokenization and Word Vectors in Python with spaCY

Calculating document similarity is very frequent task in Information Retrieval or Text Mining. Years ago we would need to build a document-term matrix or term-document matrix that describes the frequency of terms that occur in a collection of documents and then do word vectors math to find similarity. Now by using spaCY it can be done just within few lines. Below you will find how to get document similarity , tokenization and word vectors with spaCY.

spaCY is an open-source library designed to help you build NLP applications. It has a lot of features, we will look in this post only at few but very useful.

Document Similarity

Here is how to get document similarity:

import spacy
nlp = spacy.load('en')

doc1 = nlp(u'Hello this is document similarity calculation')
doc2 = nlp(u'Hello this is python similarity calculation')
doc3 = nlp(u'Hi there')

print (doc1.similarity(doc2)) 
print (doc2.similarity(doc3)) 
print (doc1.similarity(doc3))  

Output:
0.94
0.33
0.30

In more realistic situations we would load documents from files and would have longer text. Here is the experiment that I performed. I saved 3 articles from different random sites, two about deep learning and one about feature engineering.

def get_file_contents(filename):
  with open(filename, 'r') as filehandle:  
    filecontent = filehandle.read()
    return (filecontent) 

fn1="deep_learning1.txt"
fn2="feature_eng.txt"
fn3="deep_learning.txt"

fn1_doc=get_file_contents(fn1)
print (fn1_doc)

fn2_doc=get_file_contents(fn2)
print (fn2_doc)

fn3_doc=get_file_contents(fn3)
print (fn3_doc)
 
doc1 = nlp(fn1_doc)
doc2 = nlp(fn2_doc)
doc3 = nlp(fn3_doc)
 
print ("dl1 - features")
print (doc1.similarity(doc2)) 
print ("feature - dl")
print (doc2.similarity(doc3)) 
print ("dl1 - dl")
print (doc1.similarity(doc3)) 
 
"""
output:
dl1 - features
0.9700237040142454
feature - dl
0.9656364096761337
dl1 - dl
0.9547075478662724
"""


It was able to assign higher similarity score for documents with similar topics!

Tokenization

Another very useful and simple feature that can be done with spaCY is tokenization. Here is how easy to convert text into tokens (words):

for token in doc1:
    print(token.text)
    print (token.vector)

Word Vectors

spaCY has integrated word vectors support, while other libraries like NLTK do not have it. Below line will print word embeddings – array of 768 numbers on my environment.

 
print (token.vector)   #-  prints word vector form of token. 
print (doc1[0].vector) #- prints word vector form of first token of document.
print (doc1.vector)    #- prints mean vector form for doc1

So we looked how to use few features (similarity, tokenization and word embeddings) which are very easy to implement with spaCY. I hope you enjoyed this post. If you have any tips or anything else to add, please leave a comment below.

References
1. spaCY
2. Word Embeddings in Python with Spacy and Gensim

FastText Word Embeddings for Text Classification with MLP and Python

Word embeddings are widely used now in many text applications or natural language processing moddels. In the previous posts I showed examples how to use word embeddings from word2vec Google, glove models for different tasks including machine learning clustering:

GloVe – How to Convert Word to Vector with GloVe and Python

word2vec – Vector Representation of Text – Word Embeddings with word2vec

word2vec application – K Means Clustering Example with Word2Vec in Data Mining or Machine Learning

In this post we will look at fastText word embeddings in machine learning. You will learn how to load pretrained fastText, get text embeddings and do text classification. As stated on fastText site – text classification is a core problem to many applications, like spam detection, sentiment analysis or smart replies. [1]

What is fastText

fastText is an open-source, free, lightweight library that allows users to learn text representations and text classifiers. [1]

fastText, is created by Facebook’s AI Research (FAIR) lab. The model is an unsupervised learning algorithm for obtaining vector representations for words. Facebook makes available pretrained models for 294 languages.[2]

As per Quora [6], Fasttext treats each word as composed of character ngrams. So the vector for a word is made of the sum of this character n grams. Word2vec (and glove) treat words as the smallest unit to train on. This means that fastText can generate better word embeddings for rare words. Also fastText can generate word embeddings for out of vocabulary word but word2vec and glove can not do this.

Word Embeddings File

I downloaded wiki file wiki-news-300d-1M.vec from here [4], but there are some other links where you can download different data files. I found this one has smaller size so it is easy to work with it.

Basic Operations with fastText Word Embeddings

To get most similar words to some word:

from gensim.models import KeyedVectors
model = KeyedVectors.load_word2vec_format('wiki-news-300d-1M.vec')
print (model.most_similar('desk'))

"""
[('desks', 0.7923153638839722), ('Desk', 0.6869951486587524), ('desk.', 0.6602819561958313), ('desk-', 0.6187258958816528), ('credenza', 0.5955315828323364), ('roll-top', 0.5875717401504517), ('rolltop', 0.5837830305099487), ('bookshelf', 0.5758029222488403), ('Desks', 0.5755287408828735), ('sofa', 0.5617446899414062)]
"""

Load words in vocabulary:

words = []
for word in model.vocab:
    words.append(word)

To see embeddings:

print("Vector components of a word: {}".format(
    model[words[0]]
))

"""
Vector components of a word: [-0.0451  0.0052  0.0776 -0.028   0.0289  0.0449  0.0117 -0.0333  0.1055
 .......................................
 -0.1368 -0.0058 -0.0713]
"""

The Problem

So here we will use fastText word embeddings for text classification of sentences. For this classification we will use sklean Multi-layer Perceptron classifier (MLP).
The sentences are prepared and inserted into script:

sentences = [['this', 'is', 'the', 'good', 'machine', 'learning', 'book'],
			['this', 'is',  'another', 'machine', 'learning', 'book'],
			['one', 'more', 'new', 'book'],
		
          ['this', 'is', 'about', 'machine', 'learning', 'post'],
          ['orange', 'juice', 'is', 'the', 'liquid', 'extract', 'of', 'fruit'],
          ['orange', 'juice', 'comes', 'in', 'several', 'different', 'varieties'],
          ['this', 'is', 'the', 'last', 'machine', 'learning', 'book'],
          ['orange', 'juice', 'comes', 'in', 'several', 'different', 'packages'],
          ['orange', 'juice', 'is', 'liquid', 'extract', 'from', 'fruit', 'on', 'orange', 'tree']]

The sentences belong to two classes, the labels for classes will be assigned later as 0,1. So our problem is to classify above sentences. Below is the flowchart of the program that we will use for perceptron learning algorithm example.

Text classification using word embeddings
Text classification using word embeddings

Data Preparation

I converted this text input into digital using the following code. Basically I got word embedidings and averaged all words in the sentences. The resulting vector sentence representations were saved to array V.

import numpy as np

def sent_vectorizer(sent, model):
    sent_vec =[]
    numw = 0
    for w in sent:
        try:
            if numw == 0:
                sent_vec = model[w]
            else:
                sent_vec = np.add(sent_vec, model[w])
            numw+=1
        except:
            pass
   
    return np.asarray(sent_vec) / numw


V=[]
for sentence in sentences:
    V.append(sent_vectorizer(sentence, model))   

After converting text into vectors we can divide data into training and testing datasets and attach class labels.

X_train = V[0:6]
X_test = V[6:9] 
          
Y_train = [0, 0, 0, 0, 1,1]
Y_test =  [0,1,1]   

Text Classification

Now it is time to load data to MLP Classifier to do text classification.

from sklearn.neural_network import MLPClassifier
classifier = MLPClassifier(alpha = 0.7, max_iter=400) 
classifier.fit(X_train, Y_train)

df_results = pd.DataFrame(data=np.zeros(shape=(1,3)), columns = ['classifier', 'train_score', 'test_score'] )
train_score = classifier.score(X_train, Y_train)
test_score = classifier.score(X_test, Y_test)

print  (classifier.predict_proba(X_test))
print  (classifier.predict(X_test))

df_results.loc[1,'classifier'] = "MLP"
df_results.loc[1,'train_score'] = train_score
df_results.loc[1,'test_score'] = test_score

print(df_results)
     
"""
Output
  classifier  train_score  test_score
         MLP          1.0         1.0
"""

In this post we learned how to use pretrained fastText word embeddings for converting text data into vector model. We also looked how to load word embeddings into machine learning algorithm. And in the end of post we looked at machine learning text classification using MLP Classifier with our fastText word embeddings. You can find full python source code and references below.

from gensim.models import KeyedVectors
import pandas as pd

model = KeyedVectors.load_word2vec_format('wiki-news-300d-1M.vec')
print (model.most_similar('desk'))

words = []
for word in model.vocab:
    words.append(word)

print("Vector components of a word: {}".format(
    model[words[0]]
))
sentences = [['this', 'is', 'the', 'good', 'machine', 'learning', 'book'],
			['this', 'is',  'another', 'machine', 'learning', 'book'],
			['one', 'more', 'new', 'book'],
	    ['this', 'is', 'about', 'machine', 'learning', 'post'],
          ['orange', 'juice', 'is', 'the', 'liquid', 'extract', 'of', 'fruit'],
          ['orange', 'juice', 'comes', 'in', 'several', 'different', 'varieties'],
          ['this', 'is', 'the', 'last', 'machine', 'learning', 'book'],
          ['orange', 'juice', 'comes', 'in', 'several', 'different', 'packages'],
          ['orange', 'juice', 'is', 'liquid', 'extract', 'from', 'fruit', 'on', 'orange', 'tree']]
         
import numpy as np

def sent_vectorizer(sent, model):
    sent_vec =[]
    numw = 0
    for w in sent:
        try:
            if numw == 0:
                sent_vec = model[w]
            else:
                sent_vec = np.add(sent_vec, model[w])
            numw+=1
        except:
            pass
   
    return np.asarray(sent_vec) / numw

V=[]
for sentence in sentences:
    V.append(sent_vectorizer(sentence, model))   
         
    
X_train = V[0:6]
X_test = V[6:9] 
Y_train = [0, 0, 0, 0, 1,1]
Y_test =  [0,1,1]    
    
    
from sklearn.neural_network import MLPClassifier
classifier = MLPClassifier(alpha = 0.7, max_iter=400) 
classifier.fit(X_train, Y_train)

df_results = pd.DataFrame(data=np.zeros(shape=(1,3)), columns = ['classifier', 'train_score', 'test_score'] )
train_score = classifier.score(X_train, Y_train)
test_score = classifier.score(X_test, Y_test)

print  (classifier.predict_proba(X_test))
print  (classifier.predict(X_test))

df_results.loc[1,'classifier'] = "MLP"
df_results.loc[1,'train_score'] = train_score
df_results.loc[1,'test_score'] = test_score
print(df_results)

References
1. fasttext.cc
2. fastText
3.
Classification with scikit learn
4. english-vectors
5. How to use pre-trained word vectors from Facebook’s fastText
6. What is the main difference between word2vec and fastText?

How to Convert Word to Vector with GloVe and Python

In the previous post we looked at Vector Representation of Text with word embeddings using word2vec. Another approach that can be used to convert word to vector is to use GloVe – Global Vectors for Word Representation. Per documentation from home page of GloVe [1] “GloVe is an unsupervised learning algorithm for obtaining vector representations for words. Training is performed on aggregated global word-word co-occurrence statistics from a corpus”. Thus we can convert word to vector using GloVe.

At this post we will look how to use pretrained GloVe data file that can be downloaded from [1].
word embeddings GloVe We will look how to get word vector representation from this downloaded datafile. We will also look how to get nearest words. Why do we need vector representation of text? Because this is what we input to machine learning or data science algorithms – we feed numerical vectors to algorithms such as text classification, machine learning clustering or other text analytics algorithms.

Loading Glove Datafile

The code that I put here is based on some examples that I found on StackOverflow [2].

So first you need to open the file and load data into the model. Then you can get the vector representation and other things.

Below is the full source code for glove python script:

file = "C:\\Users\\glove\\glove.6B.50d.txt"
import numpy as np
def loadGloveModel(gloveFile):
    print ("Loading Glove Model")
   
    
    with open(gloveFile, encoding="utf8" ) as f:
       content = f.readlines()
    model = {}
    for line in content:
        splitLine = line.split()
        word = splitLine[0]
        embedding = np.array([float(val) for val in splitLine[1:]])
        model[word] = embedding
    print ("Done.",len(model)," words loaded!")
    return model
    
    
model= loadGloveModel(file)   

print (model['hello'])

"""
Below is the output of the above code
Loading Glove Model
Done. 400000  words loaded!
[-0.38497   0.80092   0.064106 -0.28355  -0.026759 -0.34532  -0.64253
 -0.11729  -0.33257   0.55243  -0.087813  0.9035    0.47102   0.56657
  0.6985   -0.35229  -0.86542   0.90573   0.03576  -0.071705 -0.12327
  0.54923   0.47005   0.35572   1.2611   -0.67581  -0.94983   0.68666
  0.3871   -1.3492    0.63512   0.46416  -0.48814   0.83827  -0.9246
 -0.33722   0.53741  -1.0616   -0.081403 -0.67111   0.30923  -0.3923
 -0.55002  -0.68827   0.58049  -0.11626   0.013139 -0.57654   0.048833
  0.67204 ]
"""  

So we got numerical representation of word ‘hello’.
We can use also pandas to load GloVe file. Below are functions for loading with pandas and getting vector information.

import pandas as pd
import csv

words = pd.read_table(file, sep=" ", index_col=0, header=None, quoting=csv.QUOTE_NONE)


def vec(w):
  return words.loc[w].as_matrix()
 

print (vec('hello'))    #this will print same as print (model['hello'])  before
 

Finding Closest Word or Words

Now how do we find closest word to word “table”? We iterate through pandas dataframe, find deltas and then use numpy argmin function.
The closest word to some word will be always this word itself (as delta = 0) so I needed to drop the word ‘table’ and also next closest word ‘tables’. The final output for the closest word was “place”

words = words.drop("table", axis=0)  
words = words.drop("tables", axis=0)  

words_matrix = words.as_matrix()

def find_closest_word(v):
  diff = words_matrix - v
  delta = np.sum(diff * diff, axis=1)
  i = np.argmin(delta)
  return words.iloc[i].name 


print (find_closest_word(model['table']))
#output:  place

#If we want retrieve more than one closest words here is the function:

def find_N_closest_word(v, N, words):
  Nwords=[]  
  for w in range(N):  
     diff = words.as_matrix() - v
     delta = np.sum(diff * diff, axis=1)
     i = np.argmin(delta)
     Nwords.append(words.iloc[i].name)
     words = words.drop(words.iloc[i].name, axis=0)
    
  return Nwords
  
  
print (find_N_closest_word(model['table'], 10, words)) 

#Output:
#['table', 'tables', 'place', 'sit', 'set', 'hold', 'setting', 'here', 'placing', 'bottom']

We can also use gensim word2vec library functionalities after we load GloVe file.

from gensim.scripts.glove2word2vec import glove2word2vec
glove2word2vec(glove_input_file=file, word2vec_output_file="gensim_glove_vectors.txt")

###Finally, read the word2vec txt to a gensim model using KeyedVectors:

from gensim.models.keyedvectors import KeyedVectors
glove_model = KeyedVectors.load_word2vec_format("gensim_glove_vectors.txt", binary=False)

Difference between word2vec and GloVe

Both models learn geometrical encodings (vectors) of words from their co-occurrence information. They differ in the way how they learn this information. word2vec is using a “predictive” model (feed-forward neural network), whereas GloVe is using a “count-based” model (dimensionality reduction on the co-occurrence counts matrix). [3]

I hope you enjoyed reading this post about how to convert word to vector with GloVe and python. If you have any tips or anything else to add, please leave a comment below.

References
1. GloVe: Global Vectors for Word Representation
2. Load pretrained glove vectors in python
3. How is GloVe different from word2vec
4. Don’t count, predict! A systematic comparison of
context-counting vs. context-predicting semantic vectors

5. Words Embeddings

Vector Representation of Text – Word Embeddings with word2vec

Computers can not understand the text. We need to convert text into numerical vectors before any kind of text analysis like text clustering or classification. The classical well known model is bag of words (BOW). With this model we have one dimension per each unique word in vocabulary. We represent the document as vector with 0s and 1s. We use 1 if the word from vocabulary exists in the document.

Recently new models with word embedding in machine learning gained popularity since they allow to keep semantic information. With word embeddings we can get lower dimensionality than with BOW model. There are several such models for example Glove, word2vec that are used in machine learning text analysis.

Many examples on the web are showing how to operate at word level with word embeddings methods but in the most cases we are working at the document level (sentence, paragraph or document) To get understanding how it can be used for text analytics I decided to take word2vect and create small practical example.

In this post you will learn how to use word embedding word2vect method for converting sentence into numerical vector. The same technique can be used for text with more than one sentence. We will create python script that converts sentences into numerical vectors.

Input

For the input for this script we will use hard coded in the script sentences. The sentences in the script will be already tokenized. Below you can find sentences for our input. Note that sentences 6 and 7 are more distinguish from other sentences.

1 [['this', 'is', 'the', 'good', 'machine', 'learning', 'book'],
2 ['this', 'is',  'another', 'book'],
3 ['one', 'more', 'book'],
4 ['this', 'is', 'the', 'new', 'post'],
5 ['this', 'is', 'about', 'machine', 'learning', 'post'], 
6 ['orange', 'juice', 'is', 'the', 'liquid', 'extract', 'of', 'the', 'fruit'],
7 ['orange', 'juice', 'comes', 'in', 'several', 'different', 'varieties'],
8 ['and', 'this', 'is', 'the', 'last', 'post']]

With word2vec you have two options:
1. Create your own word2vec
2. Use pretrained data from Google

From word to sentence

Each word in word embeddings is represented by the vector. But let’s say we are working with tweets from twitter and need to know how similar or dissimilar are tweets? So we need to have vector representation of whole text in tweet. To achieve this we can do average word embeddings for each word in sentence (or tweet or paragraph) The idea come from paper [1]. In this paper the authors averaged word embeddings to get paragraph vector.

Source code for conversion


Below in Listing A and Listing B you can find how we can average word embeddings and get numerical vectors.
Listing A has the python source code for using own word embeddings.
Listing B has the python source code for using word embeddings from Google.
The script is taking embeddings from local file that was downloaded from Google before. You can find in this post Using Pretrained Word Embeddings in Machine Learning more details on downloading word embeddings from Google.

When averaging embeddings I was using 50 first dimensions. This is the minimal number that was used in one of the papers. The recommendation is to use between 100-400 dimensions.

Analysis of Results

How do we know that our results are good? We will do here a quick check as following. We will calculate the distance (similarity measure) between vectors and will compare with our expectation. If text sentences belong to different context then we expect the distance will be more and if sentences are close together then distance will be less. Because context of sentences 6 and 7 is different from other sentences we would expect to see this difference in results.

For calculating distance we use in the script cosine measure. With cosine measure most similar will be the one that have the highest cosine value. Below are results:
Note that 0 values mean that cosine value was not calculated because there is no need to do this. ( value already calculated for example for doc21 = doc12 or the value is on diagonal )

Results from Listing A (using own web embedings)
 1   2    3    4    5    6    7    8
1[0, 0.5, 0.1, 0.5, 0.6, 0.4, 0.2, 0.4],
2[0, 0,   0.2, 0.6, 0.5, 0.2, 0.1, 0.5],
3[0, 0,   0,   0.0, 0.0, 0.0, 0.1, 0.0],
4[0, 0,   0,   0,   0.6, 0.5, 0.3, 0.7],
5[0, 0,   0,   0,   0,   0.2, 0.2, 0.6],
6[0, 0,   0,   0,   0,   0,   0.4, 0.4], 
7[0, 0,   0,   0,   0,   0,   0,   0.3], 
8[0, 0,   0,   0,   0,   0,   0,   0]


Results from Listing B (using pretrained dataset):
  1  2     3     4     5     6     7     8
1[0, 0.77, 0.33, 0.57, 0.78, 0.35, 0.37, 0.55],
2[0, 0,    0.60, 0.62, 0.51, 0.31, 0.29, 0.59],
3[0, 0,    0,    0.16, 0.12, 0.18, 0.25, 0.11], 
4[0, 0,    0,    0,    0.62, 0.41, 0.37, 0.89],
5[0, 0,    0,    0,    0,    0.35, 0.27, 0.61], 
6[0, 0,    0,    0,    0,    0,    0.81, 0.37], 
7[0, 0,    0,    0,    0,    0,    0,    0.32],
8[0, 0,    0,    0,    0,    0,    0,    0]]

Looking at results we can see that our expectations are confirmed especially on results where pretrained word embeddings were used. Sentences 6,7 have low similarity with other sentences but have high similarity 0.81 when we compare sentence 6 with 7.

Conclusion

In this post we considered how to represent document (sentence, paragraph) as vector of numbers using word embeddings model word2vec. We looked at 2 possible ways – using own embeddings and using embeddings from Google. We got results for our small example and we were able to evaluate the results.

Now we can feed vector representation of text into machine learning text analysis algorithms.

Here are a few posts where you can find how to feed word2vec word embedding in text clustering algorithms such as kmeans from NLTK and sklearn libraries and how to plot data with TSNE :
K Means Clustering Example with Word2Vec in Data Mining or Machine Learning
Text Clustering with Word Embedding in Machine Learning

Below are few links for different word embedding models that are also widely used:
GloVe –
How to Convert Word to Vector with GloVe and Python
fastText –
FastText Word Embeddings

I hope you enjoyed this post about representing text as vector using word2vec. If you have any tips or anything else to add, please leave a comment in the reply box.

Listing A. Here is the python source code for using own word embeddings

from gensim.models import Word2Vec
sentences = [['this', 'is', 'the', 'good', 'machine', 'learning', 'book'],
			['this', 'is',  'another', 'book'],
			['one', 'more', 'book'],
			['this', 'is', 'the', 'new', 'post'],
          ['this', 'is', 'about', 'machine', 'learning', 'post'], 
          ['orange', 'juice', 'is', 'the', 'liquid', 'extract', 'of', 'the', 'fruit'],
          ['orange', 'juice', 'comes', 'in', 'several', 'different', 'varieties'],
			['and', 'this', 'is', 'the', 'last', 'post']]




model = Word2Vec(sentences, min_count=1, size=100)
vocab = model.vocab.keys()
wordsInVocab = len(vocab)
print (model.similarity('post', 'book'))


import numpy as np

def sent_vectorizer(sent, model):
    sent_vec = np.zeros(100)
    numw = 0
    for w in sent:
        try:
            sent_vec = np.add(sent_vec, model[w])
            numw+=1
        except:
            pass
    return sent_vec / np.sqrt(sent_vec.dot(sent_vec))

V=[]
for sentence in sentences:
    V.append(sent_vectorizer(sentence, model))

from numpy import dot
from numpy.linalg import norm
results = [[0 for i in range(len(V))] for j in range(len(V))] 

for i in range (len(V) - 1):
    for j in range(i+1, len(V)):
           results[i][j] = dot(V[i],V[j])/norm(V[i])/norm(V[j])


print (results)

Listing B. Here is the python source code for using word embeddings from Google.

import gensim
model = gensim.models.Word2Vec.load_word2vec_format('C:\\Users\\Downloads\\GoogleNews-vectors-negative300.bin', binary=True)  

sentences = [['this', 'is', 'the', 'good', 'machine', 'learning', 'book'],
			['this', 'is',  'another', 'book'],
			['one', 'more', 'book'],
			['this', 'is', 'the', 'new', 'post'],
          ['this', 'is', 'about', 'machine', 'learning', 'post'], 
          ['orange', 'juice', 'is', 'the', 'liquid', 'extract', 'of', 'the', 'fruit'],
          ['orange', 'juice', 'comes', 'in', 'several', 'different', 'varieties'],
			['and', 'this', 'is', 'the', 'last', 'post']]


vocab = model.vocab.keys()
wordsInVocab = len(vocab)

import numpy as np

def sent_vectorizer(sent, model):
    sent_vec = np.zeros(50)
    numw = 0
    for w in sent:
        try:
            vc=model[w]
            vc=vc[0:50]
           
            sent_vec = np.add(sent_vec, vc) 
            numw+=1
        except:
            pass
    return sent_vec / np.sqrt(sent_vec.dot(sent_vec))

V=[]
for sentence in sentences:
    V.append(sent_vectorizer(sentence, model))
from numpy.linalg import norm
results = [[0 for i in range(len(V))] for j in range(len(V))] 

for i in range (len(V) - 1):
    for j in range(i+1, len(V)):
           
       NVI=norm(V[i])
       NVJ=norm(V[j])
           
       dotVij =0
       NVI=0
       for x in range(50):
           NVI=NVI +  V[i][x]*V[i][x]
           
       NVJ=0
       for x in range(50):
           NVJ=NVJ +  V[j][x]*V[j][x]
            
       for x in range(50):
      
               dotVij = dotVij + V[i][x] * V[j][x]
         
      
       results[i][j] = dotVij / (NVI*NVJ) 

print (results)

References
1. Document Embedding with Paragraph Vectors

Using Pretrained Word Embeddings in Machine Learning

In this post you will learn how to use pre-trained word embeddings in machine learning. Google provides News corpus (3 billion running words) word vector model (3 million 300-dimension English word vectors).

Download file from this link word2vec-GoogleNews-vectors and save it in some local folder. Open it with zip program and extract the .bin file. So instead of file GoogleNews-vectors-negative300.bin.gz you will have the file GoogleNews-vectors-negative300.bin

Now you can use the below snippet to load this file using gensim. Change the file path to actual file folder where you saved the file in the previous step.

Gensim
Gensim is a Python library for topic modelling, document indexing and similarity retrieval with large corpora. It is Python framework for fast Vector Space Modelling.

The below python code snippet demonstrates how to load pretrained Google file into the model and then query model for example for similarity between word.
# -*- coding: utf-8 -*-

import gensim

model = gensim.models.Word2Vec.load_word2vec_format('C:\\Users\\GoogleNews-vectors-negative300.bin', binary=True)  

vocab = model.vocab.keys()
wordsInVocab = len(vocab)
print (wordsInVocab)
print (model.similarity('this', 'is'))
print (model.similarity('post', 'book'))

Output from the above code:
3000000
0.407970363878
0.0572043891977

You can do all other things same way as if you would use own trained word embeddings. The Google file however is big, it is 1.5 GB original size, and unzipped it has 3.3GB. On my 6GB RAM laptop it took a while to run the below code. But it run it. However some other commands I was not able to run.

See this post K Means Clustering Example with Word2Vec which is showing embedding in machine learning algorithm. Here Word2Vec model will be feeded into several k-means clustering algorithms from NLTK and Scikit-learn libraries.

GloVe and fastText Word Embedding in Machine Learning

Word2vec is not the the only word embedding available for use. Below are the few links for other word embeddings.
Here How to Convert Word to Vector with GloVe and Python you will find how to convert word to vector with GloVe – Global Vectors for Word Representation. Detailed example is shown how to use pretrained GloVe data file that can be downloaded.

And one more link is here FastText Word Embeddings for Text Classification with MLP and Python In this post you will discover fastText word embeddings – how to load pretrained fastText, get text embeddings and use it in document classification example.

1. Google’s trained Word2Vec model in Python
2. word2vec-GoogleNews-vectors
3. gensim 3.1.0