Text Clustering with doc2vec Word Embedding Machine Learning Model

In this post we will look at the doc2vec word embedding model: how to build it yourself or how to use a pretrained embedding file. As a practical example, we will explore how to do text clustering with a doc2vec model.

Doc2vec

Doc2vec is an unsupervised algorithm that generates vectors for sentences, paragraphs or documents. The algorithm is an adaptation of word2vec, which generates vectors for words. Below you can see the frameworks for learning the word vectors in word2vec (left side) and the paragraph vector in doc2vec (right side). In doc2vec, the paragraph vector is added to represent information missing from the current context and to act as a memory of the topic of the paragraph. [1]

Word Embeddings Machine Learning Frameworks: word2vec and doc2vec

If you need information about word2vec here are some posts:
word2vec –
Vector Representation of Text – Word Embeddings with word2vec
word2vec application –
Text Analytics Techniques with Embeddings
Using Pretrained Word Embeddings in Machine Learning
K Means Clustering Example with Word2Vec in Data Mining or Machine Learning

The vectors generated by doc2vec can be used for tasks such as finding the similarity between sentences, paragraphs or documents. [2] With doc2vec you can get the vector for a sentence or paragraph directly from the model, without the additional computations you would need with word2vec; for example, in the following post we had to use a function to go from the word level to the sentence level:
Text Clustering with Word Embedding in Machine Learning

word2vec was very successful and it inspired the idea of converting many other kinds of text to vectors; this trend is sometimes called "anything to vector". So there are many word embedding models that, like doc2vec, can convert more than one word into a numeric vector. [3][4] Here are a few examples:

tweet2vec Tweet2Vec: Character-Based Distributed Representations for Social Media
lda2vec Mixing Dirichlet Topic Models and Word Embeddings to Make lda2vec. Here is proposed model that learns dense word vectors jointly with Dirichlet-distributed latent document-level mixtures of topic vectors.
Topic2Vec Learning Distributed Representations of Topics
Med2vec Multi-layer Representation Learning for Medical Concepts
The list can go on. In the next section we will look at how to build a doc2vec model and use it for text clustering.

Building doc2vec Model

Here is an example of converting a paragraph of words to a vector using our own doc2vec model built from scratch. The example is taken from [5].

The script consists of the following main steps:

  • build model using own text
  • save model to file
  • load model from this file
  • infer vector representation

from gensim.test.utils import common_texts
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

print (common_texts)

"""
output:
[['human', 'interface', 'computer'], ['survey', 'user', 'computer', 'system', 'response', 'time'], ['eps', 'user', 'interface', 'system'], ['system', 'human', 'system', 'eps'], ['user', 'response', 'time'], ['trees'], ['graph', 'trees'], ['graph', 'minors', 'trees'], ['graph', 'minors', 'survey']]
"""


documents = [TaggedDocument(doc, [i]) for i, doc in enumerate(common_texts)]

print (documents)
"""
output
[TaggedDocument(words=['human', 'interface', 'computer'], tags=[0]), TaggedDocument(words=['survey', 'user', 'computer', 'system', 'response', 'time'], tags=[1]), TaggedDocument(words=['eps', 'user', 'interface', 'system'], tags=[2]), TaggedDocument(words=['system', 'human', 'system', 'eps'], tags=[3]), TaggedDocument(words=['user', 'response', 'time'], tags=[4]), TaggedDocument(words=['trees'], tags=[5]), TaggedDocument(words=['graph', 'trees'], tags=[6]), TaggedDocument(words=['graph', 'minors', 'trees'], tags=[7]), TaggedDocument(words=['graph', 'minors', 'survey'], tags=[8])]

"""

model = Doc2Vec(documents, size=5, window=2, min_count=1, workers=4)   # note: newer gensim versions use vector_size instead of size
#Persist a model to disk:

from gensim.test.utils import get_tmpfile
fname = get_tmpfile("my_doc2vec_model")

print (fname)
#output: C:\Users\userABC\AppData\Local\Temp\my_doc2vec_model

#save model to file
model.save(fname)

#load model from saved file
model = Doc2Vec.load(fname)
# you can continue training with the loaded model!
#If you’re finished training a model (=no more updates, only querying, reduce memory usage), you can do:

model.delete_temporary_training_data(keep_doctags_vectors=True, keep_inference=True)

#Infer vector for a new document:
#Here our text paragraph is just 2 words
vector = model.infer_vector(["system", "response"])
print (vector)

"""
output

[-0.08390492  0.01629403 -0.08274432  0.06739668 -0.07021132]
 
 """

Using Pretrained doc2vec Model

We can skip the step of building the embedding file and use one that is already built. Here is an example of using a pretrained embedding file to represent test documents as vectors. The script is based on [6].

The script below uses a doc2vec model pretrained on Wikipedia data from this location.

Here is the link where you can find links to different pre-trained doc2vec and word2vec models and additional information.

You need to download the zip file, unzip it, put the 3 files in some folder and provide the path in the script. In this example it is "doc2vec/doc2vec.bin".

The main steps of the script below are simply to load the doc2vec model and infer vectors.


import gensim.models as g
import codecs

model="doc2vec/doc2vec.bin"
test_docs="data/test_docs.txt"
output_file="data/test_vectors.txt"

#inference hyper-parameters
start_alpha=0.01
infer_epoch=1000

#load model
m = g.Doc2Vec.load(model)
test_docs = [ x.strip().split() for x in codecs.open(test_docs, "r", "utf-8").readlines() ]

#infer test vectors (note: newer gensim versions use epochs= instead of steps= in infer_vector)
output = open(output_file, "w")
for d in test_docs:
    output.write( " ".join([str(x) for x in m.infer_vector(d, alpha=start_alpha, steps=infer_epoch)]) + "\n" )
output.flush()
output.close()


"""
output file
0.03772797 0.07995503 -0.1598981 0.04817521 0.033129826 -0.06923918 0.12705861 -0.06330753 .........
"""

So we got an output file with vectors (one per paragraph). That means we successfully converted our text to vectors. Now we can use them with different machine learning algorithms such as text classification, text clustering and many others. The next section shows an example with the Birch clustering algorithm.
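If we want to reuse these inferred vectors later (for example as features for a classifier), we can load the output file back into a NumPy array. This is a small sketch assuming the whitespace-separated format written above and the same file name:

import numpy as np

#each row of the file is one paragraph vector, values separated by spaces
vectors = np.loadtxt("data/test_vectors.txt")
print (vectors.shape)   #(number of test docs, embedding dimension)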

Using Pretrained doc2vec Model for Text Clustering (Birch Algorithm)

In this example we use the Birch clustering algorithm for clustering the text data file from [6].
Birch is an unsupervised algorithm that is used for hierarchical clustering. An advantage of this algorithm is its ability to incrementally and dynamically cluster incoming data. [7]
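To illustrate this incremental ability, here is a small sketch (not part of the original script) that feeds sklearn's Birch implementation two batches of toy vectors via partial_fit before predicting cluster labels; the numbers are made up for the example.

import numpy as np
from sklearn.cluster import Birch

brc = Birch(n_clusters=2, threshold=0.5)

#toy 2-dimensional vectors arriving in two batches
batch1 = np.array([[0.0, 0.1], [0.2, 0.0], [5.0, 5.1]])
batch2 = np.array([[5.2, 4.9], [0.1, 0.2], [4.8, 5.0]])

brc.partial_fit(batch1)   #cluster the first batch
brc.partial_fit(batch2)   #update the tree with the new incoming batch

print (brc.predict(np.vstack([batch1, batch2])))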

We use the following steps here:

  • Load doc2vec model
  • Load text docs that will be clustered
  • Convert docs to vectors (infer_vector)
  • Do clustering
from sklearn import metrics

import gensim.models as g
import codecs


model="doc2vec/doc2vec.bin"
test_docs="data/test_docs.txt"

#inference hyper-parameters
start_alpha=0.01
infer_epoch=1000

#load model
m = g.Doc2Vec.load(model)
test_docs = [ x.strip().split() for x in codecs.open(test_docs, "r", "utf-8").readlines() ]

print (test_docs)
"""
[['the', 'cardigan', 'welsh', 'corgi'........
"""

X=[]
for d in test_docs:
    
    X.append( m.infer_vector(d, alpha=start_alpha, steps=infer_epoch) )
   

k=3

from sklearn.cluster import Birch

brc = Birch(branching_factor=50, n_clusters=k, threshold=0.1, compute_labels=True)
brc.fit(X)

clusters = brc.predict(X)

labels = brc.labels_


print ("Clusters: ")
print (clusters)


silhouette_score = metrics.silhouette_score(X, labels, metric='euclidean')

print ("Silhouette_score: ")
print (silhouette_score)

"""
Clusters: 
[1 0 0 1 1 2 1 0 1 1]
Silhouette_score: 
0.17644188
"""

If you want to experiment with text clustering and word embeddings, here is the online demo. Currently it uses word2vec and GloVe models and the k-means clustering algorithm. Select the 'Text Clustering' option and scroll down to input the data.

Conclusion

We looked at what doc2vec is and investigated 2 ways to load this model: we can create the embedding model file from our own text or use a pretrained embedding file. We then applied doc2vec vectors to the Birch algorithm for text clustering. When we need to work with paragraphs, sentences or documents, doc2vec simplifies converting text to vectors compared with word-level embeddings.

References
1. Distributed Representations of Sentences and Documents
2. What is doc2vec?
3. Anything to Vec
4. Anything2Vec, or How Word2Vec Conquered NLP
5. models.doc2vec – Doc2vec paragraph embeddings
6. doc2vec
7. BIRCH

Text Clustering with Word Embedding in Machine Learning


Text clustering is widely used in many applications such as recommender systems, sentiment analysis, topic selection and user segmentation. Word embeddings (for example word2vec) allow us to exploit the ordering of the words and the semantic information from the text corpus. In this blog you can find several posts dedicated to different word embedding models:

GloVe –
How to Convert Word to Vector with GloVe and Python
fastText –
FastText Word Embeddings
word2vec –
Vector Representation of Text – Word Embeddings with word2vec
word2vec application –
Text Analytics Techniques with Embeddings
Using Pretrained Word Embeddings in Machine Learning
K Means Clustering Example with Word2Vec in Data Mining or Machine Learning

In contrast to the last post from the above list, in this post we will discover how to do text clustering with word embeddings at the sentence (phrase) level. A sentence here could be a few words, a phrase or a paragraph such as a tweet. For example, we might have 1000 tweets and want to group them into several clusters, so that each cluster contains one or more related tweets.

Data

Our data will be a set of sentences (phrases) covering 2 topics, as shown below.
Note: 3 sentences (highlighted in bold in the original post) are on the weather topic; all other sentences are on a totally different topic.
sentences = [['this', 'is', 'the', 'one', 'good', 'machine', 'learning', 'book'],
['this', 'is', 'another', 'book'],
['one', 'more', 'book'],
['weather', 'rain', 'snow'],
['yesterday', 'weather', 'snow'],
['forecast', 'tomorrow', 'rain', 'snow'],

['this', 'is', 'the', 'new', 'post'],
['this', 'is', 'about', 'more', 'machine', 'learning', 'post'],
['and', 'this', 'is', 'the', 'one', 'last', 'post', 'book']]

Word Embedding Method

For embeddings we will use the gensim word2vec model. There is also a doc2vec model, but we will use it in the next post.
Because we need to do text clustering at the sentence level, there is one extra step to move from the word level to the sentence level. For each sentence in the set, the word embeddings of its words are summed and then divided by the number of words in the sentence. So we get the average of all word embeddings for each sentence and use these averages as we would use embeddings at the word level, feeding them to a machine learning clustering algorithm such as k-means.

Here is the function that does this:

def sent_vectorizer(sent, model):
    # average the word vectors of all words in the sentence
    sent_vec = []
    numw = 0
    for w in sent:
        try:
            if numw == 0:
                sent_vec = model[w]
            else:
                sent_vec = np.add(sent_vec, model[w])
            numw += 1
        except:
            # skip words that are not in the model vocabulary
            pass

    return np.asarray(sent_vec) / numw

Now we will use the k-means text clustering algorithm with word2vec embeddings. For k-means we will use 2 separate implementations from different libraries: NLTK's KMeansClusterer and sklearn's cluster module. This was described in previous posts (see the list above).

The code for this article can be found at the end of this post. We use 2 as the number of clusters in both k-means text clustering algorithms.
Additionally we will plot the data using t-SNE.

Output

Below are the results:

[1, 1, 1, 0, 0, 0, 1, 1, 1]

Cluster id and sentence:
1:['this', 'is', 'the', 'one', 'good', 'machine', 'learning', 'book']
1:['this', 'is', 'another', 'book']
1:['one', 'more', 'book']
0:['weather', 'rain', 'snow']
0:['yesterday', 'weather', 'snow']
0:['forecast', 'tomorrow', 'rain', 'snow']

1:['this', 'is', 'the', 'new', 'post']
1:['this', 'is', 'about', 'more', 'machine', 'learning', 'post']
1:['and', 'this', 'is', 'the', 'one', 'last', 'post', 'book']

Score (Opposite of the value of X on the K-means objective which is Sum of distances of samples to their closest cluster center):
-0.0008175040203510163
Silhouette_score:
0.3498247

Cluster id and sentence:
1 ['this', 'is', 'the', 'one', 'good', 'machine', 'learning', 'book']
1 ['this', 'is', 'another', 'book']
1 ['one', 'more', 'book']
0 ['weather', 'rain', 'snow']
0 ['yesterday', 'weather', 'snow']
0 ['forecast', 'tomorrow', 'rain', 'snow']

1 ['this', 'is', 'the', 'new', 'post']
1 ['this', 'is', 'about', 'more', 'machine', 'learning', 'post']
1 ['and', 'this', 'is', 'the', 'one', 'last', 'post', 'book']

Results of text clustering

We see that the data were clustered according to our expectations: sentences on different topics ended up in different clusters. Thus we learned how to do clustering in data mining or machine learning with word embeddings at the sentence level. Here we used k-means clustering and the word2vec embedding model, and we created an additional function to go from word embeddings to sentence embeddings. In the next post we will use doc2vec and will not need this function.

Below is the full python source code of the script.

from gensim.models import Word2Vec
 
from nltk.cluster import KMeansClusterer
import nltk
import numpy as np 
 
from sklearn import cluster
from sklearn import metrics
 
# training data
 
sentences = [['this', 'is', 'the', 'one','good', 'machine', 'learning', 'book'],
            ['this', 'is',  'another', 'book'],
            ['one', 'more', 'book'],
            ['weather', 'rain', 'snow'],
            ['yesterday', 'weather', 'snow'],
            ['forecast', 'tomorrow', 'rain', 'snow'],
            ['this', 'is', 'the', 'new', 'post'],
            ['this', 'is', 'about', 'more', 'machine', 'learning', 'post'],  
            ['and', 'this', 'is', 'the', 'one', 'last', 'post', 'book']]
 
 

model = Word2Vec(sentences, min_count=1)

 
def sent_vectorizer(sent, model):
    sent_vec =[]
    numw = 0
    for w in sent:
        try:
            if numw == 0:
                sent_vec = model[w]
            else:
                sent_vec = np.add(sent_vec, model[w])
            numw+=1
        except:
            pass
    
    return np.asarray(sent_vec) / numw
 
 
X=[]
for sentence in sentences:
    X.append(sent_vectorizer(sentence, model))   

print ("========================")
print (X)


 

# note with some version you would need use this (without wv) 
#  model[model.vocab] 
print (model[model.wv.vocab])


 

print (model.similarity('post', 'book'))                               # newer gensim: model.wv.similarity
print (model.most_similar(positive=['machine'], negative=[], topn=2))  # newer gensim: model.wv.most_similar
 
 

 
 
NUM_CLUSTERS=2
kclusterer = KMeansClusterer(NUM_CLUSTERS, distance=nltk.cluster.util.cosine_distance, repeats=25)
assigned_clusters = kclusterer.cluster(X, assign_clusters=True)
print (assigned_clusters)
 
 
 
for index, sentence in enumerate(sentences):    
    print (str(assigned_clusters[index]) + ":" + str(sentence))

    
    
    
kmeans = cluster.KMeans(n_clusters=NUM_CLUSTERS)
kmeans.fit(X)
 
labels = kmeans.labels_
centroids = kmeans.cluster_centers_
 
print ("Cluster id labels for inputted data")
print (labels)
print ("Centroids data")
print (centroids)
 
print ("Score (Opposite of the value of X on the K-means objective which is Sum of distances of samples to their closest cluster center):")
print (kmeans.score(X))
 
silhouette_score = metrics.silhouette_score(X, labels, metric='euclidean')
 
print ("Silhouette_score: ")
print (silhouette_score)


import matplotlib.pyplot as plt

from sklearn.manifold import TSNE

model = TSNE(n_components=2, random_state=0)
np.set_printoptions(suppress=True)

Y=model.fit_transform(X)


plt.scatter(Y[:, 0], Y[:, 1], c=assigned_clusters, s=290,alpha=.5)


for j in range(len(sentences)):    
   plt.annotate(assigned_clusters[j],xy=(Y[j][0], Y[j][1]),xytext=(0,0),textcoords='offset points')
   print ("%s %s" % (assigned_clusters[j],  sentences[j]))


plt.show()

Topic Modeling Python and Textacy Example

Topic modeling is the automatic discovery of the abstract "topics" that occur in a collection of documents. [1] It can be used to provide a more informative view of search results, a quick overview of a set of documents, or other services.

Textacy

In this post we will look at topic modeling with textacy. Textacy is a Python library for performing a variety of natural language processing (NLP) tasks, built on the high-performance spaCy library.
It can flexibly tokenize and vectorize documents and corpora, then train, interpret, and visualize topic models using LSA, LDA, or NMF methods. [2]
Textacy is less known than other python libraries such as NLTK, spaCy and TextBlob [3], but it looks very promising as it is built on top of spaCy.

In this post we will use textacy for the following task: we have a group of documents and we want to extract topics out of this set of documents. We will use the 20 Newsgroups dataset as the source of documents.

Code Structure

Our code consists of the following steps:

  • Get data. We will use only 2 groups ('alt.atheism', 'soc.religion.christian').
  • Tokenize and remove unneeded characters and stopwords.
  • Vectorize.
  • Extract topics. Here we do the actual topic modeling, using the Non-negative Matrix Factorization (NMF) method.
  • Output a graph of the term-topic matrix.

Output

Below is the final output plot.

Topic modeling with textacy
Topic modeling with textacy

Looking at the output graph we can see the term distribution over the topics. We identified more than 2 topics. For example, topic 2 is associated with atheism, while topic 1 is associated with God and religion.

While better data preparation is needed to remove a few more non-meaningful words, the example still shows that topic modeling with textacy is easier than with some other modules (for example gensim). This is because textacy can do many of the things that you need after the NLP step, rather than just doing the NLP and leaving the user to add additional data views, heatmaps or diagrams.

Here are a few links to topic modeling with LDA and gensim (not using textacy). The posts demonstrate that more coding is required compared with textacy.
Topic Extraction from Blog Posts with LSI , LDA and Python
Data Visualization – Visualizing an LDA Model using Python

Source Code

Below is python full source code.

categories = ['alt.atheism', 'soc.religion.christian'] 

#Loading the data set - training data.
from sklearn.datasets import fetch_20newsgroups
 
newsgroups_train = fetch_20newsgroups(subset='train', shuffle=True, categories=categories, remove=('headers', 'footers', 'quotes'))
 
# You can check the target names (categories) and some data files by following commands.
print (newsgroups_train.target_names) #prints all the categories
print("\n".join(newsgroups_train.data[0].split("\n")[:3])) #prints first line of the first data file
print (newsgroups_train.target_names)
print (len(newsgroups_train.data))
 
texts = []
 
labels=newsgroups_train.target
texts = newsgroups_train.data

from nltk.corpus import stopwords

import textacy
from textacy.vsm import Vectorizer

stop_words = set(stopwords.words('english'))
terms_list = [[tok for tok in doc.split() if tok not in stop_words] for doc in texts]
 

# remove some leftover markup tokens and a few non-informative words
# (removing items from a list while iterating over it skips elements,
#  so we rebuild each document with a filter instead)
unwanted = {"|>", "_", "-", "#", "=", ":", "_/",
            "I", "A", "The", "But", "If", "It"}
terms_list = [[word for word in doc if word not in unwanted] for doc in terms_list]
      

print ("=====================terms_list===============================")
print (terms_list)


vectorizer = Vectorizer(tf_type='linear', apply_idf=True, idf_type='smooth')
doc_term_matrix = vectorizer.fit_transform(terms_list)


print ("========================doc_term_matrix)=======================")
print (doc_term_matrix)



#initialize and train a topic model:
model = textacy.tm.TopicModel('nmf', n_topics=20)
model.fit(doc_term_matrix)

print ("======================model=================")
print (model)

doc_topic_matrix = model.transform(doc_term_matrix)
for topic_idx, top_terms in model.top_topic_terms(vectorizer.id_to_term, topics=[0,1]):
          print('topic', topic_idx, ':', '   '.join(top_terms))

for i, val in enumerate(model.topic_weights(doc_topic_matrix)):
     print(i, val)
     
     
print   ("doc_term_matrix")     
print   (doc_term_matrix)   
print ("vectorizer.id_to_term")
print (vectorizer.id_to_term)
         

model.termite_plot(doc_term_matrix, vectorizer.id_to_term, topics=-1,  n_terms=25, sort_terms_by='seriation')  
model.save('nmf-10topics.pkl')        


References
1. Topic Model
2. textacy: NLP, before and after spaCy
3. 5 Heroic Python NLP Libraries

Text Mining Techniques for Search Results Clustering

A text search box can be found in almost every web based application that has text data. We use the search feature when we are looking for customer data, job descriptions, book reviews or other information. Simple keyword matching can be enough for some small tasks. However, when we have many results, something better than keyword matching would be very helpful. Instead of going through a long list of results, we would get results grouped by topic with a nice summary of the topics, which lets us see the information at first sight.

In this post we will look at some machine learning algorithms, applications and frameworks that can analyze the output of a search function and provide useful additional information for search results.

Machine Learning Clustering for Search Results

The search results clustering problem is defined as the automatic, on-line grouping of similar documents in a search results list returned from a search engine. [1] Carrot2 is a tool that was built to solve this problem.
Carrot2 is an open source framework for building a search results clustering engine. This tool can search, cluster and visualize clusters, which is very useful. I was not able to find a similar open source tool; if you are aware of one, please suggest it in the comment box.

Below are screenshots of clustering search results from Carrot2

Clustering search results with Carrot2
Aduna cluster map visualization of clusters with Carrot2

The following algorithms are behind the Carrot2 tool:
The Lingo algorithm constructs a "term-document matrix" where each snippet gets a column, each word a row, and the values are the frequency of that word in that snippet. It then applies a matrix factorization called singular value decomposition, or SVD. [3]

Suffix Tree Clustering (STC) uses the generalised suffix tree data structure to efficiently build a list of the most frequently used phrases in the snippets from the search results. [3]
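To make the term-document matrix idea concrete, here is a small sketch (not Carrot2 code) that builds such a matrix from a few toy snippets with scikit-learn and factorizes it with truncated SVD, the decomposition Lingo relies on:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import TruncatedSVD

snippets = [
    "python machine learning tutorial",
    "machine learning with python examples",
    "best italian pasta recipe",
    "easy pasta recipe for dinner",
]

#rows = snippets, columns = words, values = word counts
vectorizer = CountVectorizer()
term_doc = vectorizer.fit_transform(snippets)

#SVD compresses the matrix into a few latent "concepts"
svd = TruncatedSVD(n_components=2)
concepts = svd.fit_transform(term_doc)

print (vectorizer.get_feature_names_out())   #older sklearn versions: get_feature_names()
print (concepts)   #each snippet described by 2 concept weights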

Topic modelling

Topic modelling is another approach that can be used to identify which topics are discussed in the documents or text snippets returned by a search function. There are several methods, such as LSA, pLSA and LDA. [11]

A comprehensive overview of topic modeling and its associated techniques is given in [12].

Topic modeling can be represented with the diagram below: our goal is to identify the topics, given the documents with their words.

Topic modeling diagram

Below is plate notation of LDA model.

Plate notation of LDA model

Plate notation representing the LDA model [19]:
α (alpha) is the parameter of the Dirichlet prior on the per-document topic distributions,
β (beta) is the parameter of the Dirichlet prior on the per-topic word distribution,
θ_m is the topic distribution for document m,
z_mn is the topic for the n-th word in document m, and
w_mn is the specific word.
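As a quick illustration of LDA in practice, here is a minimal gensim sketch (toy data, not tied to any particular search engine) that learns 2 topics from a handful of tokenized snippets:

from gensim import corpora
from gensim.models import LdaModel

docs = [
    ["machine", "learning", "model", "training"],
    ["deep", "learning", "neural", "network"],
    ["rain", "snow", "weather", "forecast"],
    ["sunny", "weather", "tomorrow", "forecast"],
]

dictionary = corpora.Dictionary(docs)                #word <-> id mapping
corpus = [dictionary.doc2bow(doc) for doc in docs]   #bag-of-words per document

lda = LdaModel(corpus, num_topics=2, id2word=dictionary, passes=20, random_state=1)
for topic_id, terms in lda.print_topics(num_words=4):
    print (topic_id, terms)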

We can use different NLP libraries (NLTK, spaCy, gensim, textacy) for topic modeling.
Here is an example of topic modeling with the textacy python library:
Topic Modeling Python and Textacy Example

Here are examples of topic modeling with gensim library:
Topic Extraction from Blog Posts with LSI , LDA and Python
Data Visualization – Visualizing an LDA Model using Python

Using Word Embeddings

Word embeddings such as word2vec and GloVe (for example trained or loaded with gensim) have shown very good results in NLP and are now widely used. They can also be used for search results clustering. The first step is to build or load an embedding model; in the next step the text data are converted to a vector representation. Word embeddings improve performance by leveraging information on how words are semantically correlated to each other. [7][10]

Neural Topic Model (NTM) and Other Approaches

Below are some other approaches that can be used for topic modeling for search results organizing.
Neural topic modeling combines a neural network with a latent topic model. [14]
Topic modeling with Deep Belief Nets is described in [17]. The idea of the method is to take bag-of-words (BOW) input and produce a strong latent representation that is then used for a content based recommender system. The authors report that the model outperforms the LDA, Replicated Softmax and DocNADE models on document retrieval and document classification tasks.

Thus we looked at different techniques for search results clustering. In future posts we will implement some of them. What machine learning methods do you use for presenting search results? I would love to hear about them.

References
1. Lingo Search Results Clustering Algorithm
2. Carrot2 Algorithms
3. Carrot2
4. Apache SOLR and Carrot2 integration strategies
5. Topical Clustering of Search Results
6. K-means clustering for text dataset
7. Document Clustering using Doc2Vec/word2vec
8. Automatic Topic Clustering Using Doc2Vec
9. Search Results Clustering Algorithm
10. LDA2vec: Word Embeddings in Topic Models
11. Topic Modelling in Python with NLTK and Gensim
12. Topic Modeling with LSA, PLSA, LDA & lda2Vec
13. Text Summarization with Amazon Reviews
14. A Hybrid Neural Network-Latent Topic Model
15. docluster
16. Deep Belief Nets for Topic Modeling
17. Modeling Documents with a Deep Boltzmann Machine
18. Beginners guide to topic modeling in python
19. Latent Dirichlet allocation

Text Classification of Different Datasets with CNN Convolutional Neural Network and Python

In this post we explore machine learning text classification of 3 text datasets using a CNN (Convolutional Neural Network) in Keras and python. As reported in papers and blogs over the web, convolutional neural networks give good results in text classification.

Datasets

We will use the following datasets:
1. The 20 newsgroups text dataset that is available from scikit-learn here.
2. A dataset of web pages. The web documents were downloaded manually from the web and belong to two categories: text mining or hidden Markov models (HMM). This is a small dataset that consists of only 20 pages for the text mining group and 11 pages for the HMM group.
3. A dataset of tweets about New Year's resolutions, obtained from data.world/crowdflower here.

Convolutional Neural Network Architecture

Our CNN is based on Richard Liao's code from [1], [2]. We use a convolutional neural network built with layers such as Embedding, Conv1D, Flatten and Dense. For the embedding layer we utilize pretrained GloVe vectors that can be downloaded from the web.

The data flow diagram with layers used is shown below.

CNN diagram

Here is the code for producing a convolutional neural net diagram like the one above. Insert it after the model.fit(…) line. Note that it requires installation of pydot and graphviz.

model.fit(.....)

import pydot
pydot.find_graphviz = lambda: True
print (pydot.find_graphviz())

import os
os.environ["PATH"] += os.pathsep + "C:\\Program Files (x86)\\Graphviz2.38\\bin"

from keras.utils import plot_model
plot_model(model, to_file='model.png')
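For orientation, here is a minimal sketch of such a layer stack (Embedding, Conv1D, Flatten, Dense) in Keras. It is a simplified illustration, not the exact architecture from [1], [2]; the vocabulary size, sequence length and GloVe embedding matrix are placeholder assumptions here.

import numpy as np
from keras.models import Sequential
from keras.layers import Embedding, Conv1D, MaxPooling1D, Flatten, Dense

EMBEDDING_DIM = 100          #dimension of the pretrained GloVe vectors
MAX_SEQUENCE_LENGTH = 1000   #padded length of each document (assumed)
vocab_size = 20000           #assumed vocabulary size
embedding_matrix = np.zeros((vocab_size, EMBEDDING_DIM))   #would be filled from the GloVe file

model = Sequential()
#embedding layer initialized with the pretrained GloVe matrix and frozen
model.add(Embedding(vocab_size, EMBEDDING_DIM,
                    weights=[embedding_matrix],
                    input_length=MAX_SEQUENCE_LENGTH,
                    trainable=False))
model.add(Conv1D(128, 5, activation='relu'))
model.add(MaxPooling1D(5))
model.add(Flatten())
model.add(Dense(128, activation='relu'))
model.add(Dense(2, activation='softmax'))   #2 categories

model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['acc'])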

1D Convolution

In our neural net, convolution is performed by several 1-dimensional convolution layers (Conv1D).
1D convolution means that just one direction is used to calculate the convolution. [3]
For example:
input = [1,1,1,1,1], filter = [0.25,0.5,0.25], output = [1,1,1,1,1]
The output shape is a 1D array.
We can also apply 1D convolution to a 2D data matrix, as we do in text classification.
A good explanation of convolution for text can be found in [6].
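We can check that small numeric example with NumPy; np.convolve in 'valid' mode slides the filter over the interior positions only:

import numpy as np

signal = np.array([1, 1, 1, 1, 1])
kernel = np.array([0.25, 0.5, 0.25])

#'valid' keeps only positions where the filter fully overlaps the input
print (np.convolve(signal, kernel, mode='valid'))   #-> [1. 1. 1.]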

Text Classification of the 20 Newsgroups Text Dataset

For this dataset we use only 2 categories. The script is provided here. The accuracy of the network is about 87%, trained on 864 samples and validated on 215 samples.
Summary of run: loss: 0.6205 – acc: 0.6632 – val_loss: 0.5122 – val_acc: 0.8651

Document Classification of Web Pages

Here we also use 2 categories. The python script is provided here.

The web pages were manually downloaded from the web and saved locally in two folders, one for each category. The script loads the web page files from local storage. The next step is preprocessing to remove web tags while keeping the text content. Here is the function for this:

def get_only_text_from_html_doc(page):
 """ 
  return the title and the text of the article
 """
 
 soup = BeautifulSoup(page, "lxml")
 text = ' '.join(map(lambda p: p.text, soup.find_all('p')))
 return soup.title.text + " " + text  

Accuracy on this dataset was 100% but was not consistent; in some other runs the result was only 83%. Trained on 25 samples, validated on 6 samples.
Summary of run: loss: 0.0096 – acc: 1.0000 – val_loss: 0.0870 – val_acc: 1.0000

Text Classification of Tweet Dataset

The script is provided here.
Here the accuracy was about 93%. Trained on 4010 samples, validated on 1002 samples.
Summary of run: loss: 0.0193 – acc: 0.9958 – val_loss: 0.6690 – val_acc: 0.9281

Conclusion

We learned how to do text classification for 3 different types of text datasets (newsgroups, tweets, web documents). For text classification we used a Convolutional Neural Network in python, and on all 3 datasets we got good accuracy.

References

1. Text Classification, Part I – Convolutional Networks
2. textClassifierConv
3. What do you mean by 1D, 2D and 3D Convolutions in CNN?
4. How to implement Sentiment Analysis using word embedding and Convolutional Neural Networks on Keras
5. Understanding Convolutional Neural Networks for NLP
6. Understanding Convolutions in Text
7. Recurrent Neural Networks I

Automatic Text Summarization Online

In the previous post, Automatic Text Summarization with Python, I showed how to use different python libraries for text summarization. Recently I added text summarization modules to the Online Machine Learning Algorithms site. So now you can play with text summarization modules online and select the best summary generator. This service is a free tool that allows you to run some algorithms without coding or installing software modules.

Below are the steps for using the online text summarizer models of the Machine Learning Algorithms tool.

How to use online text summarizer algorithms

1. Access the Online Machine Learning Algorithms tool and select the text summarization algorithm that you want to run. There is one available based on gensim and 3 based on the sumy python module. We will use the Luhn text summarizer algorithm. The algorithms from the gensim and sumy python modules are still widely used in automatic text summarization, which is part of the field of natural language processing.

Running online text summarization – step 1

2. Input the data that you want to run, or click on Load Default Values. Note that you need to enter at least about 10 sentences; it will not work if you enter just a few words or a single sentence.

Running online text summarization step2

3. Click Run now.

4. Click View Run Results link.

Running online text summarization – example of output

5. Click the Refresh Page button on this new page; you may need to click a few times until the data output shows up. Usually it takes less than 1 minute, but it depends on how much data you need to process.
Scroll to the bottom of the page to see the results.

If you try other text summarizers from this online tool you will see that there are some differences in generated text summaries.

End Notes

In this post we covered how to use the online text summarizer models of the Machine Learning Algorithms tool available here. You can run online algorithms from the gensim and sumy python modules.
Feel free to provide comments or suggestions.

Document Similarity, Tokenization and Word Vectors in Python with spaCy

Calculating document similarity is a very frequent task in information retrieval and text mining. Years ago we would need to build a document-term matrix (or term-document matrix) that describes the frequency of terms occurring in a collection of documents and then do word vector math to find similarity. Now, by using spaCy, it can be done within just a few lines. Below you will find how to get document similarity, tokenization and word vectors with spaCy.

spaCy is an open-source library designed to help you build NLP applications. It has a lot of features; in this post we will look only at a few very useful ones.

Document Similarity

Here is how to get document similarity:

import spacy
nlp = spacy.load('en')   # on newer spaCy versions load a named model, e.g. spacy.load('en_core_web_md')

doc1 = nlp(u'Hello this is document similarity calculation')
doc2 = nlp(u'Hello this is python similarity calculation')
doc3 = nlp(u'Hi there')

print (doc1.similarity(doc2)) 
print (doc2.similarity(doc3)) 
print (doc1.similarity(doc3))  

Output:
0.94
0.33
0.30

In more realistic situations we would load documents from files and would have longer texts. Here is an experiment that I performed: I saved 3 articles from different random sites, two about deep learning and one about feature engineering.

def get_file_contents(filename):
  with open(filename, 'r') as filehandle:  
    filecontent = filehandle.read()
    return (filecontent) 

fn1="deep_learning1.txt"
fn2="feature_eng.txt"
fn3="deep_learning.txt"

fn1_doc=get_file_contents(fn1)
print (fn1_doc)

fn2_doc=get_file_contents(fn2)
print (fn2_doc)

fn3_doc=get_file_contents(fn3)
print (fn3_doc)
 
doc1 = nlp(fn1_doc)
doc2 = nlp(fn2_doc)
doc3 = nlp(fn3_doc)
 
print ("dl1 - features")
print (doc1.similarity(doc2)) 
print ("feature - dl")
print (doc2.similarity(doc3)) 
print ("dl1 - dl")
print (doc1.similarity(doc3)) 
 
"""
output:
dl1 - features
0.9700237040142454
feature - dl
0.9656364096761337
dl1 - dl
0.9547075478662724
"""


It was able to assign a higher similarity score to documents with similar topics!

Tokenization

Another very useful and simple feature of spaCy is tokenization. Here is how easy it is to convert text into tokens (words):

for token in doc1:
    print(token.text)
    print (token.vector)

Word Vectors

spaCy has integrated word vector support, while other libraries like NLTK do not. The lines below will print word embeddings – an array of 768 numbers in my environment.

 
print (token.vector)   #-  prints word vector form of token. 
print (doc1[0].vector) #- prints word vector form of first token of document.
print (doc1.vector)    #- prints mean vector form for doc1
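Under the hood, the similarity() calls above are essentially cosine similarity between these vectors. Here is a small sketch that reproduces the idea with NumPy, assuming doc1 and doc2 from the earlier snippet:

import numpy as np

def cosine_similarity(v1, v2):
    #cosine of the angle between the two vectors
    return np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))

print (cosine_similarity(doc1.vector, doc2.vector))   #close to doc1.similarity(doc2)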

So we looked at how to use a few features (similarity, tokenization and word embeddings) which are very easy to use with spaCy. I hope you enjoyed this post. If you have any tips or anything else to add, please leave a comment below.

References
1. spaCy
2. Word Embeddings in Python with Spacy and Gensim

Automatic Text Summarization with Python

Automatic text summarization is the process of shortening a text document with software, in order to create a summary with the major points of the original document. The main idea of summarization is to find a subset of data which contains the “information” of the entire set. Such techniques are widely used in industry today. [1]

In this post we will review several methods of implementing text data summarization techniques with python. We will use different python libraries.

Text Summarization with Gensim

1. Our first example uses gensim – a well known python library for topic modeling. Below is an example with summarization.summarizer from gensim. This module provides functions for summarizing texts. Summarizing is based on ranking text sentences using a variation of the TextRank algorithm. [2] Note that the summarization module was removed in gensim 4.0, so this example requires gensim 3.x.

TextRank is a general purpose graph-based ranking algorithm for NLP. Essentially, it runs PageRank on a graph specially designed for a particular NLP task. For keyphrase extraction, it builds a graph using some set of text units as vertices. Edges are based on some measure of semantic or lexical similarity between the text unit vertices[1].

 
from gensim.summarization.summarizer import summarize
from gensim.summarization import keywords

import requests

# getting text document from Internet
text = requests.get('http://rare-technologies.com/the_matrix_synopsis.txt').text


# getting text document from file
fname="C:\\Users\\TextRank-master\\wikipedia_deep_learning.txt"
with open(fname, 'r') as myfile:
      text=myfile.read()
    
    
#getting text document from web, below function based from 3
from bs4 import BeautifulSoup
from urllib.request import urlopen

def get_only_text(url):
 """ 
  return the title and the text of the article
  at the specified url
 """
 page = urlopen(url)
 soup = BeautifulSoup(page, "lxml")
 text = ' '.join(map(lambda p: p.text, soup.find_all('p')))
 return soup.title.text, text    

 
print ('Summary:')
print (summarize(text, ratio=0.01))

print ('\nKeywords:')
print (keywords(text, ratio=0.01))

url="https://en.wikipedia.org/wiki/Deep_learning"
text = get_only_text(url)

print ('Summary:')   
print (summarize(str(text), ratio=0.01))

print ('\nKeywords:')

# higher ratio => more keywords
print (keywords(str(text), ratio=0.01))

Here is the result for link https://en.wikipedia.org/wiki/Deep_learning
Summary:
In 2003, LSTM started to become competitive with traditional speech recognizers on certain tasks.[55] Later it was combined with connectionist temporal classification (CTC)[56] in stacks of LSTM RNNs.[57] In 2015, Google\’s speech recognition reportedly experienced a dramatic performance jump of 49% through CTC-trained LSTM, which they made available through Google Voice Search.[58] In the early 2000s, CNNs processed an estimated 10% to 20% of all the checks written in the US.[59] In 2006, Hinton and Salakhutdinov showed how a many-layered feedforward neural network could be effectively pre-trained one layer at a time, treating each layer in turn as an unsupervised restricted Boltzmann machine, then fine-tuning it using supervised backpropagation.[60] Deep learning is part of state-of-the-art systems in various disciplines, particularly computer vision and automatic speech recognition (ASR).

Keywords:
deep learning
learned
learn
learns
layer
layered
layers
models
model
modeling
images
image
recognition
data
networks
network
trained
training
train
trains

Text Summarization using NLTK and Frequencies of Words

2. Our 2nd method is the word frequency analysis provided on The Glowing Python blog [3]. Below is an example of how it can be used. Note that you need to take the FrequencySummarizer code from [3] and put it in a separate file named FrequencySummarizer.py in the same folder as this script. The code uses the NLTK library.

 
#note FrequencySummarizer is need to be copied from
# https://glowingpython.blogspot.com/2014/09/text-summarization-with-nltk.html
# and saved as FrequencySummarizer.py in the same folder that this
# script
from FrequencySummarizer import FrequencySummarizer


from bs4 import BeautifulSoup
from urllib.request import urlopen


def get_only_text(url):
 """ 
  return the title and the text of the article
  at the specified url
 """
 
 page = urlopen(url)
 soup = BeautifulSoup(page, "lxml")
 text = ' '.join(map(lambda p: p.text, soup.find_all('p')))
 
 print ("=====================")
 print (text)
 print ("=====================")

 return soup.title.text, text    

    
url="https://en.wikipedia.org/wiki/Deep_learning"
text = get_only_text(url)    

fs = FrequencySummarizer()
s = fs.summarize(str(text), 5)
print (s)

3. Here is a link to another example of building a summarizer with python and NLTK.
This summarizer is also based on word frequencies: it creates a frequency table of words (how many times each word appears in the text) and assigns a score to each sentence depending on the words it contains and the frequency table.
The summary is then built only from the sentences above a certain score threshold. [6]
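Here is a compact sketch of that idea (my own illustration, not the code from [6]), using NLTK for sentence and word tokenization and stopwords:

from collections import defaultdict
from nltk.corpus import stopwords
from nltk.tokenize import sent_tokenize, word_tokenize

def frequency_summarize(text, threshold_factor=1.2):
    stop_words = set(stopwords.words('english'))

    #frequency table of words, ignoring stopwords and punctuation
    freq = defaultdict(int)
    for word in word_tokenize(text.lower()):
        if word.isalpha() and word not in stop_words:
            freq[word] += 1

    #score each sentence by the frequencies of the words it contains
    sentences = sent_tokenize(text)
    scores = {}
    for sent in sentences:
        scores[sent] = sum(freq.get(w, 0) for w in word_tokenize(sent.lower()))

    #keep only sentences scoring above a multiple of the average score
    average = sum(scores.values()) / len(scores)
    return " ".join(s for s in sentences if scores[s] > threshold_factor * average)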

Automatic Summarization Using Different Methods from Sumy

4. Our next example is based on the sumy python module – a module for automatic summarization of text documents and HTML pages. It is a simple library and command line utility for extracting a summary from HTML pages or plain texts. The package also contains a simple evaluation framework for text summaries. The implemented summarization methods are:

Luhn – heuristic method
Edmundson – heuristic method with previous statistic research
Latent Semantic Analysis (LSA)
LexRank – unsupervised approach inspired by the PageRank and HITS algorithms
TextRank
SumBasic – method that is often used as a baseline in the literature
KL-Sum – method that greedily adds sentences to a summary as long as it decreases the KL divergence [5]

Below is an example of how to use the different summarizers. The usage of most of them is similar, but for EdmundsonSummarizer we also need to enter bonus_words, stigma_words and null_words. Bonus words are the words that we want to see in the summary – they are the most informative and significant words. Stigma words are unimportant words. We can use tf-idf values from information retrieval to get the list of key words.

 
from __future__ import absolute_import
from __future__ import division, print_function, unicode_literals

from sumy.parsers.html import HtmlParser
from sumy.parsers.plaintext import PlaintextParser
from sumy.nlp.tokenizers import Tokenizer
from sumy.summarizers.lsa import LsaSummarizer
from sumy.nlp.stemmers import Stemmer
from sumy.utils import get_stop_words

from sumy.summarizers.luhn import LuhnSummarizer
from sumy.summarizers.edmundson import EdmundsonSummarizer   # found this to be the best, as it also picks sentences from the beginning while others skip them


LANGUAGE = "english"
SENTENCES_COUNT = 10


if __name__ == "__main__":
   
    url="https://en.wikipedia.org/wiki/Deep_learning"
  
    parser = HtmlParser.from_url(url, Tokenizer(LANGUAGE))
    # or for plain text files
    # parser = PlaintextParser.from_file("document.txt", Tokenizer(LANGUAGE))
   

       
    print ("--LsaSummarizer--")    
    summarizer = LsaSummarizer(Stemmer(LANGUAGE))
    summarizer.stop_words = get_stop_words(LANGUAGE)
    for sentence in summarizer(parser.document, SENTENCES_COUNT):
        print(sentence)
        
    print ("--LuhnSummarizer--")     
    summarizer = LuhnSummarizer(Stemmer(LANGUAGE))
    summarizer.stop_words = ("I", "am", "the", "you", "are", "me", "is", "than", "that", "this",)
    for sentence in summarizer(parser.document, SENTENCES_COUNT):
        print(sentence)
        
    print ("--EdmundsonSummarizer--")     
    summarizer = EdmundsonSummarizer() 
    words = ("deep", "learning", "neural" )
    summarizer.bonus_words = words
    
    words = ("another", "and", "some", "next",)
    summarizer.stigma_words = words
   
    
    words = ("another", "and", "some", "next",)
    summarizer.null_words = words
    for sentence in summarizer(parser.document, SENTENCES_COUNT):
        print(sentence)     

I hope you enjoyed this review of automatic text summarization methods with python. If you have any tips or anything else to add, please leave a comment below.

References
1. Automatic_summarization
2. Gensim
3. text-summarization-with-nltk
4. Nullege Python Search Code
5. sumy 0.7.0
6. Build a quick Summarizer with Python and NLTK
7. text-summarization-with-gensim

FastText Word Embeddings for Text Classification with MLP and Python

Word embeddings are now widely used in many text applications and natural language processing models. In previous posts I showed examples of how to use word embeddings from Google's word2vec and from GloVe models for different tasks, including machine learning clustering:

GloVe – How to Convert Word to Vector with GloVe and Python

word2vec – Vector Representation of Text – Word Embeddings with word2vec

word2vec application – K Means Clustering Example with Word2Vec in Data Mining or Machine Learning

In this post we will look at fastText word embeddings for machine learning. You will learn how to load pretrained fastText vectors, get text embeddings and do text classification. As stated on the fastText site, text classification is a core problem for many applications, such as spam detection, sentiment analysis or smart replies. [1]

What is fastText

fastText is an open-source, free, lightweight library that allows users to learn text representations and text classifiers. [1]

fastText was created by Facebook's AI Research (FAIR) lab. The model is an unsupervised learning algorithm for obtaining vector representations of words. Facebook makes pretrained models available for 294 languages. [2]

As explained on Quora [6], fastText treats each word as composed of character n-grams, so the vector for a word is the sum of the vectors of these character n-grams. word2vec (and GloVe) treat words as the smallest unit to train on. This means that fastText can generate better word embeddings for rare words, and it can also generate word embeddings for out-of-vocabulary words, which word2vec and GloVe cannot do.
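Note that the plain .vec file used below only contains the final word vectors, so out-of-vocabulary lookup is not available through it. To see the character n-gram behaviour you can train gensim's own FastText implementation on a small corpus; this is a toy sketch (in gensim 4.x the parameter is vector_size, in 3.x it was size):

from gensim.models import FastText

toy_sentences = [["machine", "learning", "book"],
                 ["another", "machine", "learning", "post"]]

ft = FastText(sentences=toy_sentences, vector_size=20, window=3, min_count=1, epochs=50)

#"learnings" was never seen, but its character n-grams were,
#so fastText can still build a vector for it
print (ft.wv["learnings"][:5])
print (ft.wv.similarity("learning", "learnings"))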

Word Embeddings File

I downloaded the file wiki-news-300d-1M.vec from [4], but there are other links where you can download different data files. I found that this one has a smaller size, so it is easier to work with.

Basic Operations with fastText Word Embeddings

To get most similar words to some word:

from gensim.models import KeyedVectors
model = KeyedVectors.load_word2vec_format('wiki-news-300d-1M.vec')
print (model.most_similar('desk'))

"""
[('desks', 0.7923153638839722), ('Desk', 0.6869951486587524), ('desk.', 0.6602819561958313), ('desk-', 0.6187258958816528), ('credenza', 0.5955315828323364), ('roll-top', 0.5875717401504517), ('rolltop', 0.5837830305099487), ('bookshelf', 0.5758029222488403), ('Desks', 0.5755287408828735), ('sofa', 0.5617446899414062)]
"""

Load words in vocabulary:

words = []
for word in model.vocab:    # note: in gensim 4.x use model.key_to_index instead of model.vocab
    words.append(word)

To see embeddings:

print("Vector components of a word: {}".format(
    model[words[0]]
))

"""
Vector components of a word: [-0.0451  0.0052  0.0776 -0.028   0.0289  0.0449  0.0117 -0.0333  0.1055
 .......................................
 -0.1368 -0.0058 -0.0713]
"""

The Problem

So here we will use fastText word embeddings for text classification of sentences. For this classification we will use the sklearn Multi-layer Perceptron classifier (MLP).
The sentences are prepared and inserted into the script as follows:

sentences = [['this', 'is', 'the', 'good', 'machine', 'learning', 'book'],
			['this', 'is',  'another', 'machine', 'learning', 'book'],
			['one', 'more', 'new', 'book'],
		
          ['this', 'is', 'about', 'machine', 'learning', 'post'],
          ['orange', 'juice', 'is', 'the', 'liquid', 'extract', 'of', 'fruit'],
          ['orange', 'juice', 'comes', 'in', 'several', 'different', 'varieties'],
          ['this', 'is', 'the', 'last', 'machine', 'learning', 'book'],
          ['orange', 'juice', 'comes', 'in', 'several', 'different', 'packages'],
          ['orange', 'juice', 'is', 'liquid', 'extract', 'from', 'fruit', 'on', 'orange', 'tree']]

The sentences belong to two classes; the labels for the classes will be assigned later as 0 and 1. So our problem is to classify the above sentences. Below is the flowchart of the program that we will use for this perceptron learning algorithm example.

Text classification using word embeddings

Data Preparation

I converted this text input into numeric vectors using the following code. Basically I got the word embeddings and averaged all words in each sentence. The resulting sentence vector representations were saved to the array V.

import numpy as np

def sent_vectorizer(sent, model):
    sent_vec =[]
    numw = 0
    for w in sent:
        try:
            if numw == 0:
                sent_vec = model[w]
            else:
                sent_vec = np.add(sent_vec, model[w])
            numw+=1
        except:
            pass
   
    return np.asarray(sent_vec) / numw


V=[]
for sentence in sentences:
    V.append(sent_vectorizer(sentence, model))   

After converting the text into vectors we can divide the data into training and testing datasets and attach class labels.

X_train = V[0:6]
X_test = V[6:9] 
          
Y_train = [0, 0, 0, 0, 1,1]
Y_test =  [0,1,1]   

Text Classification

Now it is time to feed the data to the MLP classifier for text classification.

import pandas as pd
from sklearn.neural_network import MLPClassifier

classifier = MLPClassifier(alpha=0.7, max_iter=400)
classifier.fit(X_train, Y_train)

df_results = pd.DataFrame(data=np.zeros(shape=(1,3)), columns = ['classifier', 'train_score', 'test_score'] )
train_score = classifier.score(X_train, Y_train)
test_score = classifier.score(X_test, Y_test)

print  (classifier.predict_proba(X_test))
print  (classifier.predict(X_test))

df_results.loc[1,'classifier'] = "MLP"
df_results.loc[1,'train_score'] = train_score
df_results.loc[1,'test_score'] = test_score

print(df_results)
     
"""
Output
  classifier  train_score  test_score
         MLP          1.0         1.0
"""

In this post we learned how to use pretrained fastText word embeddings to convert text data into vectors. We also looked at how to feed word embeddings into a machine learning algorithm, and at the end of the post we did machine learning text classification using an MLP classifier with our fastText word embeddings. You can find the full python source code and references below.

from gensim.models import KeyedVectors
import pandas as pd

model = KeyedVectors.load_word2vec_format('wiki-news-300d-1M.vec')
print (model.most_similar('desk'))

words = []
for word in model.vocab:
    words.append(word)

print("Vector components of a word: {}".format(
    model[words[0]]
))
sentences = [['this', 'is', 'the', 'good', 'machine', 'learning', 'book'],
			['this', 'is',  'another', 'machine', 'learning', 'book'],
			['one', 'more', 'new', 'book'],
	    ['this', 'is', 'about', 'machine', 'learning', 'post'],
          ['orange', 'juice', 'is', 'the', 'liquid', 'extract', 'of', 'fruit'],
          ['orange', 'juice', 'comes', 'in', 'several', 'different', 'varieties'],
          ['this', 'is', 'the', 'last', 'machine', 'learning', 'book'],
          ['orange', 'juice', 'comes', 'in', 'several', 'different', 'packages'],
          ['orange', 'juice', 'is', 'liquid', 'extract', 'from', 'fruit', 'on', 'orange', 'tree']]
         
import numpy as np

def sent_vectorizer(sent, model):
    sent_vec =[]
    numw = 0
    for w in sent:
        try:
            if numw == 0:
                sent_vec = model[w]
            else:
                sent_vec = np.add(sent_vec, model[w])
            numw+=1
        except:
            pass
   
    return np.asarray(sent_vec) / numw

V=[]
for sentence in sentences:
    V.append(sent_vectorizer(sentence, model))   
         
    
X_train = V[0:6]
X_test = V[6:9] 
Y_train = [0, 0, 0, 0, 1,1]
Y_test =  [0,1,1]    
    
    
from sklearn.neural_network import MLPClassifier
classifier = MLPClassifier(alpha = 0.7, max_iter=400) 
classifier.fit(X_train, Y_train)

df_results = pd.DataFrame(data=np.zeros(shape=(1,3)), columns = ['classifier', 'train_score', 'test_score'] )
train_score = classifier.score(X_train, Y_train)
test_score = classifier.score(X_test, Y_test)

print  (classifier.predict_proba(X_test))
print  (classifier.predict(X_test))

df_results.loc[1,'classifier'] = "MLP"
df_results.loc[1,'train_score'] = train_score
df_results.loc[1,'test_score'] = test_score
print(df_results)

References
1. fasttext.cc
2. fastText
3. Classification with scikit learn
4. english-vectors
5. How to use pre-trained word vectors from Facebook’s fastText
6. What is the main difference between word2vec and fastText?

How to Convert Word to Vector with GloVe and Python

In the previous post we looked at Vector Representation of Text with word embeddings using word2vec. Another approach that can be used to convert a word to a vector is GloVe – Global Vectors for Word Representation. Per the documentation on the GloVe home page [1], "GloVe is an unsupervised learning algorithm for obtaining vector representations for words. Training is performed on aggregated global word-word co-occurrence statistics from a corpus". Thus we can convert a word to a vector using GloVe.

In this post we will look at how to use a pretrained GloVe data file that can be downloaded from [1]. We will see how to get the word vector representation of a word from this downloaded data file, and also how to get the nearest words. Why do we need a vector representation of text? Because this is what we input to machine learning or data science algorithms – we feed numerical vectors to algorithms such as text classification, machine learning clustering or other text analytics algorithms.

Loading Glove Datafile

The code that I put here is based on some examples that I found on StackOverflow [2].

So first you need to open the file and load data into the model. Then you can get the vector representation and other things.

Below is the full source code for glove python script:

file = "C:\\Users\\glove\\glove.6B.50d.txt"
import numpy as np
def loadGloveModel(gloveFile):
    print ("Loading Glove Model")
   
    
    with open(gloveFile, encoding="utf8" ) as f:
       content = f.readlines()
    model = {}
    for line in content:
        splitLine = line.split()
        word = splitLine[0]
        embedding = np.array([float(val) for val in splitLine[1:]])
        model[word] = embedding
    print ("Done.",len(model)," words loaded!")
    return model
    
    
model= loadGloveModel(file)   

print (model['hello'])

"""
Below is the output of the above code
Loading Glove Model
Done. 400000  words loaded!
[-0.38497   0.80092   0.064106 -0.28355  -0.026759 -0.34532  -0.64253
 -0.11729  -0.33257   0.55243  -0.087813  0.9035    0.47102   0.56657
  0.6985   -0.35229  -0.86542   0.90573   0.03576  -0.071705 -0.12327
  0.54923   0.47005   0.35572   1.2611   -0.67581  -0.94983   0.68666
  0.3871   -1.3492    0.63512   0.46416  -0.48814   0.83827  -0.9246
 -0.33722   0.53741  -1.0616   -0.081403 -0.67111   0.30923  -0.3923
 -0.55002  -0.68827   0.58049  -0.11626   0.013139 -0.57654   0.048833
  0.67204 ]
"""  

So we got the numerical representation of the word 'hello'.
We can also use pandas to load the GloVe file. Below are functions for loading with pandas and getting the vector information.

import pandas as pd
import csv

words = pd.read_table(file, sep=" ", index_col=0, header=None, quoting=csv.QUOTE_NONE)


def vec(w):
  return words.loc[w].as_matrix()   # note: on newer pandas use .to_numpy() instead of .as_matrix()
 

print (vec('hello'))    #this will print same as print (model['hello'])  before
 

Finding Closest Word or Words

Now how do we find the closest word to the word "table"? We iterate through the pandas dataframe, find the deltas and then use the numpy argmin function.
The closest word to some word will always be this word itself (as the delta is 0), so I needed to drop the word 'table' and also the next closest word 'tables'. The final output for the closest word was "place".

words = words.drop("table", axis=0)  
words = words.drop("tables", axis=0)  

words_matrix = words.as_matrix()   # .to_numpy() on newer pandas

def find_closest_word(v):
  diff = words_matrix - v
  delta = np.sum(diff * diff, axis=1)
  i = np.argmin(delta)
  return words.iloc[i].name 


print (find_closest_word(model['table']))
#output:  place

#If we want retrieve more than one closest words here is the function:

def find_N_closest_word(v, N, words):
  Nwords=[]  
  for w in range(N):  
     diff = words.as_matrix() - v
     delta = np.sum(diff * diff, axis=1)
     i = np.argmin(delta)
     Nwords.append(words.iloc[i].name)
     words = words.drop(words.iloc[i].name, axis=0)
    
  return Nwords
  
  
print (find_N_closest_word(model['table'], 10, words)) 

#Output:
#['table', 'tables', 'place', 'sit', 'set', 'hold', 'setting', 'here', 'placing', 'bottom']

We can also use the gensim word2vec library functionality after we convert the GloVe file.

from gensim.scripts.glove2word2vec import glove2word2vec
glove2word2vec(glove_input_file=file, word2vec_output_file="gensim_glove_vectors.txt")

###Finally, read the word2vec txt to a gensim model using KeyedVectors:

from gensim.models.keyedvectors import KeyedVectors
glove_model = KeyedVectors.load_word2vec_format("gensim_glove_vectors.txt", binary=False)
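Once the converted file is loaded as a KeyedVectors model, the usual gensim queries are available. A short usage example:

#query the GloVe vectors through the gensim KeyedVectors interface
print (glove_model['table'][:10])           #first 10 components of the vector
print (glove_model.most_similar('table', topn=5))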

Difference between word2vec and GloVe

Both models learn geometrical encodings (vectors) of words from their co-occurrence information. They differ in how they learn this information: word2vec uses a "predictive" model (a feed-forward neural network), whereas GloVe uses a "count-based" model (dimensionality reduction on the co-occurrence counts matrix). [3]

I hope you enjoyed reading this post about how to convert word to vector with GloVe and python. If you have any tips or anything else to add, please leave a comment below.

References
1. GloVe: Global Vectors for Word Representation
2. Load pretrained glove vectors in python
3. How is GloVe different from word2vec
4. Don't count, predict! A systematic comparison of context-counting vs. context-predicting semantic vectors

5. Words Embeddings