Document Similarity in Machine Learning Text Analysis with ELMo

In this post we will look at using ELMo to compute similarity between text documents. ELMo is one of the word embedding techniques that is widely used now. In the previous post we used TF-IDF to calculate similarity between text documents; TF-IDF is based on word frequency counting. Both techniques can be used to convert text to numbers for information retrieval and machine learning algorithms.

ELMo

A good tutorial that explains how ELMo works and how it is built is Deep Contextualized Word Representations with ELMo.
Another helpful resource is ELMo.

We will, however, focus on the practical side of computing similarity between text documents with ELMo. Below is the code to accomplish this task. To compute ELMo embeddings I used the function from the Analytics Vidhya machine learning post at learn-to-use-elmo-to-extract-features-from-text/

We will use the cosine_similarity function from sklearn to calculate similarity between numeric vectors. It computes cosine similarity between samples in X and Y as the normalized dot product of X and Y.
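As a quick sanity check, here is a minimal sketch with made-up toy vectors (not part of the ELMo pipeline): cosine_similarity returns 1.0 for vectors pointing in the same direction and values near 0 for unrelated ones.

from sklearn.metrics.pairwise import cosine_similarity

# Toy vectors chosen only to illustrate the behaviour of cosine_similarity
a = [[1.0, 2.0, 3.0]]
b = [[2.0, 4.0, 6.0]]   # same direction as a
c = [[3.0, 0.0, -1.0]]  # orthogonal to a

print(cosine_similarity(a, b))  # [[1.]]
print(cosine_similarity(a, c))  # [[0.]]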

# -*- coding: utf-8 -*-

from sklearn.metrics.pairwise import cosine_similarity

import tensorflow_hub as hub
import tensorflow as tf

# Load the pretrained ELMo module from TensorFlow Hub (TensorFlow 1.x API)
elmo = hub.Module("https://tfhub.dev/google/elmo/2", trainable=True)


def elmo_vectors(x):
    # Get the word-level ELMo embeddings for a batch of strings
    embeddings = elmo(x, signature="default", as_dict=True)["elmo"]

    with tf.Session() as sess:
        sess.run(tf.global_variables_initializer())
        sess.run(tf.tables_initializer())
        # Average the word vectors over the sequence axis to get one vector per document
        return sess.run(tf.reduce_mean(embeddings, 1))

Our data input will be the same as in the previous post for TF-IDF: a collection of sentences as an array. So each document here is represented by just one sentence.

corpus=["I'd like an apple juice",
                            "An apple a day keeps the doctor away",
                             "Eat apple every day",
                             "We buy apples every week",
                             "We use machine learning for text classification",
                             "Text classification is subfield of machine learning"]

Below we compute the ELMo embedding for each document and build a matrix for the whole collection. If we print elmo_embeddings for i=0 we get the word embedding vector [ 0.02739557 -0.1004054 0.12195794 … -0.06023929 0.19663551 0.3809018 ], which is the numeric representation of the first document.

elmo_embeddings = []
print(len(corpus))
for i in range(len(corpus)):
    print(corpus[i])
    # elmo_vectors expects a list of strings; take the single vector it returns
    elmo_embeddings.append(elmo_vectors([corpus[i]])[0])
   

Finally we can print the embeddings and the similarity matrix:

print ( elmo_embeddings)
print(cosine_similarity(elmo_embeddings, elmo_embeddings))



[array([ 0.02739557, -0.1004054 ,  0.12195794, ..., -0.06023929,
        0.19663551,  0.3809018 ], dtype=float32), array([ 0.08833811, -0.21392687, -0.0938901 , ..., -0.04924499,
        0.08270906,  0.25595033], dtype=float32), array([ 0.45237526, -0.00928468,  0.5245862 , ...,  0.00988374,
       -0.03330074,  0.25460464], dtype=float32), array([-0.14745474, -0.25623208,  0.20231596, ..., -0.11443609,
       -0.03759   ,  0.18829307], dtype=float32), array([-0.44559947, -0.1429281 , -0.32497618, ...,  0.01917108,
       -0.29726124, -0.02022664], dtype=float32), array([-0.2502797 ,  0.09800234, -0.1026585 , ..., -0.22239089,
        0.2981896 ,  0.00978719], dtype=float32)]



The similarity matrix is computed as:
[[0.9999998  0.609864   0.574287   0.53863835 0.39638174 0.35737067]
 [0.609864   0.99999976 0.6036072  0.5824003  0.39648792 0.39825168]
 [0.574287   0.6036072  0.9999998  0.7760986  0.3858403  0.33461633]
 [0.53863835 0.5824003  0.7760986  0.9999995  0.4922789  0.35490626]
 [0.39638174 0.39648792 0.3858403  0.4922789  0.99999976 0.73076516]
 [0.35737067 0.39825168 0.33461633 0.35490626 0.73076516 1.0000002 ]]

Now we can compare this similarity matrix with the matrix obtained with TF-IDF in the previous post. Obviously they are different.

Thus, we calculated similarity between text documents using ELMo. This post and the previous post about using TF-IDF for the same task are good machine learning exercises, because converting text to numbers and computing document similarity appear in many algorithms of information retrieval, data science and machine learning.

FastText Word Embeddings for Text Classification with MLP and Python

Word embeddings are now widely used in many text applications and natural language processing models. In previous posts I showed examples of how to use word embeddings from Google's word2vec and GloVe models for different tasks, including machine learning clustering:

GloVe – How to Convert Word to Vector with GloVe and Python

word2vec – Vector Representation of Text – Word Embeddings with word2vec

word2vec application – K Means Clustering Example with Word2Vec in Data Mining or Machine Learning

In this post we will look at fastText word embeddings in machine learning. You will learn how to load pretrained fastText vectors, get text embeddings and do text classification. As stated on the fastText site, text classification is a core problem in many applications, such as spam detection, sentiment analysis or smart replies. [1]

What is fastText

fastText is an open-source, free, lightweight library that allows users to learn text representations and text classifiers. [1]

fastText was created by Facebook's AI Research (FAIR) lab. The model is an unsupervised learning algorithm for obtaining vector representations for words. Facebook makes pretrained models available for 294 languages. [2]

As per Quora [6], fastText treats each word as composed of character n-grams, so the vector for a word is the sum of the vectors of these character n-grams. word2vec (and GloVe) treat the word as the smallest unit to train on. This means that fastText can generate better word embeddings for rare words. fastText can also generate word embeddings for out-of-vocabulary words, which word2vec and GloVe cannot do.
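To make the n-gram idea concrete, here is a small illustrative sketch in plain Python (not the fastText library itself). fastText wraps each word in < and > boundary symbols and by default uses character n-grams of length 3 to 6, plus the whole word:

def char_ngrams(word, n_min=3, n_max=6):
    w = "<" + word + ">"   # fastText adds boundary symbols around the word
    return [w[i:i + n] for n in range(n_min, n_max + 1)
                       for i in range(len(w) - n + 1)]

print(char_ngrams("apple", 3, 4))
# ['<ap', 'app', 'ppl', 'ple', 'le>', '<app', 'appl', 'pple', 'ple>']

The vector for "apple" is then built from the vectors of these subword units, which is why even a rare or unseen word can still get a meaningful embedding.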

Word Embeddings File

I downloaded the wiki file wiki-news-300d-1M.vec from here [4], but there are other links where you can download different data files. I found this one has a smaller size, so it is easy to work with.

Basic Operations with fastText Word Embeddings

To get the most similar words to a given word:

from gensim.models import KeyedVectors
model = KeyedVectors.load_word2vec_format('wiki-news-300d-1M.vec')
print (model.most_similar('desk'))

"""
[('desks', 0.7923153638839722), ('Desk', 0.6869951486587524), ('desk.', 0.6602819561958313), ('desk-', 0.6187258958816528), ('credenza', 0.5955315828323364), ('roll-top', 0.5875717401504517), ('rolltop', 0.5837830305099487), ('bookshelf', 0.5758029222488403), ('Desks', 0.5755287408828735), ('sofa', 0.5617446899414062)]
"""

Load words in vocabulary:

words = []
for word in model.vocab:
    words.append(word)

To see embeddings:

print("Vector components of a word: {}".format(
    model[words[0]]
))

"""
Vector components of a word: [-0.0451  0.0052  0.0776 -0.028   0.0289  0.0449  0.0117 -0.0333  0.1055
 .......................................
 -0.1368 -0.0058 -0.0713]
"""

The Problem

So here we will use fastText word embeddings for text classification of sentences. For this classification we will use the sklearn Multi-layer Perceptron classifier (MLP).
The sentences are prepared and inserted into the script:

sentences = [['this', 'is', 'the', 'good', 'machine', 'learning', 'book'],
             ['this', 'is', 'another', 'machine', 'learning', 'book'],
             ['one', 'more', 'new', 'book'],
             ['this', 'is', 'about', 'machine', 'learning', 'post'],
             ['orange', 'juice', 'is', 'the', 'liquid', 'extract', 'of', 'fruit'],
             ['orange', 'juice', 'comes', 'in', 'several', 'different', 'varieties'],
             ['this', 'is', 'the', 'last', 'machine', 'learning', 'book'],
             ['orange', 'juice', 'comes', 'in', 'several', 'different', 'packages'],
             ['orange', 'juice', 'is', 'liquid', 'extract', 'from', 'fruit', 'on', 'orange', 'tree']]

The sentences belong to two classes; the class labels will be assigned later as 0 and 1. So our problem is to classify the above sentences. Below is the flowchart of the program that we will use for this perceptron learning example.

Figure: Text classification using word embeddings

Data Preparation

I converted this text input into numeric vectors using the following code. Basically I got the word embeddings and averaged all words in each sentence. The resulting sentence vectors were saved to the array V.

import numpy as np

def sent_vectorizer(sent, model):
    # Average the fastText vectors of all words in the sentence
    sent_vec = []
    numw = 0
    for w in sent:
        try:
            if numw == 0:
                sent_vec = model[w]
            else:
                sent_vec = np.add(sent_vec, model[w])
            numw += 1
        except:
            # skip words that are not in the embedding vocabulary
            pass

    return np.asarray(sent_vec) / numw


V=[]
for sentence in sentences:
    V.append(sent_vectorizer(sentence, model))   

After converting the text into vectors we can divide the data into training and testing datasets and attach the class labels.

X_train = V[0:6]
X_test = V[6:9] 
          
Y_train = [0, 0, 0, 0, 1,1]
Y_test =  [0,1,1]   

Text Classification

Now it is time to load the data into the MLP classifier and do the text classification.

from sklearn.neural_network import MLPClassifier
import pandas as pd

classifier = MLPClassifier(alpha=0.7, max_iter=400)
classifier.fit(X_train, Y_train)

df_results = pd.DataFrame(data=np.zeros(shape=(1,3)), columns = ['classifier', 'train_score', 'test_score'] )
train_score = classifier.score(X_train, Y_train)
test_score = classifier.score(X_test, Y_test)

print  (classifier.predict_proba(X_test))
print  (classifier.predict(X_test))

df_results.loc[1,'classifier'] = "MLP"
df_results.loc[1,'train_score'] = train_score
df_results.loc[1,'test_score'] = test_score

print(df_results)
     
"""
Output
  classifier  train_score  test_score
         MLP          1.0         1.0
"""

In this post we learned how to use pretrained fastText word embeddings to convert text data into vectors. We also looked at how to load word embeddings into a machine learning algorithm. At the end of the post we looked at machine learning text classification using the MLP classifier with our fastText word embeddings. You can find the full Python source code and references below.

from gensim.models import KeyedVectors
import pandas as pd

model = KeyedVectors.load_word2vec_format('wiki-news-300d-1M.vec')
print (model.most_similar('desk'))

words = []
for word in model.vocab:
    words.append(word)

print("Vector components of a word: {}".format(
    model[words[0]]
))
sentences = [['this', 'is', 'the', 'good', 'machine', 'learning', 'book'],
             ['this', 'is', 'another', 'machine', 'learning', 'book'],
             ['one', 'more', 'new', 'book'],
             ['this', 'is', 'about', 'machine', 'learning', 'post'],
             ['orange', 'juice', 'is', 'the', 'liquid', 'extract', 'of', 'fruit'],
             ['orange', 'juice', 'comes', 'in', 'several', 'different', 'varieties'],
             ['this', 'is', 'the', 'last', 'machine', 'learning', 'book'],
             ['orange', 'juice', 'comes', 'in', 'several', 'different', 'packages'],
             ['orange', 'juice', 'is', 'liquid', 'extract', 'from', 'fruit', 'on', 'orange', 'tree']]
         
import numpy as np

def sent_vectorizer(sent, model):
    sent_vec =[]
    numw = 0
    for w in sent:
        try:
            if numw == 0:
                sent_vec = model[w]
            else:
                sent_vec = np.add(sent_vec, model[w])
            numw+=1
        except:
            pass
   
    return np.asarray(sent_vec) / numw

V=[]
for sentence in sentences:
    V.append(sent_vectorizer(sentence, model))   
         
    
X_train = V[0:6]
X_test = V[6:9] 
Y_train = [0, 0, 0, 0, 1,1]
Y_test =  [0,1,1]    
    
    
from sklearn.neural_network import MLPClassifier
classifier = MLPClassifier(alpha = 0.7, max_iter=400) 
classifier.fit(X_train, Y_train)

df_results = pd.DataFrame(data=np.zeros(shape=(1,3)), columns = ['classifier', 'train_score', 'test_score'] )
train_score = classifier.score(X_train, Y_train)
test_score = classifier.score(X_test, Y_test)

print  (classifier.predict_proba(X_test))
print  (classifier.predict(X_test))

df_results.loc[1,'classifier'] = "MLP"
df_results.loc[1,'train_score'] = train_score
df_results.loc[1,'test_score'] = test_score
print(df_results)

References
1. fasttext.cc
2. fastText
3. Classification with scikit learn
4. english-vectors
5. How to use pre-trained word vectors from Facebook’s fastText
6. What is the main difference between word2vec and fastText?

Vector Representation of Text – Word Embeddings with word2vec

Computers cannot understand text. We need to convert text into numerical vectors before any kind of text analysis, such as text clustering or classification. The classical, well-known model is bag of words (BOW). With this model we have one dimension per unique word in the vocabulary, and we represent the document as a vector of 0s and 1s: we use 1 if the word from the vocabulary exists in the document.
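As a tiny illustration of the BOW idea (a sketch with two made-up documents), each document becomes a 0/1 vector over the vocabulary:

docs = ["eat apple every day", "we buy apples every week"]

# One dimension per unique word in the vocabulary
vocab = sorted(set(" ".join(docs).split()))
# ['apple', 'apples', 'buy', 'day', 'eat', 'every', 'we', 'week']

# 1 if the word occurs in the document, 0 otherwise
vectors = [[1 if w in doc.split() else 0 for w in vocab] for doc in docs]
print(vectors)
# [[1, 0, 0, 1, 1, 1, 0, 0], [0, 1, 1, 0, 0, 1, 1, 1]]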

Recently, models with word embeddings have gained popularity in machine learning since they allow us to keep semantic information. With word embeddings we can get lower dimensionality than with the BOW model. There are several such models, for example GloVe and word2vec, that are used in machine learning text analysis.

Many examples on the web show how to operate at the word level with word embedding methods, but in most cases we are working at the document level (sentence, paragraph or document). To understand how this can be used for text analytics, I decided to take word2vec and create a small practical example.

In this post you will learn how to use the word2vec word embedding method for converting a sentence into a numerical vector. The same technique can be used for text with more than one sentence. We will create a Python script that converts sentences into numerical vectors.

Input

For the input to this script we will use sentences hard-coded in the script. The sentences in the script are already tokenized. Below you can find the sentences for our input. Note that sentences 6 and 7 are more distinct from the other sentences.

1 [['this', 'is', 'the', 'good', 'machine', 'learning', 'book'],
2 ['this', 'is',  'another', 'book'],
3 ['one', 'more', 'book'],
4 ['this', 'is', 'the', 'new', 'post'],
5 ['this', 'is', 'about', 'machine', 'learning', 'post'], 
6 ['orange', 'juice', 'is', 'the', 'liquid', 'extract', 'of', 'the', 'fruit'],
7 ['orange', 'juice', 'comes', 'in', 'several', 'different', 'varieties'],
8 ['and', 'this', 'is', 'the', 'last', 'post']]

With word2vec you have two options:
1. Train your own word2vec model
2. Use pretrained vectors from Google
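Both options are available through gensim. Below is a rough sketch, assuming sentences holds the tokenized list shown above; the file path is a placeholder and parameter names differ between gensim versions (for example, size became vector_size in gensim 4):

from gensim.models import Word2Vec, KeyedVectors

# Option 1: train word2vec on our own tokenized sentences
own_model = Word2Vec(sentences, min_count=1, size=100)

# Option 2: load Google's pretrained vectors from a local file (placeholder path)
pretrained = KeyedVectors.load_word2vec_format(
    'GoogleNews-vectors-negative300.bin', binary=True)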

From word to sentence

Each word in a word embedding model is represented by a vector. But let's say we are working with tweets from Twitter and need to know how similar or dissimilar the tweets are. We then need a vector representation of the whole text of a tweet. To achieve this we can average the word embeddings of every word in the sentence (or tweet or paragraph). The idea comes from the paper [1], where the authors averaged word embeddings to get a paragraph vector.
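The averaging step itself is only a few lines. Here is a minimal sketch, assuming model is one of the word2vec models from the listings further down (Listing A and Listing B below do essentially the same thing, with normalization added):

import numpy as np

def average_vector(tokens, model, dim=100):
    # Sum the embeddings of the words the model knows, then divide by their count
    vec = np.zeros(dim)
    count = 0
    for w in tokens:
        if w in model:          # skip out-of-vocabulary words
            vec = vec + model[w]
            count += 1
    return vec / count if count > 0 else vec

# e.g. tweet_vector = average_vector(['orange', 'juice', 'comes', 'in',
#                                     'several', 'different', 'varieties'], model)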

Source code for conversion


Below, in Listing A and Listing B, you can find how we can average word embeddings and get numerical vectors.
Listing A has the Python source code for using your own word embeddings.
Listing B has the Python source code for using word embeddings from Google.
That script takes the embeddings from a local file that was downloaded from Google beforehand. You can find more details on downloading word embeddings from Google in the post Using Pretrained Word Embeddings in Machine Learning.

When averaging embeddings I used only the first 50 dimensions. This is the minimal number that was used in one of the papers; the recommendation is to use between 100 and 400 dimensions.

Analysis of Results

How do we know that our results are good? We will do a quick check as follows. We will calculate the distance (similarity measure) between vectors and compare it with our expectations. If two sentences belong to different contexts, we expect the distance to be larger; if the sentences are close in meaning, the distance will be smaller. Because the context of sentences 6 and 7 is different from the other sentences, we would expect to see this difference in the results.

For calculating distance we use the cosine measure in the script. With the cosine measure, the most similar pair is the one with the highest cosine value. Below are the results.
Note that 0 values mean that the cosine value was not calculated because there is no need to: the value was already calculated (for example doc21 = doc12) or it lies on the diagonal.

Results from Listing A (using our own word embeddings):
 1   2    3    4    5    6    7    8
1[0, 0.5, 0.1, 0.5, 0.6, 0.4, 0.2, 0.4],
2[0, 0,   0.2, 0.6, 0.5, 0.2, 0.1, 0.5],
3[0, 0,   0,   0.0, 0.0, 0.0, 0.1, 0.0],
4[0, 0,   0,   0,   0.6, 0.5, 0.3, 0.7],
5[0, 0,   0,   0,   0,   0.2, 0.2, 0.6],
6[0, 0,   0,   0,   0,   0,   0.4, 0.4], 
7[0, 0,   0,   0,   0,   0,   0,   0.3], 
8[0, 0,   0,   0,   0,   0,   0,   0]


Results from Listing B (using pretrained dataset):
  1  2     3     4     5     6     7     8
1[0, 0.77, 0.33, 0.57, 0.78, 0.35, 0.37, 0.55],
2[0, 0,    0.60, 0.62, 0.51, 0.31, 0.29, 0.59],
3[0, 0,    0,    0.16, 0.12, 0.18, 0.25, 0.11], 
4[0, 0,    0,    0,    0.62, 0.41, 0.37, 0.89],
5[0, 0,    0,    0,    0,    0.35, 0.27, 0.61], 
6[0, 0,    0,    0,    0,    0,    0.81, 0.37], 
7[0, 0,    0,    0,    0,    0,    0,    0.32],
8[0, 0,    0,    0,    0,    0,    0,    0]]

Looking at the results we can see that our expectations are confirmed, especially where the pretrained word embeddings were used. Sentences 6 and 7 have low similarity with the other sentences but have high similarity (0.81) when we compare sentence 6 with sentence 7.

Conclusion

In this post we considered how to represent a document (sentence, paragraph) as a vector of numbers using the word2vec word embedding model. We looked at two possible ways: using our own embeddings and using embeddings from Google. We got results for our small example and were able to evaluate them.

Now we can feed vector representation of text into machine learning text analysis algorithms.

Here are a few posts where you can find how to feed word2vec word embeddings into text clustering algorithms such as k-means from the NLTK and sklearn libraries, and how to plot data with t-SNE:
K Means Clustering Example with Word2Vec in Data Mining or Machine Learning
Text Clustering with Word Embedding in Machine Learning

Below are a few links for other word embedding models that are also widely used:
GloVe –
How to Convert Word to Vector with GloVe and Python
fastText –
FastText Word Embeddings

I hope you enjoyed this post about representing text as vector using word2vec. If you have any tips or anything else to add, please leave a comment in the reply box.

Listing A. Here is the python source code for using own word embeddings

from gensim.models import Word2Vec
sentences = [['this', 'is', 'the', 'good', 'machine', 'learning', 'book'],
             ['this', 'is', 'another', 'book'],
             ['one', 'more', 'book'],
             ['this', 'is', 'the', 'new', 'post'],
             ['this', 'is', 'about', 'machine', 'learning', 'post'],
             ['orange', 'juice', 'is', 'the', 'liquid', 'extract', 'of', 'the', 'fruit'],
             ['orange', 'juice', 'comes', 'in', 'several', 'different', 'varieties'],
             ['and', 'this', 'is', 'the', 'last', 'post']]




model = Word2Vec(sentences, min_count=1, size=100)
vocab = model.vocab.keys()
wordsInVocab = len(vocab)
print (model.similarity('post', 'book'))


import numpy as np

def sent_vectorizer(sent, model):
    # Sum the word vectors of the sentence and normalize the result to unit length
    sent_vec = np.zeros(100)
    numw = 0
    for w in sent:
        try:
            sent_vec = np.add(sent_vec, model[w])
            numw += 1
        except:
            # skip words that are not in the vocabulary
            pass
    return sent_vec / np.sqrt(sent_vec.dot(sent_vec))

V=[]
for sentence in sentences:
    V.append(sent_vectorizer(sentence, model))

from numpy import dot
from numpy.linalg import norm
results = [[0 for i in range(len(V))] for j in range(len(V))] 

for i in range (len(V) - 1):
    for j in range(i+1, len(V)):
           results[i][j] = dot(V[i],V[j])/norm(V[i])/norm(V[j])


print (results)

Listing B. Here is the python source code for using word embeddings from Google.

import gensim
model = gensim.models.Word2Vec.load_word2vec_format('C:\\Users\\Downloads\\GoogleNews-vectors-negative300.bin', binary=True)  

sentences = [['this', 'is', 'the', 'good', 'machine', 'learning', 'book'],
             ['this', 'is', 'another', 'book'],
             ['one', 'more', 'book'],
             ['this', 'is', 'the', 'new', 'post'],
             ['this', 'is', 'about', 'machine', 'learning', 'post'],
             ['orange', 'juice', 'is', 'the', 'liquid', 'extract', 'of', 'the', 'fruit'],
             ['orange', 'juice', 'comes', 'in', 'several', 'different', 'varieties'],
             ['and', 'this', 'is', 'the', 'last', 'post']]


vocab = model.vocab.keys()
wordsInVocab = len(vocab)

import numpy as np

def sent_vectorizer(sent, model):
    # Use only the first 50 dimensions of each word vector,
    # sum them and normalize the result to unit length
    sent_vec = np.zeros(50)
    numw = 0
    for w in sent:
        try:
            vc = model[w]
            vc = vc[0:50]
            sent_vec = np.add(sent_vec, vc)
            numw += 1
        except:
            # skip words that are not in the vocabulary
            pass
    return sent_vec / np.sqrt(sent_vec.dot(sent_vec))

V=[]
for sentence in sentences:
    V.append(sent_vectorizer(sentence, model))
from numpy import dot
from numpy.linalg import norm

results = [[0 for i in range(len(V))] for j in range(len(V))]

# Fill the upper triangle of the matrix with pairwise cosine similarities
for i in range(len(V) - 1):
    for j in range(i + 1, len(V)):
        results[i][j] = dot(V[i], V[j]) / (norm(V[i]) * norm(V[j]))

print(results)

References
1. Document Embedding with Paragraph Vectors