Document Similarity in Machine Learning Text Analysis with ELMo

In this post we will look at using ELMo to compute similarity between text documents. ELMo is one of the widely used word embedding techniques. In the previous post we used TF-IDF to calculate text document similarity. TF-IDF is based on word frequency counts, whereas ELMo produces contextual embeddings from a pretrained language model. Both techniques convert text to numbers for use in information retrieval and machine learning algorithms.


A good tutorial that explains how ELMo works and how it is built is Deep Contextualized Word Representations with ELMo
Another resource is at ELMo

We will, however, focus on the practical side of computing similarity between text documents with ELMo. Below is the code to accomplish this task. To compute ELMo embeddings I used the function from the Analytics Vidhya machine learning post at learn-to-use-elmo-to-extract-features-from-text/

We will use the cosine_similarity function from sklearn to calculate similarity between numeric vectors. It computes cosine similarity between samples in X and Y as the normalized dot product of X and Y.
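
To make "normalized dot product" concrete, here is a minimal sketch that computes cosine similarity by hand with NumPy and checks it against sklearn. The two vectors are made-up illustrative values, not ELMo output.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Two small made-up vectors (illustrative values only)
X = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0]])

# Manual version: scale each row to unit length, then take dot products
X_normalized = X / np.linalg.norm(X, axis=1, keepdims=True)
manual = X_normalized @ X_normalized.T

# sklearn version of the same computation
from_sklearn = cosine_similarity(X, X)

print(manual)
print(from_sklearn)
```

Both matrices should agree up to floating point rounding, with values close to 1 on the diagonal because every vector is identical to itself.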

# -*- coding: utf-8 -*-

from sklearn.metrics.pairwise import cosine_similarity

import tensorflow_hub as hub
import tensorflow as tf

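# Written for TensorFlow 1.x (hub.Module, tf.Session).
# The module handle is empty here; pass the ELMo TF Hub URL or a local module path.
elmo = hub.Module("", trainable=True)

def elmo_vectors(x):
  # "elmo" output holds one vector per token for every sentence in x
  embeddings = elmo(x, signature="default", as_dict=True)["elmo"]
  with tf.Session() as sess:
    # initialize module variables and lookup tables before running the graph
    sess.run(tf.global_variables_initializer())
    sess.run(tf.tables_initializer())
    # return average of ELMo features
    return sess.run(tf.reduce_mean(embeddings, 1))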
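
The averaging step turns a variable number of per-token vectors into one fixed-size vector per sentence. The sketch below shows the idea with NumPy on a made-up array. The shape and values are assumptions for illustration only; real ELMo token vectors have 1024 features.

```python
import numpy as np

# Hypothetical stand-in for per-token embeddings:
# 1 sentence, 4 tokens, 3 features per token (made-up values)
token_vectors = np.array([[[1.0, 0.0, 2.0],
                           [3.0, 2.0, 0.0],
                           [5.0, 4.0, 4.0],
                           [7.0, 6.0, 2.0]]])

# Average over the token axis (axis 1) to get one vector per sentence
sentence_vector = token_vectors.mean(axis=1)

print(sentence_vector.shape)  # (1, 3)
print(sentence_vector)        # [[4. 3. 2.]]
```

Averaging discards word order within the sentence vector, but it gives every document the same dimensionality, which is what cosine similarity needs.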

Our input data will be the same as in the previous post on TF-IDF: a collection of sentences stored as an array. Each document here is represented by just one sentence.

corpus = ["I'd like an apple juice",
          "An apple a day keeps the doctor away",
          "Eat apple every day",
          "We buy apples every week",
          "We use machine learning for text classification",
          "Text classification is subfield of machine learning"]

Below we compute the ELMo embedding for each document and collect the results into a matrix for the whole collection. If we print elmo_embeddings for i=0, we get the embedding vector [ 0.02739557 -0.1004054 0.12195794 … -0.06023929 0.19663551 0.3809018 ], which is the numeric representation of the first document.

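elmo_embeddings = []
print(len(corpus))
for i in range(len(corpus)):
    print(corpus[i])
    # embed one document at a time; [0] takes the single vector from the batch of one
    elmo_embeddings.append(elmo_vectors([corpus[i]])[0])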
Finally, we can print the embeddings and the similarity matrix:

print(elmo_embeddings)
print(cosine_similarity(elmo_embeddings, elmo_embeddings))

[array([ 0.02739557, -0.1004054 ,  0.12195794, ..., -0.06023929,
        0.19663551,  0.3809018 ], dtype=float32), array([ 0.08833811, -0.21392687, -0.0938901 , ..., -0.04924499,
        0.08270906,  0.25595033], dtype=float32), array([ 0.45237526, -0.00928468,  0.5245862 , ...,  0.00988374,
       -0.03330074,  0.25460464], dtype=float32), array([-0.14745474, -0.25623208,  0.20231596, ..., -0.11443609,
       -0.03759   ,  0.18829307], dtype=float32), array([-0.44559947, -0.1429281 , -0.32497618, ...,  0.01917108,
       -0.29726124, -0.02022664], dtype=float32), array([-0.2502797 ,  0.09800234, -0.1026585 , ..., -0.22239089,
        0.2981896 ,  0.00978719], dtype=float32)]

The similarity matrix is computed as:
[[0.9999998  0.609864   0.574287   0.53863835 0.39638174 0.35737067]
 [0.609864   0.99999976 0.6036072  0.5824003  0.39648792 0.39825168]
 [0.574287   0.6036072  0.9999998  0.7760986  0.3858403  0.33461633]
 [0.53863835 0.5824003  0.7760986  0.9999995  0.4922789  0.35490626]
 [0.39638174 0.39648792 0.3858403  0.4922789  0.99999976 0.73076516]
 [0.35737067 0.39825168 0.33461633 0.35490626 0.73076516 1.0000002 ]]
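
One way to read this matrix is to find, for each document, the other document with the highest score. The sketch below does that with NumPy, using the values printed above. The variable names are illustrative.

```python
import numpy as np

# Similarity matrix copied from the output above
similarity = np.array([
    [0.9999998,  0.609864,   0.574287,   0.53863835, 0.39638174, 0.35737067],
    [0.609864,   0.99999976, 0.6036072,  0.5824003,  0.39648792, 0.39825168],
    [0.574287,   0.6036072,  0.9999998,  0.7760986,  0.3858403,  0.33461633],
    [0.53863835, 0.5824003,  0.7760986,  0.9999995,  0.4922789,  0.35490626],
    [0.39638174, 0.39648792, 0.3858403,  0.4922789,  0.99999976, 0.73076516],
    [0.35737067, 0.39825168, 0.33461633, 0.35490626, 0.73076516, 1.0000002],
])

# Mask the diagonal so a document is never matched with itself
masked = similarity.copy()
np.fill_diagonal(masked, -np.inf)

# Index of the most similar other document for each row
nearest = masked.argmax(axis=1)
for doc, neighbor in enumerate(nearest):
    print(doc, "->", neighbor, round(float(similarity[doc, neighbor]), 3))
```

With these values, every apple sentence is paired with another apple sentence, and the two machine learning sentences are paired with each other.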

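Now we can compare this similarity matrix with the matrix obtained with TF-IDF in the previous post. The two differ, which is expected because TF-IDF scores depend on shared words while ELMo embeddings are dense contextual vectors. In the ELMo matrix above, the four apple sentences score between 0.54 and 0.78 with each other, the two machine learning sentences score 0.73, and pairs across the two groups score between 0.33 and 0.49.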
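
For a side-by-side look, a TF-IDF similarity matrix for the same corpus can be sketched as follows. This uses sklearn's TfidfVectorizer with default settings, which is an assumption; the previous post may have used different options, so the numbers need not match it exactly.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

corpus = ["I'd like an apple juice",
          "An apple a day keeps the doctor away",
          "Eat apple every day",
          "We buy apples every week",
          "We use machine learning for text classification",
          "Text classification is subfield of machine learning"]

# Default TfidfVectorizer settings (an assumption, see the note above)
tfidf_matrix = TfidfVectorizer().fit_transform(corpus)

# Pairwise cosine similarity between the TF-IDF rows
tfidf_similarity = cosine_similarity(tfidf_matrix, tfidf_matrix)

print(tfidf_similarity.round(2))
```

Sentences that share no words, such as the first and the fifth, get a TF-IDF similarity of zero, whereas the ELMo matrix assigns every pair a nonzero score.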

Thus, we calculated similarity between text documents using ELMo. This post and the previous post on using TF-IDF for the same task are good machine learning exercises, because converting text to numbers and measuring document similarity are used in many algorithms in information retrieval, data science, and machine learning.
