Document Similarity, Tokenization and Word Vectors in Python with spaCy

Calculating document similarity is a very frequent task in information retrieval and text mining. Years ago, we would need to build a document-term matrix (or term-document matrix) that records how often each term occurs in a collection of documents, and then do vector math on it to find similarity. Now, with spaCy, it can be done in just a few lines. Below you will find how to get document similarity, tokenization and word vectors with spaCy.
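To make the older approach concrete, here is a minimal sketch of a document-term matrix and cosine similarity in plain Python. The function names (term_counts, build_document_term_matrix, cosine_similarity) and the whitespace-based splitting are my own assumptions for illustration, not part of spaCy or any other library.

```python
import math
from collections import Counter


def term_counts(text):
    """Lowercase the text, split on whitespace, and count each term."""
    return Counter(text.lower().split())


def build_document_term_matrix(texts):
    """Return (vocabulary, matrix); each matrix row holds one document's term counts."""
    counts = [term_counts(text) for text in texts]
    vocabulary = sorted(set().union(*counts))
    matrix = [[c[term] for term in vocabulary] for c in counts]
    return vocabulary, matrix


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length count vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


texts = [
    "Hello this is document similarity calculation",
    "Hello this is python similarity calculation",
    "Hi there",
]
vocabulary, matrix = build_document_term_matrix(texts)
print(vocabulary)
print(cosine_similarity(matrix[0], matrix[1]))
print(cosine_similarity(matrix[0], matrix[2]))
```

Because this version only counts exact word matches, two sentences with no words in common always score 0, however related their meaning is. Word vectors are meant to soften exactly that limitation.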

spaCy is an open-source library designed to help you build NLP applications. It has a lot of features; in this post we will look at only a few, but very useful ones.

Document Similarity

Here is how to get document similarity:

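import spacy

# 'en' is a shortcut name from older spaCy releases; spaCy 3 and later
# expect an installed package name such as 'en_core_web_sm' instead.
nlp = spacy.load('en')

doc1 = nlp(u'Hello this is document similarity calculation')
doc2 = nlp(u'Hello this is python similarity calculation')
doc3 = nlp(u'Hi there')

# similarity() returns a float; higher means more similar
print(doc1.similarity(doc2))
print(doc2.similarity(doc3))
print(doc1.similarity(doc3))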

Output:
0.94
0.33
0.30

In more realistic situations we would load documents from files and would have longer text. Here is the experiment that I performed. I saved 3 articles from different random sites, two about deep learning and one about feature engineering.

def get_file_contents(filename):
    """Read a text file and return its contents as a single string."""
    with open(filename, 'r') as filehandle:
        return filehandle.read()

fn1 = "deep_learning1.txt"
fn2 = "feature_eng.txt"
fn3 = "deep_learning.txt"

# Load each article and print it to check that it was read correctly
fn1_doc = get_file_contents(fn1)
print(fn1_doc)

fn2_doc = get_file_contents(fn2)
print(fn2_doc)

fn3_doc = get_file_contents(fn3)
print(fn3_doc)

# Run the spaCy pipeline on the full text of each article
doc1 = nlp(fn1_doc)
doc2 = nlp(fn2_doc)
doc3 = nlp(fn3_doc)

print("dl1 - features")
print(doc1.similarity(doc2))
print("feature - dl")
print(doc2.similarity(doc3))
print("dl1 - dl")
print(doc1.similarity(doc3))
 
"""
output:
dl1 - features
0.9700237040142454
feature - dl
0.9656364096761337
dl1 - dl
0.9547075478662724
"""


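In this run all three scores are high and close together. The two deep learning articles (dl1 - dl, 0.955) actually scored slightly lower than either pairing with the feature engineering article (0.970 and 0.966), so for long texts on related subjects the score alone did not separate the topics.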
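With more than three documents, writing each comparison by hand gets tedious. Below is a small sketch that scores every pair of labelled documents with a pluggable similarity function. The helper names pairwise_similarity and jaccard, and the three short sample texts, are hypothetical and made up for this example; jaccard is only a stand-in so that the snippet runs without a spaCy model.

```python
from itertools import combinations


def pairwise_similarity(docs, similarity):
    """Score every unordered pair of labelled documents.

    docs: dict mapping a label to a document object
    similarity: function that takes two documents and returns a float
    """
    return {
        (name_a, name_b): similarity(docs[name_a], docs[name_b])
        for name_a, name_b in combinations(docs, 2)
    }


def jaccard(text_a, text_b):
    """Stand-in similarity for this sketch: shared words divided by all words."""
    words_a = set(text_a.lower().split())
    words_b = set(text_b.lower().split())
    if not words_a and not words_b:
        return 0.0
    return len(words_a & words_b) / len(words_a | words_b)


# Hypothetical sample texts, not the articles used in the experiment above
docs = {
    "dl1": "deep learning uses neural networks with many layers",
    "features": "feature engineering creates input variables for models",
    "dl": "neural networks power deep learning models",
}

scores = pairwise_similarity(docs, jaccard)
# Print the pairs from most to least similar
for (name_a, name_b), score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name_a} - {name_b}: {score:.3f}")
```

To use it with spaCy, the dictionary values would be the objects returned by nlp(...), and the similarity argument could be lambda a, b: a.similarity(b).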

Tokenization

Another very useful and simple feature of spaCy is tokenization. Here is how easy it is to convert text into tokens (words):

for token in doc1:
    print(token.text)
    print(token.vector)

Word Vectors

spaCy has integrated support for word vectors, while other libraries such as NLTK do not. The lines below print word embeddings; in my environment each one is an array of 768 numbers.

 
print(token.vector)    # word vector of token (the last token from the loop above)
print(doc1[0].vector)  # word vector of the first token of the document
print(doc1.vector)     # mean vector for doc1
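To show what a mean vector is, here is a toy sketch that averages token vectors into a document vector and compares the results with cosine similarity. The 3-dimensional vectors and the helper names mean_vector and cosine_similarity are invented for illustration; real spaCy vectors are much longer, and the exact values depend on the model you load.

```python
import math


def mean_vector(vectors):
    """Element-wise average of a list of equal-length vectors."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical 3-dimensional token vectors, made up for this example
toy_vectors = {
    "deep": [0.9, 0.1, 0.0],
    "learning": [0.8, 0.2, 0.1],
    "feature": [0.1, 0.9, 0.2],
    "engineering": [0.2, 0.8, 0.1],
}

doc_a = mean_vector([toy_vectors["deep"], toy_vectors["learning"]])
doc_b = mean_vector([toy_vectors["feature"], toy_vectors["engineering"]])
doc_c = mean_vector([toy_vectors["learning"], toy_vectors["deep"]])  # same words, reversed

print(doc_a)
print(cosine_similarity(doc_a, doc_b))
print(cosine_similarity(doc_a, doc_c))
```

Since averaging ignores word order, doc_a and doc_c end up with the same vector. That is one reason a mean vector is only a rough summary of a document.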

So we looked at how to use a few features (similarity, tokenization and word embeddings) that are very easy to use with spaCy. I hope you enjoyed this post. If you have any tips or anything else to add, please leave a comment below.


2 thoughts on “Document Similarity, Tokenization and Word Vectors in Python with spaCy”

  1. Hello, I would like to check similarity between two documents. I know your 2nd example deals with it, but I believe the code is incomplete. How do I call get_file_contents in the 2nd example? Can you post full-fledged working code?

    • Hi Pratik,
      Yes, you are correct, I missed one step. I updated the code above and inserted the step with the function get_file_contents. Basically you just do this:
      fn1="deep_learning1.txt"
      fn2="feature_eng.txt"
      fn1_doc=get_file_contents(fn1)
      fn2_doc=get_file_contents(fn2)

      doc1 = nlp(fn1_doc)
      doc2 = nlp(fn2_doc)

      print (doc1.similarity(doc2))
      Thanks for catching this and best regards.
