Automatic Text Summarization Online

In the previous post, Automatic Text Summarization with Python, I showed how to use different Python libraries for text summarization. Recently I added text summarization modules to the Online Machine Learning Algorithms site. So now you can play with text summarization modules online and select the best summary generator. This service is a free tool that lets you run some algorithms without coding or installing software modules.

Below are the steps for using the online text summarizer models of the Machine Learning Algorithms tool.

How to use online text summarizer algorithms

1. Open the Online Machine Learning Algorithms tool and select the text summarization algorithm that you want to run. One algorithm is available from the gensim module and three from the sumy Python module; we will use the Luhn text summarizer. The algorithms from the gensim and sumy modules are still widely used in automatic text summarization, which is part of the field of natural language processing. (A minimal sketch of running the same Luhn summarizer locally with sumy is shown after these steps.)

Running online text summarization step 1

2. Input the data that you want to run, or click Load Default Values. Note that you need to enter at least about 10 sentences; it will not work if you enter just a few words or a single sentence.

Running online text summarization step 2

3. Click Run now.

4. Click the View Run Results link.

Running online text summarization – example of output

5. Click the Refresh Page button on the new page; you may need to click it a few times until the output shows up. It usually takes less than a minute, but this depends on how much data needs to be processed. Scroll to the bottom of the page to see the results.

If you try the other text summarizers in this online tool, you will see some differences in the generated summaries.
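If you prefer to run the same algorithm locally, below is a minimal sketch of the Luhn summarizer using the sumy module; the number of output sentences and the tokenizer language are my assumptions, since the tool's exact settings are not published.

from sumy.parsers.plaintext import PlaintextParser
from sumy.nlp.tokenizers import Tokenizer
from sumy.summarizers.luhn import LuhnSummarizer

# Paste at least about 10 sentences here, as required by the online tool.
text = "..."

# Parse the plain text and run the Luhn algorithm on it.
parser = PlaintextParser.from_string(text, Tokenizer("english"))
summarizer = LuhnSummarizer()

# Keep the 3 most significant sentences (an assumed setting).
for sentence in summarizer(parser.document, 3):
    print(sentence)

The gensim-based option likely corresponds to gensim's TextRank-style summarize() function from gensim.summarization; that function was removed in gensim 4.0, so an older gensim version is needed to reproduce it locally.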

End Notes

In this post, we covered how to use the online text summarizer models of the Machine Learning Algorithms tool. You can run online algorithms from the gensim and sumy Python modules.
Feel free to provide comments or suggestions.

Document Similarity, Tokenization and Word Vectors in Python with spaCy

Calculating document similarity is a very frequent task in information retrieval and text mining. Years ago, we would need to build a document-term matrix (describing the frequency of terms occurring in a collection of documents) and then do word vector math to find similarity. Now, with spaCy, it can be done in just a few lines. Below you will find how to get document similarity, tokenization and word vectors with spaCy.
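For comparison, here is a minimal sketch of that classic document-term matrix approach, assuming scikit-learn is available (this is separate from spaCy):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = ["Hello this is document similarity calculation",
        "Hello this is python similarity calculation"]

# Build the document-term matrix: rows are documents, columns are terms.
dtm = TfidfVectorizer().fit_transform(docs)

# Cosine similarity between the two document vectors.
print(cosine_similarity(dtm[0], dtm[1]))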

spaCy is an open-source library designed to help you build NLP applications. It has a lot of features; in this post we will look at only a few, but very useful, ones.

Document Similarity

Here is how to get document similarity:

import spacy

# 'en' was the spaCy 2.x model shortcut; newer spaCy versions need a full
# model name with word vectors, e.g. spacy.load('en_core_web_md').
nlp = spacy.load('en')

doc1 = nlp(u'Hello this is document similarity calculation')
doc2 = nlp(u'Hello this is python similarity calculation')
doc3 = nlp(u'Hi there')

# similarity() returns the cosine similarity of the averaged word vectors.
print(doc1.similarity(doc2))
print(doc2.similarity(doc3))
print(doc1.similarity(doc3))

Output:
0.94
0.33
0.30

In more realistic situations, we would load documents from files and would have longer texts. Here is an experiment that I performed: I saved three articles from different random sites, two about deep learning and one about feature engineering.

def get_file_contents(filename):
    # Read the whole file into a single string.
    with open(filename, 'r') as filehandle:
        return filehandle.read()

fn1 = "deep_learning1.txt"
fn2 = "feature_eng.txt"
fn3 = "deep_learning.txt"

fn1_doc = get_file_contents(fn1)
print(fn1_doc)

fn2_doc = get_file_contents(fn2)
print(fn2_doc)

fn3_doc = get_file_contents(fn3)
print(fn3_doc)

doc1 = nlp(fn1_doc)
doc2 = nlp(fn2_doc)
doc3 = nlp(fn3_doc)

print("dl1 - features")
print(doc1.similarity(doc2))
print("feature - dl")
print(doc2.similarity(doc3))
print("dl1 - dl")
print(doc1.similarity(doc3))
 
"""
output:
dl1 - features
0.9700237040142454
feature - dl
0.9656364096761337
dl1 - dl
0.9547075478662724
"""


Interestingly, all three scores are high and very close, and the pair of deep learning articles did not actually receive the highest score. Averaged word vectors tend to blur topic differences on long documents, so small gaps between such scores should not be over-interpreted.

Tokenization

Another very useful and simple feature of spaCy is tokenization. Here is how easy it is to convert text into tokens (words):

# Iterate over the tokens of a document and print each token's text.
for token in doc1:
    print(token.text)
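Tokens carry more than just their surface text. For example, the following standard spaCy token attributes can be printed for a made-up sentence:

# Surface form, lemma, part-of-speech tag and stop-word flag.
for token in nlp(u'spaCy splits text into tokens'):
    print(token.text, token.lemma_, token.pos_, token.is_stop)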

Word Vectors

spaCy has integrated word vector support, while some other libraries, like NLTK, do not have it. The lines below print word embeddings – an array of 768 numbers on my environment (the dimension depends on the loaded model).

 
print(token.vector)    # prints the word vector of a token
print(doc1[0].vector)  # prints the word vector of the first token of the document
print(doc1.vector)     # prints the mean vector for doc1
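To see what similarity() computes with these vectors, here is a minimal sketch of the underlying cosine similarity math; the cosine() helper below is my own, not a spaCy API, and assumes numpy is installed.

import numpy as np

def cosine(v1, v2):
    # Cosine similarity: dot product divided by the product of the norms.
    return np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))

# Should closely match doc1.similarity(doc2) from above.
print(cosine(doc1.vector, doc2.vector))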

So we looked at how to use a few features (similarity, tokenization and word embeddings) which are very easy to implement with spaCy. I hope you enjoyed this post. If you have any tips or anything else to add, please leave a comment below.

References
1. spaCy
2. Word Embeddings in Python with Spacy and Gensim