Document Similarity in Machine Learning Text Analysis with TF-IDF

Despite the appearance of new word embedding techniques for converting textual data into numbers, TF-IDF can still be found in many articles and blog posts on information retrieval, user modeling, text classification, text analytics (for example, extracting top terms) and other text mining techniques.

In this post we will look at what TF-IDF is, how to calculate it, how to retrieve the calculated values in different formats, and how to compute the similarity between two text documents using the TF-IDF technique.

tf–idf, short for term frequency–inverse document frequency, is a numerical statistic that is intended to reflect how important a word is to a document in a collection or corpus. The tf–idf value increases proportionally to the number of times a word appears in the document and is offset by the number of documents in the corpus that contain the word, which helps to adjust for the fact that some words appear more frequently in general. [1]
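
As a rough illustration of the idea (not the exact formula scikit-learn uses, which adds smoothing and L2 normalization by default), a minimal tf-idf computation for already tokenized documents might look like this:

import math

def tf_idf(term, doc, docs):
    tf = doc.count(term)                           # how often the term occurs in this document
    df = sum(1 for d in docs if term in d)         # how many documents contain the term
    idf = math.log(len(docs) / df) if df else 0.0  # rare terms get a higher idf
    return tf * idf

docs = [["apple", "juice"], ["apple", "day"], ["machine", "learning"]]
print(tf_idf("apple", docs[0], docs))  # lower weight, "apple" appears in 2 of 3 documents
print(tf_idf("juice", docs[0], docs))  # higher weight, "juice" appears in only 1 document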

Here we will look at how to convert a text corpus of documents to numbers and how to use the above technique for computing document similarity.

We will use sklearn.feature_extraction.text.TfidfVectorizer from the Python scikit-learn library for calculating tf-idf. TfidfVectorizer converts a collection of raw documents to a matrix of TF-IDF features.

We only need to provide the text documents as input; all other parameters are optional and have default values or are set to None. [2]

Here is the list of input parameters from the documentation:

TfidfVectorizer(input='content', encoding='utf-8', decode_error='strict', strip_accents=None, lowercase=True,
preprocessor=None, tokenizer=None, analyzer='word', stop_words=None, token_pattern='(?u)\b\w\w+\b',
ngram_range=(1, 1), max_df=1.0, min_df=1, max_features=None, vocabulary=None, binary=False,
dtype=<class 'numpy.float64'>, norm='l2', use_idf=True, smooth_idf=True, sublinear_tf=False)
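
Most of these defaults can be left alone, but a few are often worth changing. For example (a sketch, not part of the script below), we could remove English stop words, include bigrams as features and ignore terms that appear in only one document:

from sklearn.feature_extraction.text import TfidfVectorizer

vect = TfidfVectorizer(stop_words='english',   # drop common English words
                       ngram_range=(1, 2),     # use unigrams and bigrams as features
                       min_df=2)               # ignore terms that appear in fewer than 2 documents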

Each of our text documents will be represented by just one sentence, and all documents will be passed in via the array corpus.
The code below demonstrates how to get the document similarity matrix.

# -*- coding: utf-8 -*-

from sklearn.feature_extraction.text import TfidfVectorizer

from sklearn.metrics.pairwise import cosine_similarity
import pandas as pd

corpus=["I'd like an apple juice",
                            "An apple a day keeps the doctor away",
                             "Eat apple every day",
                             "We buy apples every week",
                             "We use machine learning for text classification",
                             "Text classification is subfield of machine learning"]

vect = TfidfVectorizer(min_df=1)
tfidf = vect.fit_transform(corpus)
print ((tfidf * tfidf.T).A)


"""
[[1.         0.2688172  0.16065234 0.         0.         0.        ]
 [0.2688172  1.         0.28397982 0.         0.         0.        ]
 [0.16065234 0.28397982 1.         0.19196066 0.         0.        ]
 [0.         0.         0.19196066 1.         0.13931166 0.        ]
 [0.         0.         0.         0.13931166 1.         0.48695659]
 [0.         0.         0.         0.         0.48695659 1.        ]]
""" 

We can print all our features, or the feature values for a specific document. In our example a feature is a single word, but it can also be 2 or more words (n-grams):

print(vect.get_feature_names())
#['an', 'apple', 'apples', 'away', 'buy', 'classification', 'day', 'doctor', 'eat', 'every', 'for', 'is', 'juice', 'keeps', 'learning', 'like', 'machine', 'of', 'subfield', 'text', 'the', 'use', 'we', 'week']
print(tfidf.shape)
#(6, 24)


print (tfidf[0])
"""
  (0, 15)	0.563282410145744
  (0, 0)	0.46189963418608976
  (0, 1)	0.38996740989416023
  (0, 12)	0.563282410145744
"""  

We can load the features into a pandas dataframe and print them from the dataframe in several ways:

df=pd.DataFrame(tfidf.toarray(), columns=vect.get_feature_names())

print (df)

"""
         an     apple    apples    ...          use        we      week
0  0.461900  0.389967  0.000000    ...     0.000000  0.000000  0.000000
1  0.339786  0.286871  0.000000    ...     0.000000  0.000000  0.000000
2  0.000000  0.411964  0.000000    ...     0.000000  0.000000  0.000000
3  0.000000  0.000000  0.479748    ...     0.000000  0.393400  0.479748
4  0.000000  0.000000  0.000000    ...     0.431849  0.354122  0.000000
5  0.000000  0.000000  0.000000    ...     0.000000  0.000000  0.000000
"""

with pd.option_context('display.max_rows', None, 'display.max_columns', None):   
    print(df)

"""
     doctor       eat     every       for        is     juice     keeps  \
0  0.000000  0.000000  0.000000  0.000000  0.000000  0.563282  0.000000   
1  0.414366  0.000000  0.000000  0.000000  0.000000  0.000000  0.414366   
2  0.000000  0.595054  0.487953  0.000000  0.000000  0.000000  0.000000   
3  0.000000  0.000000  0.393400  0.000000  0.000000  0.000000  0.000000   
4  0.000000  0.000000  0.000000  0.431849  0.000000  0.000000  0.000000   
5  0.000000  0.000000  0.000000  0.000000  0.419233  0.000000  0.000000   

   learning      like   machine        of  subfield      text       the  \
0  0.000000  0.563282  0.000000  0.000000  0.000000  0.000000  0.000000   
1  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.414366   
2  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000   
3  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000   
4  0.354122  0.000000  0.354122  0.000000  0.000000  0.354122  0.000000   
5  0.343777  0.000000  0.343777  0.419233  0.419233  0.343777  0.000000   

        use        we      week  
0  0.000000  0.000000  0.000000  
1  0.000000  0.000000  0.000000  
2  0.000000  0.000000  0.000000  
3  0.000000  0.393400  0.479748  
4  0.431849  0.354122  0.000000  
5  0.000000  0.000000  0.000000  

"""    
# this also prints the full dataframe, but not as nicely as above
print(df.to_string())    



print ("Second Column");
print (df.iloc[1])
"""
an                0.339786
apple             0.286871
apples            0.000000
away              0.414366
buy               0.000000
classification    0.000000
day               0.339786
doctor            0.414366
eat               0.000000
every             0.000000
for               0.000000
is                0.000000
juice             0.000000
keeps             0.414366
learning          0.000000
like              0.000000
machine           0.000000
of                0.000000
subfield          0.000000
text              0.000000
the               0.414366
use               0.000000
we                0.000000
week              0.000000
"""
print ("Second Column only values (without keys");
print (df.iloc[1].values)

"""
[0.33978594 0.28687063 0.         0.41436586 0.         0.
 0.33978594 0.41436586 0.         0.         0.         0.
 0.         0.41436586 0.         0.         0.         0.
 0.         0.         0.41436586 0.         0.         0.        ]
""" 

Finally, we can compute the document similarity matrix using cosine_similarity. We get the same matrix that we obtained at the beginning using just (tfidf * tfidf.T).A. This is because TfidfVectorizer normalizes each row to unit L2 norm by default (norm='l2'), so the plain dot product of the rows is already the cosine similarity.

print(cosine_similarity(df.values, df.values))

"""
[[1.         0.2688172  0.16065234 0.         0.         0.        ]
 [0.2688172  1.         0.28397982 0.         0.         0.        ]
 [0.16065234 0.28397982 1.         0.19196066 0.         0.        ]
 [0.         0.         0.19196066 1.         0.13931166 0.        ]
 [0.         0.         0.         0.13931166 1.         0.48695659]
 [0.         0.         0.         0.         0.48695659 1.        ]]
""" 

print ("Number of docs in corpus")
print (len(corpus))
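
The similarity matrix can also be used directly, for example to find, for each document, which other document is most similar to it (a sketch building on the matrix computed above):

import numpy as np

sim = cosine_similarity(df.values, df.values)
np.fill_diagonal(sim, 0)   # ignore the trivial similarity of a document with itself
for i, j in enumerate(sim.argmax(axis=1)):
    print ("doc %d is most similar to doc %d (score %.3f)" % (i, j, sim[i, j]))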

In this post we learned how to use the scikit-learn tf-idf vectorizer, retrieve the values in different formats, load them into a dataframe and calculate the document similarity matrix using either the raw tf-idf values or the cosine_similarity function from sklearn.metrics.pairwise. These techniques can be used in machine learning text analysis, information retrieval, text mining and many other areas where we need to convert textual data into numeric data (features).

References
1. Tf-idf – Wikipedia
2. TfidfVectorizer

Text Classification of Different Datasets with CNN Convolutional Neural Network and Python

In this post we explore machine learning text classification of 3 text datasets using a CNN (Convolutional Neural Network) in Keras and Python. As reported in papers and blog posts across the web, convolutional neural networks give good results in text classification.

Datasets

We will use the following datasets:
1. The 20 newsgroups text dataset that is available from scikit-learn here (a loading sketch is shown after this list).
2. A dataset of web pages. The web documents were downloaded manually from the web and belong to two categories: text mining or hidden Markov models (HMM). This is a small dataset that consists of only 20 pages for the text mining category and 11 pages for the HMM category.
3. A dataset of tweets about New Year's resolutions, obtained from data.world/crowdflower here.
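
For the first dataset, scikit-learn can fetch the newsgroups directly. The two categories below are only an illustration; the actual pair used is defined in the linked script:

from sklearn.datasets import fetch_20newsgroups

# hypothetical choice of two categories, the linked script defines its own pair
categories = ['rec.autos', 'sci.space']
newsgroups = fetch_20newsgroups(subset='train', categories=categories)
texts, labels = newsgroups.data, newsgroups.target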

Convolutional Neural Network Architecture

Our CNN will be based on Richard Liao's code from [1], [2]. We use a convolutional neural network built with layers such as Embedding, Conv1D, Flatten and Dense. For the embedding we use pretrained GloVe word vectors, which can be downloaded from the web.
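
As a rough sketch of that layer stack (the sizes and filter counts here are placeholders, not the values from Richard Liao's code, and a MaxPooling1D layer is added between convolution and flatten as is common):

import numpy as np
from keras.models import Sequential
from keras.layers import Embedding, Conv1D, MaxPooling1D, Flatten, Dense

MAX_NB_WORDS = 20000          # placeholder vocabulary size
MAX_SEQUENCE_LENGTH = 1000    # placeholder document length (in tokens)
EMBEDDING_DIM = 100           # GloVe vectors come in several dimensions, 100 here
NUM_CLASSES = 2

# in the real script this matrix is filled with the pretrained GloVe vectors
embedding_matrix = np.zeros((MAX_NB_WORDS, EMBEDDING_DIM))

model = Sequential()
model.add(Embedding(MAX_NB_WORDS, EMBEDDING_DIM,
                    weights=[embedding_matrix],
                    input_length=MAX_SEQUENCE_LENGTH,
                    trainable=False))
model.add(Conv1D(128, 5, activation='relu'))
model.add(MaxPooling1D(5))
model.add(Flatten())
model.add(Dense(128, activation='relu'))
model.add(Dense(NUM_CLASSES, activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['acc'])
model.summary()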

The data flow diagram with layers used is shown below.

CNN diagram

Here is the code for obtaining a convolutional neural net diagram like this. Insert it after the model.fit(...) line. Note that it requires the installation of pydot and graphviz.

model.fit(.....)

import pydot
# workaround for older keras/pydot versions that fail to detect graphviz
pydot.find_graphviz = lambda: True
print (pydot.find_graphviz())

import os
# make the locally installed Graphviz binaries visible (Windows install path)
os.environ["PATH"] += os.pathsep + "C:\\Program Files (x86)\\Graphviz2.38\\bin"

from keras.utils import plot_model
plot_model(model, to_file='model.png')

1D Convolution

In our neural net, convolution is performed in several 1-dimensional convolution layers (Conv1D).
1D convolution means that just one direction is used to calculate the convolution. [3]
For example:
input = [1,1,1,1,1], filter = [0.25,0.5,0.25], output = [1,1,1,1,1]
The output shape is a 1D array.
We can also apply 1D convolution to a 2D data matrix, as we do in text classification (see the NumPy sketch below).
A good explanation of convolution for text can be found in [6].
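
To make this concrete, here is the same filter applied with NumPy. This only illustrates the sliding-window idea; Keras's Conv1D learns its filter weights during training. Note that np.convolve with 'valid' mode drops the border positions, so the output is shorter than the input, while 'same' mode pads with zeros and damps the borders:

import numpy as np

signal = np.array([1, 1, 1, 1, 1], dtype=float)
kernel = np.array([0.25, 0.5, 0.25])

print (np.convolve(signal, kernel, mode='valid'))   # [1. 1. 1.]
print (np.convolve(signal, kernel, mode='same'))    # [0.75 1. 1. 1. 0.75]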

Text Classification of the 20 Newsgroups Text Dataset

For this dataset we use only 2 categories. The script is provided here. The accuracy of the network is 87%. Trained on 864 samples, validated on 215 samples.
Summary of run: loss: 0.6205 – acc: 0.6632 – val_loss: 0.5122 – val_acc: 0.8651

Document Classification of Web Pages

Here we also use 2 categories. The Python script is provided here.

The web pages were manually downloaded from the web and saved locally in two folders, one for each category. The script loads the web page files from local storage. Next comes a preprocessing step that removes HTML tags but keeps the text content. Here is the function for this:

from bs4 import BeautifulSoup

def get_only_text_from_html_doc(page):
    """
    return the title and the text of the article
    """
    soup = BeautifulSoup(page, "lxml")
    text = ' '.join(map(lambda p: p.text, soup.find_all('p')))
    return soup.title.text + " " + text
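
As a usage sketch (the folder names are placeholders for the two locally saved categories), the function can be applied to every saved page like this:

import os

texts, labels = [], []
for label, folder in enumerate(["text_mining", "hmm"]):   # hypothetical folder names
    for fname in os.listdir(folder):
        with open(os.path.join(folder, fname), encoding="utf-8") as f:
            texts.append(get_only_text_from_html_doc(f.read()))
            labels.append(label)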

Accuracy on this dataset was 100%, but it was not consistent; in some other runs the result was only 83%.
Trained on 25 samples, validated on 6 samples.
Summary of run – loss: 0.0096 – acc: 1.0000 – val_loss: 0.0870 – val_acc: 1.0000

Text Classification of Tweet Dataset

The script is provided here.
Here the accuracy was 93%. Trained on 4010 samples, validated on 1002 samples.
Summary of run – loss: 0.0193 – acc: 0.9958 – val_loss: 0.6690 – val_acc: 0.9281

Conclusion

We learned how to do text classification on 3 different types of text datasets (newsgroups, tweets, web documents). For text classification we used a Convolutional Neural Network in Python, and on all 3 datasets we got good accuracy.

References

1. Text Classification, Part I – Convolutional Networks
2. textClassifierConv
3. What do you mean by 1D, 2D and 3D Convolutions in CNN?
4. How to implement Sentiment Analysis using word embedding and Convolutional Neural Networks on Keras
5. Understanding Convolutional Neural Networks for NLP
6. Understanding Convolutions in Text
7. Recurrent Neural Networks I