Topic Modeling Python and Textacy Example

Topic modeling is the automatic discovery of the abstract “topics” that occur in a collection of documents. [1] It can be used to provide a more informative view of search results, a quick overview of a set of documents, or other similar services.

Textacy

In this post we will look at topic modeling with textacy. Textacy is a Python library for performing a variety of natural language processing (NLP) tasks, built on the high-performance spaCy library.
It can flexibly tokenize and vectorize documents and corpora, then train, interpret, and visualize topic models using LSA, LDA, or NMF methods. [2]
Textacy is less well known than other Python libraries such as NLTK, spaCy, and TextBlob [3], but it looks very promising because it is built on top of spaCy.

In this post we will use textacy for the following task: we have a group of documents and we want to extract topics from them. We will use the 20 Newsgroups dataset as the source of documents.

Code Structure

Our code consists of the following steps:
Get data. We will use only 2 groups (‘alt.atheism’, ‘soc.religion.christian’).
Tokenize and remove unneeded characters and stopwords.
Vectorize.
Extract topics. Here we do the actual topic modeling, using the Non-negative Matrix Factorization (NMF) method.
Output a graph of the term-topic matrix.
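
To give a feeling for what the NMF step does, here is a minimal sketch that factors a tiny, made-up document-term matrix into a document-topic matrix W and a topic-term matrix H. It uses the classic multiplicative update rules in plain NumPy. The toy matrix, the term list and the nmf helper are assumptions made for illustration; this is not the solver that textacy runs, only the same idea on a small scale.

```python
import numpy as np

def nmf(V, n_topics, n_iter=200, seed=0):
    """Factor a non-negative matrix V (docs x terms) into W (docs x topics)
    and H (topics x terms) with multiplicative updates."""
    rng = np.random.default_rng(seed)
    n_docs, n_terms = V.shape
    # Start from small positive random values so every entry stays non-negative.
    W = rng.random((n_docs, n_topics)) + 1e-3
    H = rng.random((n_topics, n_terms)) + 1e-3
    eps = 1e-9  # avoids division by zero
    for _ in range(n_iter):
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H

# Hypothetical toy data: rows are documents, columns are term counts.
terms = ["god", "faith", "church", "atheist", "evidence", "science"]
V = np.array([
    [3, 2, 1, 0, 0, 0],
    [2, 3, 2, 0, 0, 0],
    [0, 0, 0, 3, 2, 1],
    [0, 0, 0, 2, 3, 2],
], dtype=float)

W, H = nmf(V, n_topics=2)

# Each row of H holds the term weights of one topic; print the heaviest terms.
for k, row in enumerate(H):
    top = [terms[i] for i in np.argsort(row)[::-1][:3]]
    print("topic", k, ":", " ".join(top))
```

Each row of W then tells how strongly a document belongs to each topic, which is the same kind of information as the doc_topic_matrix in the source code further down.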

Output

Below is the final output plot.

Topic modeling with textacy

Looking at the output graph, we can see the distribution of terms over the topics. We identified more than 2 topics. For example, topic 2 is associated with atheism, while topic 1 is associated with God and religion.

While better data preparation is needed to remove a few more non-meaningful words, the example still shows that topic modeling with textacy is much easier than with some other modules (for example gensim). This is because textacy covers many of the steps you need after the core NLP processing, rather than doing only the NLP and leaving the user to add data views, heatmaps, or diagrams.

Here are a few links on topic modeling using LDA and gensim (not textacy). The posts show that this approach requires more code than textacy does.
Topic Extraction from Blog Posts with LSI, LDA and Python
Data Visualization – Visualizing an LDA Model using Python

Source Code

Below is the full Python source code.

# Loading the data set - training data.
from sklearn.datasets import fetch_20newsgroups

# We use only two of the 20 newsgroups.
categories = ['alt.atheism', 'soc.religion.christian']

newsgroups_train = fetch_20newsgroups(subset='train', shuffle=True, categories=categories, remove=('headers', 'footers', 'quotes'))

# You can check the target names (categories) and some data with the following commands.
print(newsgroups_train.target_names)  # prints the selected categories
print("\n".join(newsgroups_train.data[0].split("\n")[:3]))  # prints the first three lines of the first document
print(len(newsgroups_train.data))  # prints the number of documents

labels = newsgroups_train.target
texts = newsgroups_train.data

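from nltk.corpus import stopwords  # requires a one-time nltk.download('stopwords')

import textacy
import textacy.tm  # makes sure the topic modeling subpackage is loaded
from textacy.vsm import Vectorizer

# Build the stopword set once; calling stopwords.words() for every token is very slow.
stop_words = set(stopwords.words('english'))
terms_list = [[tok for tok in doc.split() if tok not in stop_words] for doc in texts]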
 

# Drop leftover punctuation tokens and capitalized stopwords.
# A new list is built for each document, because removing items from a list
# while iterating over it skips elements.
junk_tokens = {"|>", "_", "-", "#", "=", ":", "_/", "I", "A", "The", "But", "If", "It"}
terms_list = [[word for word in doc if word not in junk_tokens] for doc in terms_list]
      

print ("=====================terms_list===============================")
print (terms_list)


vectorizer = Vectorizer(tf_type='linear', apply_idf=True, idf_type='smooth')
doc_term_matrix = vectorizer.fit_transform(terms_list)


print ("========================doc_term_matrix=======================")
print (doc_term_matrix)



# Initialize and train a topic model.
n_topics = 20
model = textacy.tm.TopicModel('nmf', n_topics=n_topics)
model.fit(doc_term_matrix)

print("======================model=================")
print(model)

doc_topic_matrix = model.transform(doc_term_matrix)

# Print the top terms of the first two topics.
for topic_idx, top_terms in model.top_topic_terms(vectorizer.id_to_term, topics=[0, 1]):
    print('topic', topic_idx, ':', '   '.join(top_terms))

# Print the weight of each topic across the whole corpus.
for i, val in enumerate(model.topic_weights(doc_topic_matrix)):
    print(i, val)

print("vectorizer.id_to_term")
print(vectorizer.id_to_term)

# Plot the term-topic matrix and save the trained model.
# The file name now follows n_topics (it said 10 topics while the model used 20).
model.termite_plot(doc_term_matrix, vectorizer.id_to_term, topics=-1, n_terms=25, sort_terms_by='seriation')
model.save('nmf-%dtopics.pkl' % n_topics)


References
1. Topic Model
2. textacy: NLP, before and after spaCy
3. 5 Heroic Python NLP Libraries

Text Mining Techniques for Search Results Clustering

A text search box can be found in almost every web-based application that has text data. We use the search feature when we are looking for customer data, job descriptions, book reviews, or other information. Simple keyword matching can be enough for small tasks. However, when there are many results, something better than a keyword match would be very helpful. Instead of going through a long list of results, we would get results grouped by topic, with a concise summary of each topic. That would let us see the key information at a glance.

In this post we will look at some machine learning algorithms, applications, and frameworks that can analyze the output of a search function and provide useful additional information about the search results.

Machine Learning Clustering for Search Results

The search results clustering problem is defined as an automatic, on-line grouping of similar documents in a search results list returned from a search engine. [1] Carrot2 is a tool that was built to solve this problem.
Carrot2 is an open source framework for building a search results clustering engine. It can search, cluster, and visualize the clusters, which is very cool. I was not able to find a similar tool among open source projects. If you are aware of one, please suggest it in the comment box.

Below are screenshots of clustering search results with Carrot2.

Clustering search results with Carrot2
Aduna cluster map visualization clusters with Carrot2

The following algorithms are behind the Carrot2 tool:
The Lingo algorithm constructs a “term-document matrix” where each snippet gets a column, each word gets a row, and the values are the frequency of that word in that snippet. It then applies a matrix factorization called singular value decomposition (SVD). [3]

Suffix Tree Clustering (STC) uses the generalised suffix tree data structure to efficiently build a list of the most frequently used phrases in the snippets from the search results. [3]
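
The first two steps described for Lingo can be sketched in a few lines of NumPy. The sketch below builds a term-document matrix with one row per word and one column per snippet, then applies SVD. The snippets are made up for illustration, and only the matrix construction and the factorization are shown; how Lingo picks cluster labels and assigns snippets to clusters is not covered here.

```python
import numpy as np

# Hypothetical search result snippets.
snippets = [
    "python topic modeling tutorial",
    "topic modeling with python libraries",
    "cheap flights to paris",
    "paris flights and hotels",
]

# One row per distinct word, one column per snippet.
vocab = sorted({word for s in snippets for word in s.split()})
row_of = {word: i for i, word in enumerate(vocab)}

A = np.zeros((len(vocab), len(snippets)))
for col, s in enumerate(snippets):
    for word in s.split():
        A[row_of[word], col] += 1  # frequency of the word in this snippet

# Singular value decomposition: A = U * diag(S) * Vt
U, S, Vt = np.linalg.svd(A, full_matrices=False)

# The columns of U that belong to the largest singular values point at
# groups of words that tend to occur together.
for j in range(2):
    top = np.argsort(np.abs(U[:, j]))[::-1][:3]
    print("direction", j, ":", [vocab[i] for i in top])
```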

Topic modelling

Topic modelling is another approach, used to identify which topics are discussed in the documents or text snippets returned by a search function. There are several methods, such as LSA, pLSA, and LDA. [11]

A comprehensive overview of topic modeling and its associated techniques is given in [12].

Topic modeling can be represented by the diagram below. Our goal is to identify the topics, given the documents and the words they contain.

Topic modeling diagram

Below is the plate notation of the LDA model.

Plate notation of LDA model

Plate notation representing the LDA model. [19]
α (alpha) is the parameter of the Dirichlet prior on the per-document topic distributions,
β (beta) is the parameter of the Dirichlet prior on the per-topic word distribution,
p is the topic distribution for document m,
Z is the topic for the n-th word in document m, and
W is the specific word.
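
The plate notation reads as a recipe for generating documents, and that recipe can be sketched with NumPy. The variable names alpha, beta, p, z and w follow the list above. The vocabulary, the sizes and the name topic_word (the per-topic word distribution, which the list above does not name) are assumptions made for illustration. This only shows the generative story; fitting LDA to real documents runs the process in reverse and is not shown.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical vocabulary and sizes.
vocab = ["god", "faith", "church", "atheist", "evidence", "science"]
n_topics = 2
n_docs = 3
doc_length = 8
alpha = 0.5  # Dirichlet prior on the per-document topic distributions
beta = 0.5   # Dirichlet prior on the per-topic word distributions

# One word distribution per topic, drawn from Dirichlet(beta).
topic_word = rng.dirichlet([beta] * len(vocab), size=n_topics)

docs = []
doc_topic = []
for m in range(n_docs):
    # p: topic distribution for document m, drawn from Dirichlet(alpha).
    p = rng.dirichlet([alpha] * n_topics)
    words = []
    for n in range(doc_length):
        z = rng.choice(n_topics, p=p)                 # topic of the n-th word
        w = rng.choice(len(vocab), p=topic_word[z])   # the specific word
        words.append(vocab[w])
    doc_topic.append(p)
    docs.append(words)
    print(m, np.round(p, 2), " ".join(words))
```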

We can use different NLP libraries (NLTK, spaCy, gensim, textacy) for topic modeling.
Here is an example of topic modeling with the textacy Python library:
Topic Modeling Python and Textacy Example

Here are examples of topic modeling with the gensim library:
Topic Extraction from Blog Posts with LSI, LDA and Python
Data Visualization – Visualizing an LDA Model using Python

Using Word Embeddings

Word embeddings such as word2vec and GloVe have shown very good results in NLP and are now widely used, including for search results clustering. The first step is to create an embedding model, for example with gensim. In the next step, the text data are converted to a vector representation. Word embeddings improve performance by leveraging information on how words are semantically related to each other. [7][10]
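
The two steps above can be sketched as follows. To keep the example self-contained, the "model" is a small hand-made dictionary of 3-dimensional vectors; these numbers are invented for illustration and do not come from any trained model. In practice the vectors would come from a trained word2vec or GloVe model. Each search result is turned into a vector by averaging its word vectors, and results are compared by cosine similarity, which a clustering algorithm could then use to group them.

```python
import numpy as np

# Hypothetical word vectors standing in for a trained embedding model.
embeddings = {
    "python": np.array([0.9, 0.1, 0.0]),
    "code": np.array([0.8, 0.2, 0.1]),
    "library": np.array([0.7, 0.3, 0.0]),
    "flight": np.array([0.0, 0.1, 0.9]),
    "hotel": np.array([0.1, 0.2, 0.8]),
    "paris": np.array([0.1, 0.0, 0.9]),
}

def doc_vector(text):
    """Average the vectors of the known words in a text."""
    vectors = [embeddings[w] for w in text.lower().split() if w in embeddings]
    if not vectors:
        raise ValueError("no known words in: %r" % text)
    return np.mean(vectors, axis=0)

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

results = ["python code library", "python library", "flight hotel paris", "paris hotel"]
vecs = [doc_vector(r) for r in results]

sim_same = cosine(vecs[0], vecs[1])  # two programming-related results
sim_diff = cosine(vecs[0], vecs[2])  # programming vs. travel
print("same topic:", round(sim_same, 3), "different topic:", round(sim_diff, 3))
```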

Neural Topic Model (NTM) and Other Approaches

Below are some other approaches that can be used for topic modeling when organizing search results.
Neural topic modeling combines a neural network with a latent topic model. [14]
Topic modeling with Deep Belief Nets is described in [17]. The idea of the method is to take a bag-of-words (BOW) input and produce a strong latent representation that is then used for a content-based recommender system. The authors report that the model outperforms LDA, Replicated Softmax, and DocNADE models on document retrieval and document classification tasks.

Thus we have looked at different techniques for search results clustering. In future posts we will implement some of them. What machine learning methods do you use for presenting search results? I would love to hear.

References
1. Lingo Search Results Clustering Algorithm
2. Carrot2 Algorithms
3. Carrot2
4. Apache SOLR and Carrot2 integration strategies
5. Topical Clustering of Search Results
6. K-means clustering for text dataset
7. Document Clustering using Doc2Vec/word2vec
8. Automatic Topic Clustering Using Doc2Vec
9. Search Results Clustering Algorithm
10. LDA2vec: Word Embeddings in Topic Models
11. Topic Modelling in Python with NLTK and Gensim
12. Topic Modeling with LSA, PLSA, LDA & lda2Vec
13. Text Summarization with Amazon Reviews
14. A Hybrid Neural Network-Latent Topic Model
15. docluster
16. Deep Belief Nets for Topic Modeling
17. Modeling Documents with a Deep Boltzmann Machine
18. Beginners guide to topic modeling in python
19. Latent Dirichlet allocation