7+ Best Online Resources for Text Preprocessing for Machine Learning Algorithms

With advances in machine learning and natural language processing, and with the growing amount of information available on the web, the use of text data in machine learning algorithms is increasing. An important step in using text data is preprocessing the original raw text. The data preparation steps may include the following:

  • Tokenization
  • Removing punctuation
  • Removing stop words
  • Stemming
  • Word Embedding
  • Named-entity recognition (NER)
  • Coreference resolution – finding all expressions that refer to the same entity in a text
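
As a rough illustration of the first three steps in the list above, here is a minimal sketch that uses only the Python standard library. The stop list is a tiny made-up sample and the function name `preprocess` is an assumption for this example; the libraries reviewed below provide much more complete versions of each step.

```python
import string

# Tiny illustrative stop list; real projects usually take one from NLTK or spaCy.
STOP_WORDS = {"the", "is", "in", "that", "was", "i", "a", "an", "and", "of", "to"}

# Translation table that maps every ASCII punctuation character to a space.
PUNCT_TABLE = str.maketrans(string.punctuation, " " * len(string.punctuation))

def preprocess(text):
    """Lowercase, remove punctuation, tokenize on whitespace, drop stop words."""
    text = text.lower().translate(PUNCT_TABLE)
    tokens = text.split()
    return [t for t in tokens if t not in STOP_WORDS]

print(preprocess("The city is nice. I was in that city 10 days ago!"))
# ['city', 'nice', 'city', '10', 'days', 'ago']
```

Stemming, embeddings, NER and coreference resolution need trained models or rule sets, so they are left to the libraries discussed in the resources below.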

Recently published articles on this topic have greatly expanded the available examples of text preprocessing operations. In this post we collect and review online articles that describe text preprocessing techniques with python code examples.

1. textcleaner

Text-Cleaner is a utility library for text data preprocessing. It can be used before passing the text data to a model. textcleaner builds on open source projects such as NLTK (for advanced cleaning) and regular expressions.


  • main_cleaner does all the below in one call
  • remove unnecessary blank lines
  • transfer all characters to lowercase if needed
  • remove numbers, particular characters (if needed), symbols and stop-words from the whole text
  • tokenize the text-data on one call
  • stemming & lemmatization powered by NLTK
textcleaner saves time by providing basic cleaning functionality, which lets the developer focus on building the machine learning model. The nice thing is that it can do many text processing steps in one call.

    Here is an example of how to use it:

    import textcleaner as tc
    # out holds the result of the cleaning call (main_cleaner, see the list above) applied to the input text
    print (out)
    input text:
    The house235 is very small!!
    the city is nice.
    I was in that city 10 days ago.
    The city2 is big.
    output text:
    [['hous', 'small'], ['citi', 'nice'], ['citi', 'day', 'ago'], ['citi', 'big']]

    2. Guide for Text Preprocessing from Analytics Vidhya

    Analytics Vidhya regularly provides great practical resources about AI, ML and analytics. In this ‘Ultimate guide to deal with Text Data’ you can find a description of text preprocessing steps with python code. Different python libraries are used for solving text preprocessing tasks:
    NLTK – for stop list, stemming

    TextBlob – for spelling correction, tokenization, lemmatization. TextBlob is a Python library for processing textual data. It provides a simple API for diving into common natural language processing (NLP) tasks such as part-of-speech tagging, noun phrase extraction, sentiment analysis, classification, translation, and more.

    gensim – for word embeddings

    sklearn – for feature_extraction with TF-IDF

    The guide covers text processing steps from basic to advanced.
    Basic steps:

    • Lower casing
    • Punctuation, stopwords, frequent and rare words removal
    • Spelling correction
    • Tokenization
    • Stemming
    • Lemmatization

    Advanced text processing:

    • N-grams
    • Term Frequency, Inverse Document Frequency
    • Term Frequency-Inverse Document Frequency (TF-IDF)
    • Bag of Words
    • Sentiment Analysis
    • Word Embedding
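
To make the term frequency and inverse document frequency items above concrete, here is a small from-scratch sketch. It uses the plain textbook definitions (stated in the docstring) and is not the guide's code; sklearn's TfidfVectorizer, which the guide uses, applies a smoothed idf and normalization by default, so its numbers will differ. The function name `tf_idf` and the toy documents are assumptions for this example.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {term: weight} dict per tokenized, non-empty document.

    tf  = count of the term in the document / number of tokens in the document
    idf = log(number of documents / number of documents containing the term)
    """
    n_docs = len(docs)
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(doc))  # count each term once per document
    weights = []
    for doc in docs:
        counts = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights

docs = [["city", "nice"], ["city", "big"], ["house", "small"]]
for w in tf_idf(docs):
    print(w)
```

In the first document "nice" gets a higher weight than "city", because "city" also appears in another document and is therefore less distinctive.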

    3. Guide to Natural Language Processing  

    Often we extract text data from the web and need to strip out HTML before feeding it to ML algorithms.
    Dipanjan (DJ) Sarkar shows how to do this in his post ‘A Practitioner’s Guide to Natural Language Processing (Part I) — Processing & Understanding Text’.

    The post contains a project for downloading HTML with the BeautifulSoup python library, extracting useful text from the HTML, and doing part-of-speech analysis, sentiment analysis and NER.
    It uses the following text processing python libraries for machine learning:
    spacy – spaCy now features new neural models for tagging, parsing and entity recognition (in v2.0)
    nltk – leading platform for building Python programs for natural language processing.

    Basic text preprocessing steps covered:

    • Removing HTML tags
    • Removing accented characters, Special Characters, Stopwords
    • Expanding Contractions
    • Stemming
    • Lemmatization
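
The first two steps in the list above can be sketched with the standard library alone. Note that this is a stand-in: the guide itself uses BeautifulSoup for HTML, while the sketch below uses `html.parser` so that it runs without extra packages. The helper names `strip_html` and `remove_accents` are assumptions for this example.

```python
import unicodedata
from html.parser import HTMLParser

class _TextExtractor(HTMLParser):
    """Collects only the text nodes of an HTML fragment."""
    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)

def strip_html(html):
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return "".join(parser.parts)

def remove_accents(text):
    # NFKD splits an accented letter into base letter + combining mark;
    # encoding to ASCII with "ignore" then drops the combining mark.
    return unicodedata.normalize("NFKD", text).encode("ascii", "ignore").decode("ascii")

print(remove_accents(strip_html("<p>Sómě <b>café</b> text</p>")))
# Some cafe text
```

This simple extractor keeps the contents of every element, including script and style blocks, so real pages usually need extra filtering.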

    In addition to the basic steps above, the guide also covers parsing techniques for understanding the structure and syntax of language, including:

    • Parts of Speech (POS) Tagging
    • Shallow Parsing or Chunking
    • Constituency Parsing
    • Dependency Parsing
    • Named Entity Recognition

    4. Natural Language Processing

    In the article ‘Natural Language Processing is Fun’ you will find descriptions of the following text preprocessing steps:

    • Sentence Segmentation
    • Word Tokenization
    • Predicting Parts of Speech for Each Token
    • Text Lemmatization
    • Identifying Stop Words
    • Dependency Parsing
    • Named Entity Recognition (NER)
    • Coreference Resolution

    The article explains thoroughly how computers understand textual data by dividing text processing into the above steps. Diagrams make the concepts easy to understand. The steps above constitute a natural language processing text pipeline, and it turns out that with spaCy you can do most of them in only a few lines.

    Here is an example of using spaCy:

    import spacy
    # Load the large English NLP model
    nlp = spacy.load('en_core_web_lg')
    # f is the path to the input text file (placeholder name, not given in the post)
    f = "input.txt"
    with open(f) as ftxt:
        text = ftxt.read()
    print (text)
    # Parse the text with spaCy.
    doc = nlp(text)
    # Print the linguistic attributes of every token
    for token in doc:
        print(token.text, token.lemma_, token.pos_, token.tag_, token.dep_,
              token.shape_, token.is_alpha, token.is_stop)
    # Print the named entities found in the text
    for ent in doc.ents:
        print(ent.text, ent.start_char, ent.end_char, ent.label_)
    Partial output of the above program:
    I -PRON- PRON PRP nsubj X True False
    was be VERB VBD ROOT xxx True False
    in in ADP IN prep xx True False
    that that DET DT det xxxx True False
    city city NOUN NN pobj xxxx True False
    10 10 NUM CD nummod dd False False
    days day NOUN NNS npadvmod xxxx True False
    ago ago ADV RB advmod xxx True False
    . . PUNCT . punct . False False
    10 days ago 66 77 DATE

    5. Learning from Text Summarization Project

    This is the project ‘Text Summarization with Amazon Reviews’. The reviews are about food, but the first part of the project contains text preprocessing steps. These include converting to lowercase, replacing contractions with their longer forms, and removing unwanted characters.

    To expand contractions, the author uses a list of contractions from Stack Overflow.
    Using the list and the code from this link, we can replace, for example:
    you’ve with you have
    she’s with she is
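
The replacement described above can be sketched with a dictionary and one regular expression. The mapping below contains only a few sample entries chosen for illustration (the list used in the project is much longer), and the function name `expand_contractions` is an assumption, not the project's code.

```python
import re

# A few sample entries; the list referenced in the project is much longer.
CONTRACTIONS = {
    "you've": "you have",
    "she's": "she is",
    "can't": "cannot",
    "won't": "will not",
}

# One alternation pattern that matches any of the known contractions.
_PATTERN = re.compile(
    r"\b(" + "|".join(re.escape(c) for c in CONTRACTIONS) + r")\b",
    flags=re.IGNORECASE,
)

def expand_contractions(text):
    # Normalize the typographic apostrophe so "you’ve" matches "you've".
    text = text.replace("\u2019", "'")
    return _PATTERN.sub(lambda m: CONTRACTIONS[m.group(0).lower()], text)

print(expand_contractions("You’ve seen that she's right."))
# you have seen that she is right.
```

The replacement text is always lowercase, which is harmless when lowercasing is one of the preprocessing steps anyway. A fixed mapping cannot resolve ambiguous forms: "she's" can also mean "she has".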

    6. Text Preprocessing Methods for Deep Learning

    This is a primer on word2vec embeddings, but it also covers basic preprocessing techniques for text data, such as:

    • Cleaning Special Characters and Removing Punctuations
    • Cleaning Numbers
    • Removing Misspells
    • Removing Contractions
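
The first two items in the list above can be approximated with a few regular expressions. This is a sketch under assumptions, not the article's code: the choice of "#" as a placeholder for numbers and the function name `clean_text` are made up for this example.

```python
import re

def clean_text(text):
    # Replace anything that is not a word character or whitespace with a space.
    text = re.sub(r"[^\w\s]", " ", text)
    # Replace each run of digits with a single placeholder token.
    text = re.sub(r"\d+", "#", text)
    # Collapse repeated whitespace left over from the steps above.
    return re.sub(r"\s+", " ", text).strip()

print(clean_text("The house235 is very small!!  Call 555-1234."))
# The house# is very small Call # #
```

Masking numbers instead of deleting them keeps the information that a number was present, which can matter when the cleaned text is later matched against a pretrained embedding vocabulary.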

    7. Text Preprocessing in Python

    This is another great resource about text preprocessing steps with python. In addition to the basic steps, it shows how to do collocation extraction, relationship extraction and NER. The article has many links to other articles on text preprocessing techniques.

    The article also compares many natural language processing toolkits, such as NLTK and spaCy, by features, programming language and license. The table links to the project page of each toolkit. So it is a very handy source where you can find descriptions of text processing steps, the tools used, usage examples and links to many other resources.


    The above resources show how to perform text data preprocessing, from basic to advanced steps, with different python libraries. Below you can find the above links and a few more links to resources on the same topic.
    Feel free to provide feedback, comments, or links to resources that are not mentioned here.


    1. textcleaner
    2. Ultimate guide to deal with Text Data (using Python) – for Data Scientists & Engineers
    3. A Practitioner’s Guide to Natural Language Processing (Part I) — Processing & Understanding Text
    4. Natural Language Processing is Fun
    5. Text Summarization with Amazon Reviews
    6. NLP Learning Series: Text Preprocessing Methods for Deep Learning
    7. Text Preprocessing in Python: Steps, Tools, and Examples
    8. Text Data Preprocessing: A Walkthrough in Python
    9. Text Preprocessing, Keras Documentation
    10. What is the best way to remove accents in a Python unicode string?
    12. Processing Raw Text
    13. TextBlob: Simplified Text Processing

    Chatbots Examples with ChatterBot – How to Add Logic

    In the previous post, How to Create a Chatbot with ChatBot Open Source and Deploy It on the Web, I described how to deploy ChatterBot on the pythonanywhere hosting site with the Django web framework. In this post we will look at a few useful chatbot examples for implementing logic in our chatbot. The chatbot was developed in the previous post and is based on the ChatterBot python library.

    Making Chatbot Start Conversation with Specific Question

    Suppose we want to start the conversation with a specific sentence that the chatbot needs to show. For example, when the user opens the website, the chatbot can start with a specific question: “Did you find what you were looking for?”
    Or, if you are building a chatbot for conversation about the previous day's or week's work, you probably want to start with “How was your previous day/week in terms of progress to goal?” How can we do this with the ChatterBot framework?

    Conversation diagram

    It turns out that ChatterBot has several logic adapters that allow us to build a conversation to fit different requirements.

    Here is how I used the logic adapter SpecificResponseAdapter to make the chatbot start with an initial predefined question:

    chatbot = ChatBot("mybot",
                logic_adapters=[{
                    'import_path': 'chatterbot.logic.SpecificResponseAdapter',
                    'input_text': 'prev_day_disk',
                    'output_text': 'How much did you do toward your goal on previous day?'
                }])

    In views.py, which was created in the previous post [1], I pass the input “prev_day_disk” to runbot instead of a blank string. At the beginning of the chat there is no user input, so I use this value as input_text to get the desired output specified in output_text.

    def press_my_buttons(request):
        resp = ""
        conv = ""
        if request.POST:
            conv = request.POST.get('conv', '')
            user_input = request.POST.get('user_input', '')
            userid = request.POST.get('userid', '')
            if (userid == ""):
                pass  # (handling of a missing user id is not shown here)
            resp = runbot(user_input, request, userid)
            conv = conv + "" + str(user_input) + "\n" + "BOT:" + str(resp) + "\n"
        else:
            # First page load: no user input yet, so send the trigger text
            resp = runbot("prev_day_disk", request, "")
            conv = "BOT:" + str(resp) + "\n"
        return render(request, 'my_template.html', {'conv': conv})

    SpecificResponseAdapter can also be used in other places of the conversation (not just at the beginning). For example, if there is no input from the user for 15 seconds and the user is not typing anything (I am not sure yet how easy it is to check whether the user is typing), we could switch the conversation to a new topic by making the chatbot app send a new question.

    How to Add Intelligence to Chatbot App

    After the user replies to the question about how his/her week was, I want the chatbot to be able to recognize the response as belonging to one of 3 groups: bad, so-so, good. This is a machine learning classification problem. However, here we will not create a text classification algorithm; instead we will use built-in functionality.

    We will use another logic adapter, called BestMatch.
    With this adapter we need to specify statement_comparison_function and response_selection_method:

    chatbot = ChatBot("mybot",
                logic_adapters=[
                    {
                        'import_path': 'chatterbot.logic.SpecificResponseAdapter',
                        'input_text': 'How much did you do toward your goal on previous day?',
                        'output_text': 'How much did you do toward your goal on previous day?'
                    },
                    {
                        "import_path": "chatterbot.logic.BestMatch",
                        "statement_comparison_function": "chatterbot.comparisons.levenshtein_distance",
                        "response_selection_method": "chatterbot.response_selection.get_first_response"
                    }
                ])

    Best Match Adapter – is a logic adapter that returns a response based on known responses to the closest matches to the input statement. [2]

    The best match adapter uses a function to compare the input statement to known statements. Once it finds the closest match to the input statement, it uses another function to select one of the known responses to that statement.

    To use this adapter for the above example I need to create at least 1 sample for each group. In the example below I used the following to test the chatbot on 2 groups (skipping so-so).

    if (train_bot == True):
        # assumes the ChatterBot 0.8-style API, with ListTrainer imported from chatterbot.trainers
        chatbot.set_trainer(ListTrainer)
        chatbot.train([
            "I did not do much this week",
            "Did you run into the problems with programs or just did not have time?"])
        chatbot.train([
            "I did a lot of progress",
            "Fantastic! Keep going on"])

    After the training, if the user enters something close to “I did not do much this week”, the chatbot will respond with “Did you run into the problems with programs or just did not have time?”, and if the user enters something like “Did a lot of progress”, the bot's response will be “Fantastic! Keep going on”, even though the input is slightly different from the training data.
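
The matching behaviour described above can be imitated in a few lines without ChatterBot. The sketch below is only an illustration of the idea, not how ChatterBot is implemented: it uses `difflib.SequenceMatcher` from the standard library as a stand-in for the Levenshtein comparison, and the 0.65 threshold and fallback text are borrowed from the low-confidence settings used in the deployment post. The names `KNOWN` and `best_match_response` are made up for this example.

```python
from difflib import SequenceMatcher

# Known statement -> response pairs, as in the training example above.
KNOWN = {
    "I did not do much this week":
        "Did you run into the problems with programs or just did not have time?",
    "I did a lot of progress":
        "Fantastic! Keep going on",
}
DEFAULT_RESPONSE = "I am sorry, but I do not understand."

def best_match_response(user_input, threshold=0.65):
    """Return the response attached to the most similar known statement."""
    def similarity(a, b):
        return SequenceMatcher(None, a.lower(), b.lower()).ratio()

    best = max(KNOWN, key=lambda known: similarity(user_input, known))
    if similarity(user_input, best) < threshold:
        return DEFAULT_RESPONSE  # nothing is close enough
    return KNOWN[best]

print(best_match_response("Did a lot of progress"))
print(best_match_response("How do I make a neural network?"))
```

The first call is close enough to a known statement to return its response. Inputs that resemble nothing in `KNOWN` fall back to the default response.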

    So we looked at how to build a chatbot with logic that lets it ask questions as needed, or classify user input into several buckets and respond accordingly. The code for this post is provided at the link listed below [3].

    1. How to Create a Chatbot with ChatBot Open Source and Deploy It on the Web
    2. ChatterBot – Logic
    3. Python Chatterbot Example with Added Logic – source code

    How to Create a Chatbot with ChatBot Open Source and Deploy It on the Web

    Chatbots have become very popular due to progress in AI, ML and NLP. They are now used on many websites. With the increased popularity of chatbots, there are many different frameworks for creating them. We will explore one such framework in this post. We will review how to create a chatbot based on the open source ChatterBot library and deploy it online. Our platform will be Django on pythonanywhere.

    This chatbot tutorial introduces the basic concepts and terms needed to understand and deploy ChatterBot, and provides a simple usage example.

    What is ChatterBot?

    ChatterBot is a Python library that makes it easy to generate automated responses to a user’s input. To produce different types of responses, ChatterBot uses a selection of machine learning algorithms such as search and/or classification algorithms. This makes it easy for developers to create chat bots and automate conversations with users. [1]

    To install this open source chatbot library you just need to run:
    pip install chatterbot

    ChatterBot Features

    The functionality of ChatterBot is provided through adapters – pluggable classes that allow a ChatBot instance to execute some kind of functionality.

    ChatterBot has the following groups of adapters:

    • input adapters
    • output adapters
    • storage adapters
    • logic adapters

    Inside each group there are several adapters that support different functionality. For example, within logic adapters we can use Best Match Adapter, Time Logic Adapter and a few others.

    Below is the output of running the chatbot example:
    USER: How are you doing?
    BOT:I am fine
    USER: Hello
    BOT:Hi there!
    USER: Good morning!
    BOT:How are today!
    USER: Do you like machine learning?
    BOT:Yes, I like machine learning
    USER: How do I make a neural network?
    BOT:I am sorry, but I do not understand.
    USER: Let us talk about current activities
    BOT:What are working now?
    USER: I am just browsing Internet for news
    BOT:What a waste of time! Dont you have any other things to do?
    USER: I am working on python script to make new chatbot
    BOT:This is great. Keep working on this
    USER: Bye

    During testing I confirmed that the bot responds correctly to new (but similar) inputs, even if they do not exactly match the training data.

    Once you have trained the bot, the result of the training persists even after the PC is turned off and on. So if you run the program multiple times, you need to run the training only the first time. You can still run training later, for example if you want to retrain the bot or update the python chatbot code.


    This section describes how to deploy ChatterBot on the pythonanywhere web hosting site with the Django web framework.

    PythonAnywhere is an online integrated development environment (IDE) and web hosting service based on the Python programming language.[3] It has a free account that allows us to deploy our chatbot. Django is needed to connect the web front end with our python code for ChatterBot in the backend. Other web frameworks like Flask can be used on pythonanywhere instead of Django.

    The diagram below shows the setup that will be described.

    Chatbot online diagram

    Here is how ChatterBot can be deployed on pythonanywhere with Django:

  • Go to pythonanywhere and select plan signup
  • Select Create new web framework
  • Select Django
  • Select python version (3.6 was used in this example)
  • Select project name directory
  • It will create the following:

  • create folder cbot under user folder
  • /home/user/cbot

  • inside of this folder create
  • __init__.py (this is just empty file)

    Inside chatbotpy, wrap everything into a function runbot as shown below. In this function we create the chatbot object, take the user input, train the chatbot if needed, and then ask the chatbot for a response. The response provided by the chatbot is the output of this function. The input of this function is the user input that we receive through the web.

    def runbot(inp, train_bot=False):
        from chatterbot import ChatBot
        # LowConfidenceAdapter is available in ChatterBot versions before 1.0
        chatbot = ChatBot("mychatbot",
            logic_adapters=[
                {
                    "import_path": "chatterbot.logic.BestMatch",
                    "statement_comparison_function": "chatterbot.comparisons.levenshtein_distance",
                    "response_selection_method": "chatterbot.response_selection.get_first_response"
                },
                {
                    'import_path': 'chatterbot.logic.LowConfidenceAdapter',
                    'threshold': 0.65,
                    'default_response': 'I am sorry, but I do not understand.'
                }
            ])
        if train_bot:
            print("Training")
            # Insert here the training code from the python code for ChatterBot example
        response = chatbot.get_response(inp)
        return response

    Now update views.py as shown in the code below. Here we take the user input from the web, feed it to the runbot function and send the output of runbot (which is the chatbot response) to the web template.

    from django.shortcuts import render
    from cbot.chatbotpy import runbot
    def press_my_buttons(request):
        resp = ""
        conv = ""
        if request.POST:
            conv = request.POST.get('conv', '')
            user_input = request.POST.get('user_input', '')
            # Ask the chatbot for a response to the submitted text
            resp = runbot(user_input)
            conv = conv + "" + str(user_input) + "\n" + "BOT:" + str(resp) + "\n"
        else:
            # First page load: there is no user input yet
            resp = runbot("")
            conv = "BOT:" + str(resp) + "\n"
        return render(request, 'my_template.html', {'conv': conv})

    Now update my_template.html as shown below. Here we just show the new response together with the previous conversation.

    <form method="post">
        {% csrf_token %}
        <textarea rows=20 cols=60>{{conv}}</textarea>
        <input type="text" name="user_input" value=""/>
        <button type="submit">Submit</button>
        <input type="hidden" name="conv" value="{{conv}}" />
    </form>

    Now update some configuration.
    Update or confirm manage.py to include the line with settings.

    if __name__ == '__main__':
        os.environ.setdefault('DJANGO_SETTINGS_MODULE', 'cbotdjango.settings')

    Update or confirm urls.py to have the path as shown below

    import sys
    path = "/home/user/cbotdjango"
    if path not in sys.path:
        sys.path.append(path)

    Now you are ready to test the chatbot online, and you should see a screen similar to the one on the right side of the setup diagram.


    We saw how to train ChatterBot and looked at some of its functionality. We investigated how to install ChatterBot on pythonanywhere with the Django web framework. I hope this will make it easier to deploy a chatbot if you are going this route. If you have any tips or anything else to add, please leave a comment below.

    1. ChatterBot
    2. Python chatbot code example
    3. Python Anywhere

    Text Clustering with doc2vec Word Embedding Machine Learning Model

    In this post we will look at the doc2vec embedding model and at how to build it or use a pretrained embedding file. As a practical example we will explore how to do text clustering with a doc2vec model.


    Doc2vec is an unsupervised computer algorithm to generate vectors for sentence/paragraphs/documents. The algorithm is an adaptation of word2vec which can generate vectors for words. Below you can see frameworks for learning word vector word2vec (left side) and paragraph vector doc2vec (right side). For learning doc2vec, the paragraph vector was added to represent the missing information from the current context and to act as a memory of the topic of the paragraph. [1]

    Word Embeddings Machine Learning Frameworks: word2vec and doc2vec

    If you need information about word2vec here are some posts:
    word2vec –
    Vector Representation of Text – Word Embeddings with word2vec
    word2vec application –
    Text Analytics Techniques with Embeddings
    Using Pretrained Word Embeddings in Machine Learning
    K Means Clustering Example with Word2Vec in Data Mining or Machine Learning

    The vectors generated by doc2vec can be used for tasks like finding similarity between sentences / paragraphs / documents. [2] With doc2vec you can get a vector for a sentence or paragraph directly from the model, without the additional computations you would need with word2vec. For example, in this post we used a function to go from word level to sentence level:
    Text Clustering with Word Embedding in Machine Learning
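
Similarity between two document vectors is commonly measured with cosine similarity. The sketch below shows the computation in plain Python; the three vectors are made-up numbers standing in for inferred paragraph vectors, not output from a real model.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Made-up 3-dimensional vectors standing in for inferred paragraph vectors.
doc_a = [0.2, 0.1, -0.4]
doc_b = [0.4, 0.2, -0.8]   # same direction as doc_a, twice as long
doc_c = [-0.1, 0.5, 0.0]

print(cosine_similarity(doc_a, doc_b))
print(cosine_similarity(doc_a, doc_c))
```

Because cosine similarity ignores vector length, doc_a and doc_b score approximately 1.0 (up to floating-point rounding), while doc_a and doc_c score much lower.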

    word2vec was very successful, and it inspired the idea of converting many other specific kinds of text to vectors. This can be called “anything to vector”. So there are many different embedding models that, like doc2vec, can convert more than one word to a numeric vector. [3][4] Here are a few examples:

    tweet2vec – Tweet2Vec: Character-Based Distributed Representations for Social Media
    lda2vec – Mixing Dirichlet Topic Models and Word Embeddings to Make lda2vec. The proposed model learns dense word vectors jointly with Dirichlet-distributed latent document-level mixtures of topic vectors.
    Topic2Vec – Learning Distributed Representations of Topics
    Med2vec – Multi-layer Representation Learning for Medical Concepts
    The list can go on. In the next section we will look at how to load doc2vec and use it for text clustering.

    Building doc2vec Model

    Here is an example of converting a paragraph of words to a vector using our own doc2vec model. The example is taken from [5].

    The script consists of the following main steps:

    • build model using own text
    • save model to file
    • load model from this file
    • infer vector representation
    from gensim.test.utils import common_texts
    from gensim.models.doc2vec import Doc2Vec, TaggedDocument
    print (common_texts)
    [['human', 'interface', 'computer'], ['survey', 'user', 'computer', 'system', 'response', 'time'], ['eps', 'user', 'interface', 'system'], ['system', 'human', 'system', 'eps'], ['user', 'response', 'time'], ['trees'], ['graph', 'trees'], ['graph', 'minors', 'trees'], ['graph', 'minors', 'survey']]
    documents = [TaggedDocument(doc, [i]) for i, doc in enumerate(common_texts)]
    print (documents)
    [TaggedDocument(words=['human', 'interface', 'computer'], tags=[0]), TaggedDocument(words=['survey', 'user', 'computer', 'system', 'response', 'time'], tags=[1]), TaggedDocument(words=['eps', 'user', 'interface', 'system'], tags=[2]), TaggedDocument(words=['system', 'human', 'system', 'eps'], tags=[3]), TaggedDocument(words=['user', 'response', 'time'], tags=[4]), TaggedDocument(words=['trees'], tags=[5]), TaggedDocument(words=['graph', 'trees'], tags=[6]), TaggedDocument(words=['graph', 'minors', 'trees'], tags=[7]), TaggedDocument(words=['graph', 'minors', 'survey'], tags=[8])]
    # vector_size is the current name of the older size parameter
    model = Doc2Vec(documents, vector_size=5, window=2, min_count=1, workers=4)
    #Persist a model to disk:
    from gensim.test.utils import get_tmpfile
    fname = get_tmpfile("my_doc2vec_model")
    print (fname)
    #output: C:\Users\userABC\AppData\Local\Temp\my_doc2vec_model
    model.save(fname)
    #load model from saved file
    model = Doc2Vec.load(fname)  
    # you can continue training with the loaded model!
    #If you’re finished training a model (=no more updates, only querying, reduce memory usage), you can do:
    model.delete_temporary_training_data(keep_doctags_vectors=True, keep_inference=True)
    #Infer vector for a new document:
    #Here our text paragraph just 2 words
    vector = model.infer_vector(["system", "response"])
    print (vector)
    [-0.08390492  0.01629403 -0.08274432  0.06739668 -0.07021132]

    Using Pretrained doc2vec Model

    We can skip the step of building the embedding file and use an already built file. Here is an example of how to use a pretrained embedding file to represent test docs as vectors. The script is based on [6].

    The script below uses a doc2vec model pretrained on Wikipedia data, from this location

    Here is the link where you can find links to different pre-trained doc2vec and word2vec models and additional information.

    You need to download the zip file, unzip it, put the 3 files in some folder and provide the path in the script. In this example it is “doc2vec/doc2vec.bin”

    The main steps of the script below are just loading the doc2vec model and inferring vectors.

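    import gensim.models as g
    import codecs
    #parameters (model path as described above; the other two file names are placeholders)
    model = "doc2vec/doc2vec.bin"
    test_docs = "test_docs.txt"
    output_file = "test_vectors.txt"
    #inference hyper-parameters (values are not given in the text; illustrative only)
    start_alpha = 0.01
    infer_epoch = 1000
    #load model
    m = g.Doc2Vec.load(model)
    test_docs = [ x.strip().split() for x in codecs.open(test_docs, "r", "utf-8").readlines() ]
    #infer test vectors
    output = open(output_file, "w")
    for d in test_docs:
        output.write( " ".join([str(x) for x in m.infer_vector(d, alpha=start_alpha, steps=infer_epoch)]) + "\n" )
    output.close()
    Output file: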
    0.03772797 0.07995503 -0.1598981 0.04817521 0.033129826 -0.06923918 0.12705861 -0.06330753 .........

    So we got an output file with vectors (one per paragraph). That means we successfully converted our text to vectors. Now we can use them in different machine learning tasks such as text classification, text clustering and many others. The next section shows an example of the Birch clustering algorithm with these embeddings.
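
To use the saved vectors later, the output file has to be read back into numeric form. Here is a minimal sketch, assuming the one-vector-per-line, space-separated layout written by the script above. The helper name `read_vectors` is made up for this example, and an in-memory `io.StringIO` stands in for the real file so that the snippet runs on its own.

```python
import io

def read_vectors(lines):
    """Parse an iterable of text lines into a list of float vectors."""
    vectors = []
    for line in lines:
        line = line.strip()
        if line:  # skip blank lines
            vectors.append([float(x) for x in line.split()])
    return vectors

# Stand-in for an opened vectors file; the numbers are sample values.
sample = io.StringIO("0.03772797 0.07995503 -0.1598981\n0.5 -0.25 0.125\n")
X = read_vectors(sample)
print(len(X), len(X[0]))
# 2 3
```

With a real file, the same function can be called as `read_vectors(open(path))`, where `path` points at the vectors file.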

    Using Pretrained doc2vec Model for Text Clustering (Birch Algorithm)

    In this example we use the Birch clustering algorithm to cluster the text data file from [6].
    Birch is an unsupervised algorithm used for hierarchical clustering. An advantage of this algorithm is its ability to cluster incoming data incrementally and dynamically. [7]

    We use the following steps here:

    • Load doc2vec model
    • Load text docs that will be clustered
    • Convert docs to vectors (infer_vector)
    • Do clustering
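    from sklearn import metrics
    from sklearn.cluster import Birch
    import gensim.models as g
    import codecs
    #parameters (model path as described above; the docs file name is a placeholder)
    model = "doc2vec/doc2vec.bin"
    test_docs = "test_docs.txt"
    #inference hyper-parameters (values are not given in the text; illustrative only)
    start_alpha = 0.01
    infer_epoch = 1000
    #number of clusters (assumed to be 3, since the output below contains labels 0, 1 and 2)
    k = 3
    #load model
    m = g.Doc2Vec.load(model)
    test_docs = [ x.strip().split() for x in codecs.open(test_docs, "r", "utf-8").readlines() ]
    print (test_docs)
    #output: [['the', 'cardigan', 'welsh', 'corgi'........
    #convert docs to vectors
    X = []
    for d in test_docs:
        X.append( m.infer_vector(d, alpha=start_alpha, steps=infer_epoch) )
    brc = Birch(branching_factor=50, n_clusters=k, threshold=0.1, compute_labels=True)
    #the model has to be fitted before predict can be called
    brc.fit(X)
    clusters = brc.predict(X)
    labels = brc.labels_
    print ("Clusters: ")
    print (clusters)
    silhouette_score = metrics.silhouette_score(X, labels, metric='euclidean')
    print ("Silhouette_score: ")
    print (silhouette_score)
    #output (clusters): [1 0 0 1 1 2 1 0 1 1]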

    If you want to experiment with text clustering and word embeddings, here is the online demo. Currently it uses word2vec and glove models and the k-means clustering algorithm. Select the ‘Text Clustering’ option and scroll down to the input data.


    We looked at what doc2vec is and investigated 2 ways to load this model: we can create an embedding model file from our own text or use a pretrained embedding file. We applied doc2vec with the Birch algorithm for text clustering. When we need to work with paragraphs / sentences / docs, doc2vec can simplify converting text to vectors.

    1. Distributed Representations of Sentences and Documents
    2. What is doc2vec?
    3. Anything to Vec
    4. Anything2Vec, or How Word2Vec Conquered NLP
    5. models.doc2vec – Doc2vec paragraph embeddings
    6. doc2vec
    7. BIRCH

    Text Clustering with Word Embedding in Machine Learning

    Text clustering is widely used in many applications such as recommender systems, sentiment analysis, topic selection and user segmentation. Word embeddings (for example word2vec) allow us to exploit the ordering of the words and the semantic information from the text corpus. In this blog you can find several posts dedicated to different word embedding models:

    GloVe –
    How to Convert Word to Vector with GloVe and Python
    fastText –
    FastText Word Embeddings
    word2vec –
    Vector Representation of Text – Word Embeddings with word2vec
    word2vec application –
    Text Analytics Techniques with Embeddings
    Using Pretrained Word Embeddings in Machine Learning
    K Means Clustering Example with Word2Vec in Data Mining or Machine Learning

    In contrast to the last post in the above list, in this post we will discover how to do text clustering with word embeddings at the sentence (phrase) level. The sentence could be a few words, a phrase, or a paragraph like a tweet. For example, we have 1000 tweets and want to group them into several clusters, so that each cluster contains one or more tweets.


    Our data will be a set of sentences (phrases) covering 2 topics, as below:
    Note: 3 of the sentences are on the weather topic; all the other sentences have a totally different topic.
    sentences = [['this', 'is', 'the', 'one', 'good', 'machine', 'learning', 'book'],
                 ['this', 'is', 'another', 'book'],
                 ['one', 'more', 'book'],
                 ['weather', 'rain', 'snow'],
                 ['yesterday', 'weather', 'snow'],
                 ['forecast', 'tomorrow', 'rain', 'snow'],
                 ['this', 'is', 'the', 'new', 'post'],
                 ['this', 'is', 'about', 'more', 'machine', 'learning', 'post'],
                 ['and', 'this', 'is', 'the', 'one', 'last', 'post', 'book']]

    Word Embedding Method

    For embeddings we will use the gensim word2vec model. There is also the doc2vec model, but we will use it in the next post.
    Because we need to do text clustering at the sentence level, there is one extra step for moving from word level to sentence level. For each sentence in the set, the word embeddings of all its words are summed and then divided by the number of words in the sentence. So we get the average of all word embeddings for each sentence and use these averages as we would use embeddings at the word level: we feed them to a machine learning clustering algorithm such as k-means.

    Here is an example of a function that does this:

    import numpy as np
    def sent_vectorizer(sent, model):
        sent_vec = []
        numw = 0
        for w in sent:
            try:
                if numw == 0:
                    sent_vec = model[w]
                else:
                    sent_vec = np.add(sent_vec, model[w])
                numw += 1
            except KeyError:
                # skip words that are not in the model vocabulary
                pass
        return np.asarray(sent_vec) / numw
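
The averaging step can be checked by hand on a toy example. The sketch below is a dependency-free variant of the same idea, written with plain lists instead of NumPy; the 2-dimensional "embeddings" are invented for illustration and do not come from a trained model, and the function name `average_vector` is an assumption.

```python
def average_vector(sentence, embeddings):
    """Average the vectors of the words in `sentence` that have an embedding."""
    known = [embeddings[w] for w in sentence if w in embeddings]
    if not known:
        return None  # no word of the sentence is in the vocabulary
    dim = len(known[0])
    return [sum(vec[i] for vec in known) / len(known) for i in range(dim)]

# Toy 2-dimensional "embeddings", invented for this example.
toy = {"rain": [1.0, 0.0], "snow": [0.0, 1.0], "book": [4.0, 4.0]}
print(average_vector(["weather", "rain", "snow"], toy))
# [0.5, 0.5]
```

The word "weather" has no vector in the toy dictionary, so only "rain" and "snow" contribute to the average. Returning None for a sentence without any known word avoids a division by zero.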

    Now we will apply k-means text clustering, using the word2vec model for embeddings. We will use 2 separate k-means implementations from different libraries: KMeansClusterer from NLTK and cluster.KMeans from sklearn. Both were described in previous posts (see the list above).

    The code for this article can be found at the end of this post. We use 2 as the number of clusters in both k-means text clustering algorithms.
    Additionally, we will plot the data using t-SNE.


    Below are the results:

    [1, 1, 1, 0, 0, 0, 1, 1, 1]

    Cluster id and sentence:
    1:['this', 'is', 'the', 'one', 'good', 'machine', 'learning', 'book']
    1:['this', 'is', 'another', 'book']
    1:['one', 'more', 'book']
    0:['weather', 'rain', 'snow']
    0:['yesterday', 'weather', 'snow']
    0:['forecast', 'tomorrow', 'rain', 'snow']
    1:['this', 'is', 'the', 'new', 'post']
    1:['this', 'is', 'about', 'more', 'machine', 'learning', 'post']
    1:['and', 'this', 'is', 'the', 'one', 'last', 'post', 'book']

    Score (Opposite of the value of X on the K-means objective which is Sum of distances of samples to their closest cluster center):

    Cluster id and sentence:
    1 ['this', 'is', 'the', 'one', 'good', 'machine', 'learning', 'book']
    1 ['this', 'is', 'another', 'book']
    1 ['one', 'more', 'book']
    0 ['weather', 'rain', 'snow']
    0 ['yesterday', 'weather', 'snow']
    0 ['forecast', 'tomorrow', 'rain', 'snow']
    1 ['this', 'is', 'the', 'new', 'post']
    1 ['this', 'is', 'about', 'more', 'machine', 'learning', 'post']
    1 ['and', 'this', 'is', 'the', 'one', 'last', 'post', 'book']

    Results of text clustering

    We see that the data were clustered as expected: sentences on different topics were assigned to different clusters. Thus we learned how to apply clustering algorithms from data mining or machine learning with word embeddings at the sentence level. Here we used k-means clustering and the word2vec embedding model, and we created an additional function to go from word embeddings to sentence embeddings. In the next post we will use doc2vec and will not need this function.
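
    To make the score line in the results more concrete, here is a minimal sketch of how such a score can be computed by hand. It assumes, as far as I know is the case for sklearn's KMeans.score, that the score is the negative of the sum of squared distances from each sample to its closest cluster center. The function name kmeans_score and the toy points are made up for illustration.

```python
import numpy as np

def kmeans_score(points, centroids):
    """Return the negative sum of squared distances to the nearest centroid."""
    points = np.asarray(points, dtype=float)
    centroids = np.asarray(centroids, dtype=float)
    # sq_dist[i, j] = squared Euclidean distance from point i to centroid j
    diffs = points[:, None, :] - centroids[None, :, :]
    sq_dist = (diffs ** 2).sum(axis=2)
    # keep only the distance to the closest centroid for each point
    return -sq_dist.min(axis=1).sum()

# Toy 2-D data (made up): two tight groups and their centers.
points = [[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]]
centroids = [[0.0, 0.5], [10.0, 10.5]]
score = kmeans_score(points, centroids)
print(score)  # each point is 0.5 away from its center, so the score is -1.0
```

    A score closer to zero means the samples sit closer to their cluster centers.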

    Below is the full Python source code.

    from gensim.models import Word2Vec
    from nltk.cluster import KMeansClusterer
    import nltk
    import numpy as np 
    from sklearn import cluster
    from sklearn import metrics
    # training data
    sentences = [['this', 'is', 'the', 'one','good', 'machine', 'learning', 'book'],
                ['this', 'is',  'another', 'book'],
                ['one', 'more', 'book'],
                ['weather', 'rain', 'snow'],
                ['yesterday', 'weather', 'snow'],
                ['forecast', 'tomorrow', 'rain', 'snow'],
                ['this', 'is', 'the', 'new', 'post'],
                ['this', 'is', 'about', 'more', 'machine', 'learning', 'post'],  
                ['and', 'this', 'is', 'the', 'one', 'last', 'post', 'book']]
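    model = Word2Vec(sentences, min_count=1)
    def sent_vectorizer(sent, model):
        # Average the word2vec vectors of all in-vocabulary words in a sentence.
        sent_vec = []
        numw = 0
        for w in sent:
            try:
                # older gensim versions also accepted model[w]
                if numw == 0:
                    sent_vec = model.wv[w]
                else:
                    sent_vec = np.add(sent_vec, model.wv[w])
                numw += 1
            except KeyError:
                # skip words that are not in the model vocabulary
                pass
        return np.asarray(sent_vec) / numw
    # one averaged vector per sentence
    X = []
    for sentence in sentences:
        X.append(sent_vectorizer(sentence, model))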
    print ("========================")
    print (X)
    # note: with older gensim versions you would use model[model.wv.vocab]
    # (or model[model.vocab] without wv); gensim 4 renamed vocab to key_to_index
    print (model.wv[list(model.wv.key_to_index)])
    print (model.wv.similarity('post', 'book'))
    print (model.wv.most_similar(positive=['machine'], negative=[], topn=2))
    NUM_CLUSTERS = 2
    kclusterer = KMeansClusterer(NUM_CLUSTERS, distance=nltk.cluster.util.cosine_distance, repeats=25)
    assigned_clusters = kclusterer.cluster(X, assign_clusters=True)
    print (assigned_clusters)
    for index, sentence in enumerate(sentences):
        print (str(assigned_clusters[index]) + ":" + str(sentence))
    kmeans = cluster.KMeans(n_clusters=NUM_CLUSTERS)
    # fit the model before reading labels_ and cluster_centers_
    kmeans.fit(X)
    labels = kmeans.labels_
    centroids = kmeans.cluster_centers_
    print ("Cluster id labels for inputted data")
    print (labels)
    print ("Centroids data")
    print (centroids)
    print ("Score (Opposite of the value of X on the K-means objective which is Sum of distances of samples to their closest cluster center):")
    print (kmeans.score(X))
    silhouette_score = metrics.silhouette_score(X, labels, metric='euclidean')
    print ("Silhouette_score: ")
    print (silhouette_score)
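    import matplotlib.pyplot as plt
    from sklearn.manifold import TSNE
    # perplexity is set here because it must be smaller than the number of samples (9 sentences)
    tsne = TSNE(n_components=2, random_state=0, perplexity=5)
    # project the sentence vectors to 2 dimensions for plotting
    Y = tsne.fit_transform(np.array(X))
    plt.scatter(Y[:, 0], Y[:, 1], c=assigned_clusters, s=290,alpha=.5)
    for j in range(len(sentences)):
       plt.annotate(assigned_clusters[j],xy=(Y[j][0], Y[j][1]),xytext=(0,0),textcoords='offset points')
       print ("%s %s" % (assigned_clusters[j],  sentences[j]))
    plt.show()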

    Topic Modeling Python and Textacy Example

    Topic modeling is the automatic discovery of the abstract “topics” that occur in a collection of documents. [1] It can be used to provide a more informative view of search results, a quick overview of a set of documents, or other services.


    In this post we will look at topic modeling with textacy. Textacy is a Python library for performing a variety of natural language processing (NLP) tasks, built on the high-performance spaCy library.
    It can flexibly tokenize and vectorize documents and corpora, then train, interpret, and visualize topic models using LSA, LDA, or NMF methods. [2]
    Textacy is less known than other Python libraries such as NLTK, spaCy, or TextBlob [3], but it looks very promising as it is built on top of spaCy.

    In this post we will use textacy for the following task: we have a group of documents and we want to extract topics from it. We will use the 20 Newsgroups dataset as the source of documents.

    Code Structure

    Our code consists of the following steps:
    1. Get data. We will use only 2 groups ('alt.atheism', 'soc.religion.christian').
    2. Tokenize and remove unneeded characters and stopwords.
    3. Extract topics. Here we do the actual topic modeling, using the Non-negative Matrix Factorization (NMF) method.
    4. Output a graph of the term-topic matrix.
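
    To give a feeling for what the NMF step does, here is a minimal sketch using NumPy and the classic multiplicative update rules. This is not the implementation that textacy uses; the function name nmf, the terms, and the counts in the matrix are made up for illustration.

```python
import numpy as np

def nmf(V, n_topics, n_iter=200, seed=0):
    """Factor V (docs x terms) into W (docs x topics) and H (topics x terms)."""
    rng = np.random.default_rng(seed)
    # start from small positive random matrices
    W = rng.random((V.shape[0], n_topics)) + 1e-3
    H = rng.random((n_topics, V.shape[1])) + 1e-3
    eps = 1e-9  # avoids division by zero
    for _ in range(n_iter):
        # multiplicative updates keep every entry non-negative
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H

# Made-up document-term counts: 4 documents, 6 terms.
terms = ["god", "faith", "church", "atheist", "evidence", "argument"]
V = np.array([
    [3, 2, 1, 0, 0, 0],
    [2, 3, 2, 0, 0, 0],
    [0, 0, 0, 2, 3, 1],
    [0, 0, 0, 3, 1, 2],
], dtype=float)

W, H = nmf(V, n_topics=2)
# each row of H holds the term weights of one topic
for k, row in enumerate(H):
    top = [terms[i] for i in np.argsort(-row)[:3]]
    print("topic", k, top)
```

    The rows of H play the role of the term-topic matrix that is plotted at the end, and the rows of W say how strongly each document belongs to each topic.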


    Below is the final output plot.

    Topic modeling with textacy

    Looking at the output graph we can see the term distribution over the topics. We identified more than 2 topics. For example, topic 2 is associated with atheism, while topic 1 is associated with God and religion.

    While better data preparation is needed to remove a few more non-meaningful words, the example still shows that topic modeling with textacy is much easier than with some other modules (for example gensim). This is because textacy can do many of the things you need after the core NLP step, such as additional data views, heatmaps, or diagrams, rather than only doing the NLP and leaving the rest to the user.

    Here are a few links on topic modeling using LDA and gensim (not textacy). The posts demonstrate that more coding is required compared with textacy.
    Topic Extraction from Blog Posts with LSI , LDA and Python
    Data Visualization – Visualizing an LDA Model using Python

    Source Code

    Below is the full Python source code.

    categories = ['alt.atheism', 'soc.religion.christian'] 
    #Loading the data set - training data.
    from sklearn.datasets import fetch_20newsgroups
    newsgroups_train = fetch_20newsgroups(subset='train', shuffle=True, categories=categories, remove=('headers', 'footers', 'quotes'))
    # You can check the target names (categories) and some data files by following commands.
    print (newsgroups_train.target_names) #prints all the categories
    print("\n".join(newsgroups_train.data[0].split("\n")[:3])) #prints the first 3 lines of the first data file
    print (newsgroups_train.target_names)
    print (len(newsgroups_train.data))
    texts = []
    texts = newsgroups_train.data
    from nltk.corpus import stopwords
    import textacy
    from textacy.vsm import Vectorizer
    # build the stop word set once instead of on every token
    stop_words = set(stopwords.words('english'))
    # extra tokens to drop: quoting marks, separators, and a few capitalized stop words
    unwanted = {"|>", "_", "-", "#", "=", ":", "_/", "I", "A", "The", "But", "If", "It"}
    # filtering in one pass avoids removing items from a list while iterating over it
    terms_list = [[tok for tok in doc.split() if tok not in stop_words and tok not in unwanted] for doc in texts]
    print ("=====================terms_list===============================")
    print (terms_list)
    vectorizer = Vectorizer(tf_type='linear', apply_idf=True, idf_type='smooth')
    doc_term_matrix = vectorizer.fit_transform(terms_list)
    print ("========================doc_term_matrix)=======================")
    print (doc_term_matrix)
    #initialize and train a topic model:
    model = textacy.tm.TopicModel('nmf', n_topics=20)
    model.fit(doc_term_matrix)
    print ("======================model=================")
    print (model)
    doc_topic_matrix = model.transform(doc_term_matrix)
    for topic_idx, top_terms in model.top_topic_terms(vectorizer.id_to_term, topics=[0,1]):
              print('topic', topic_idx, ':', '   '.join(top_terms))
    for i, val in enumerate(model.topic_weights(doc_topic_matrix)):
         print(i, val)
    print   ("doc_term_matrix")     
    print   (doc_term_matrix)   
    print ("vectorizer.id_to_term")
    print (vectorizer.id_to_term)
    model.termite_plot(doc_term_matrix, vectorizer.id_to_term, topics=-1,  n_terms=25, sort_terms_by='seriation')  

    1. Topic Model
    2. textacy: NLP, before and after spaCy
    3. 5 Heroic Python NLP Libraries

    Text Mining Techniques for Search Results Clustering

    A text search box can be found in almost every web-based application that has text data. We use the search feature when we are looking for customer data, job descriptions, book reviews, or other information. Simple keyword matching can be enough for some small tasks. However, when there are many results, something better than keyword matching would be very helpful. Instead of going through a long list of results, we would get results grouped by topic with a short summary of each topic, which makes it possible to grasp the information at a glance.

    In this post we will look at some machine learning algorithms, applications, and frameworks that can analyze the output of a search function and provide useful additional information about the search results.

    Machine Learning Clustering for Search Results

    The search results clustering problem is defined as the automatic, on-line grouping of similar documents in a search results list returned from a search engine. [1] Carrot2 is a tool that was built to solve this problem.
    Carrot2 is an open source framework for building a search results clustering engine. This tool can search, cluster, and visualize the clusters, which is very cool. I was not able to find a similar tool among open source projects. If you are aware of one, please suggest it in the comment box.

    Below are screenshots of clustering search results with Carrot2:

    Clustering search results with Carrot2
    Aduna cluster map visualization clusters with Carrot2

    The following algorithms are behind the Carrot2 tool:
    The Lingo algorithm constructs a “term-document matrix” where each snippet gets a column, each word gets a row, and the values are the frequency of that word in that snippet. It then applies a matrix factorization called singular value decomposition (SVD). [3]

    Suffix Tree Clustering (STC) uses the generalised suffix tree data structure to efficiently build a list of the most frequently used phrases in the snippets from the search results. [3]
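
    The first step of Lingo described above (term-document matrix followed by SVD) can be sketched with NumPy as below. This only illustrates the idea and is not Carrot2's implementation; the snippets are made up, and the real algorithm does much more, such as extracting phrases and labeling the clusters.

```python
import numpy as np

# Made-up search result snippets.
snippets = [
    "machine learning book",
    "deep learning book",
    "rain and snow forecast",
    "snow forecast tomorrow",
]

# Vocabulary: one row per distinct word.
vocab = sorted({word for s in snippets for word in s.split()})
row = {word: i for i, word in enumerate(vocab)}

# Term-document matrix: rows are words, columns are snippets, values are counts.
A = np.zeros((len(vocab), len(snippets)))
for col, s in enumerate(snippets):
    for word in s.split():
        A[row[word], col] += 1

# Singular value decomposition: A = U * diag(S) * Vt.
U, S, Vt = np.linalg.svd(A, full_matrices=False)

# Each column of U is a direction in word space; show the strongest words of the top two.
for k in range(2):
    top = np.argsort(-np.abs(U[:, k]))[:3]
    print("component", k, [vocab[i] for i in top])
```

    The strongest components tend to group words that occur together in the same snippets, which is what makes them useful as candidate cluster topics.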

    Topic modelling

    Topic modelling is another approach used to identify which topics are discussed in the documents or text snippets returned by a search function. There are several methods, such as LSA, pLSA, and LDA. [11]

    A comprehensive overview of topic modeling and its associated techniques is given in [12].

    Topic modeling can be represented by the diagram below. Our goal is to identify the topics, given documents with their words.

    Topic modeling diagram

    Below is the plate notation of the LDA model.

    Plate notation of LDA model

    Plate notation representing the LDA model. [19]
    α (alpha) is the parameter of the Dirichlet prior on the per-document topic distributions,
    β (beta) is the parameter of the Dirichlet prior on the per-topic word distribution,
    p is the topic distribution for document m,
    Z is the topic for the n-th word in document m, and
    W is the specific word.
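
    The plate notation above describes how LDA assumes documents are generated. Here is a minimal sketch of that generative process with NumPy, using the same names as in the legend (alpha, beta, p, Z, W). The vocabulary, the number of topics, and the prior values are made up for illustration, and this only samples documents; it does not fit a model.

```python
import numpy as np

rng = np.random.default_rng(0)

vocab = ["book", "learning", "rain", "snow"]   # toy vocabulary (assumption)
n_topics = 2
alpha = np.full(n_topics, 0.5)     # Dirichlet prior on per-document topic distributions
beta = np.full(len(vocab), 0.5)    # Dirichlet prior on per-topic word distributions

# One word distribution per topic, drawn from Dirichlet(beta).
topic_word = rng.dirichlet(beta, size=n_topics)

def generate_document(n_words):
    # p: topic distribution for this document, drawn from Dirichlet(alpha)
    p = rng.dirichlet(alpha)
    words = []
    for _ in range(n_words):
        z = rng.choice(n_topics, p=p)                 # Z: topic of this word
        w = rng.choice(len(vocab), p=topic_word[z])   # W: the word itself
        words.append(vocab[w])
    return p, words

p, words = generate_document(8)
print(p, words)
```

    Fitting LDA goes in the opposite direction: given only the words, it estimates the topic distributions that most plausibly generated them.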

    We can use different NLP libraries (NLTK, spaCy, gensim, textacy) for topic modeling.
    Here is an example of topic modeling with the textacy Python library:
    Topic Modeling Python and Textacy Example

    Here are examples of topic modeling with the gensim library:
    Topic Extraction from Blog Posts with LSI , LDA and Python
    Data Visualization – Visualizing an LDA Model using Python

    Using Word Embeddings

    Word embeddings such as word2vec and GloVe (available, for example, through the gensim library) have shown very good results in NLP and are now widely used. They are also used for search results clustering. The first step is to create a model, for example with gensim. In the next step the text data are converted to a vector representation. Word embeddings improve performance by leveraging information on how words are semantically related to each other. [7][10]
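
    As a rough sketch of the second step (converting text to vectors and comparing them), the snippet below averages word vectors per search result and compares the results with cosine similarity. The 3-dimensional vectors in toy_vectors are hypothetical; in practice they would come from a trained model such as word2vec.

```python
import numpy as np

# Hypothetical 3-dimensional word vectors; a trained model would supply real ones.
toy_vectors = {
    "python": np.array([0.9, 0.1, 0.0]),
    "code": np.array([0.8, 0.2, 0.1]),
    "snake": np.array([0.1, 0.9, 0.2]),
    "reptile": np.array([0.0, 0.8, 0.3]),
}

def snippet_vector(text):
    """Average the vectors of the known words in a snippet."""
    vecs = [toy_vectors[w] for w in text.lower().split() if w in toy_vectors]
    return np.mean(vecs, axis=0)

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

results = ["python code", "snake reptile", "code python"]
vectors = [snippet_vector(r) for r in results]
same_topic = cosine(vectors[0], vectors[2])   # same words, so close to 1.0
other_topic = cosine(vectors[0], vectors[1])  # different topic, clearly lower
print(same_topic, other_topic)
```

    The resulting vectors can then be passed to a clustering algorithm such as k-means to group the search results.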

    Neural Topic Model (NTM) and Other Approaches

    Below are some other approaches that can be used for topic modeling when organizing search results.
    Neural topic modeling combines a neural network with a latent topic model. [14]
    Topic modeling with Deep Belief Nets is described in [17]. The idea of the method is to load a bag-of-words (BOW) representation and produce a strong latent representation that is then used for a content-based recommender system. The authors report that the model outperforms LDA, Replicated Softmax, and DocNADE models on document retrieval and document classification tasks.

    Thus we looked at different techniques for search results clustering. In future posts we will implement some of them. What machine learning methods do you use for presenting search results? I would love to hear.

    1. Lingo Search Results Clustering Algorithm
    2. Carrot2 Algorithms
    3. Carrot2
    4. Apache SOLR and Carrot2 integration strategies
    5. Topical Clustering of Search Results
    6. K-means clustering for text dataset
    7. Document Clustering using Doc2Vec/word2vec
    8. Automatic Topic Clustering Using Doc2Vec
    9. Search Results Clustering Algorithm
    10. LDA2vec: Word Embeddings in Topic Models
    11. Topic Modelling in Python with NLTK and Gensim
    12. Topic Modeling with LSA, PLSA, LDA & lda2Vec
    13. Text Summarization with Amazon Reviews
    14. A Hybrid Neural Network-Latent Topic Model
    15. docluster
    16. Deep Belief Nets for Topic Modeling
    17. Modeling Documents with a Deep Boltzmann Machine
    18. Beginners guide to topic modeling in python
    19. Latent Dirichlet allocation

    Text Classification of Different Datasets with CNN Convolutional Neural Network and Python

    In this post we explore machine learning text classification of 3 text datasets using a CNN (Convolutional Neural Network) in Keras and Python. As reported in papers and blogs across the web, convolutional neural networks give good results in text classification.


    We will use the following datasets:
    1. The 20 newsgroups text dataset, which is available from scikit-learn here.
    2. A dataset of web pages. The web documents were downloaded manually from the web and belong to two categories: text mining or hidden Markov models (HMM). This is a small dataset that consists of only 20 pages for text mining and 11 pages for the HMM group.
    3. A dataset of tweets about New Year's resolutions, obtained from data.world/crowdflower here.

    Convolutional Neural Network Architecture

    Our CNN is based on Richard Liao's code from [1], [2]. We use a convolutional neural network built from layers such as Embedding, Conv1D, Flatten, and Dense. For the embedding we use pretrained GloVe vectors, which can be downloaded from the web.

    The data flow diagram with layers used is shown below.

    CNN diagram

    Here is the code for obtaining a convolutional neural net diagram like this. Insert it after the model.fit (…) line. It requires pydot and graphviz to be installed.

    import pydot
    pydot.find_graphviz = lambda: True
    print (pydot.find_graphviz())
    import os
    os.environ["PATH"] += os.pathsep + "C:\\Program Files (x86)\\Graphviz2.38\\bin"
    from keras.utils import plot_model
    plot_model(model, to_file='model.png')

    1D Convolution

    In our neural net, convolution is performed in several 1-dimensional convolution layers (Conv1D).
    1D convolution means that the filter slides along just one direction to calculate the convolution. [3]
    For example:
    input = [1,1,1,1,1], filter = [0.25,0.5,0.25], output = [1,1,1] when the filter is applied only where it fully overlaps the input (padding the input with zeros keeps the length at 5 but gives 0.75 at both ends)
    The output shape is a 1D array.
    We can also apply 1D convolution to a 2D data matrix, as we do in text classification.
    A good explanation of convolution in text can be found in [6].
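
    The convolution example above can be checked with a minimal NumPy sketch. It shows the result both without padding and with zero padding, since how the borders are handled changes the values at the edges. The helper conv1d_text and its toy sentence matrix are made up for illustration; they mimic sliding a filter along the word axis of a (words x embedding size) matrix, without flipping the filter, which as far as I know is also how Keras Conv1D works.

```python
import numpy as np

signal = np.array([1, 1, 1, 1, 1], dtype=float)
kernel = np.array([0.25, 0.5, 0.25])

# No padding: the filter only visits positions where it fits entirely.
valid = np.convolve(signal, kernel, mode="valid")
# Zero padding: the output keeps the input length, but the borders see zeros.
same = np.convolve(signal, kernel, mode="same")
print(valid)  # three values, all 1.0
print(same)   # 0.75 at both ends, 1.0 in the middle

def conv1d_text(matrix, filt):
    """Slide a (width x dim) filter along the word axis of a (words x dim) matrix."""
    width = filt.shape[0]
    steps = matrix.shape[0] - width + 1
    # one output value per window of `width` consecutive words
    return np.array([np.sum(matrix[i:i + width] * filt) for i in range(steps)])

# Toy "sentence": 6 words, each with a 2-dimensional embedding.
sentence = np.arange(12, dtype=float).reshape(6, 2)
filt = np.ones((3, 2))  # filter covering 3 words at a time
features = conv1d_text(sentence, filt)
print(features.shape)  # (4,)
```

    Even though the input is a 2D matrix, the filter moves in one direction only (from word to word), so the output is still a 1D array.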

    Text Classification of 20 Newsgroups Text Dataset

    For this dataset we use only 2 categories. The script is provided here. The accuracy of the network is 87%. Trained on 864 samples, validated on 215 samples.
    Summary of run: loss: 0.6205 – acc: 0.6632 – val_loss: 0.5122 – val_acc: 0.8651

    Document classification of Web Pages.

    Here we also use 2 categories. The Python script is provided here.

    The web pages were manually downloaded from the web and saved locally in two folders, one for each category. The script loads the web page files from local storage. Next is a preprocessing step that removes the HTML tags but keeps the text content. Here is the function for this:

    from bs4 import BeautifulSoup

    def get_only_text_from_html_doc(page):
        """Return the title and the text of the article."""
        soup = BeautifulSoup(page, "lxml")
        # join the text of all paragraph tags
        text = ' '.join(map(lambda p: p.text, soup.find_all('p')))
        return soup.title.text + " " + text

    Accuracy on this dataset was 100% but was not consistent; in some other runs the result was only 83%.
    Trained on 25 samples, validated on 6 samples.
    Summary of run – loss: 0.0096 – acc: 1.0000 – val_loss: 0.0870 – val_acc: 1.0000

    Text Classification of Tweet Dataset

    The script is provided here.
    Here the accuracy was 93%. Trained on 4010 samples, validated on 1002 samples.
    Summary of run – loss: 0.0193 – acc: 0.9958 – val_loss: 0.6690 – val_acc: 0.9281.


    We learned how to do text classification for 3 different types of text datasets (newsgroups, tweets, web documents). For text classification we used a convolutional neural network in Python, and on all 3 datasets we got good accuracy.


    1. Text Classification, Part I – Convolutional Networks
    2. textClassifierConv
    3. What do you mean by 1D, 2D and 3D Convolutions in CNN?
    4. How to implement Sentiment Analysis using word embedding and Convolutional Neural Networks on Keras.
    5. Understanding Convolutional Neural Networks for NLP
    6. Understanding Convolutions in Text
    7. Recurrent Neural Networks I

    Automatic Text Summarization Online

    In the previous post, Automatic Text Summarization with Python, I showed how to use different Python libraries for text summarization. Recently I added text summarization modules to the online site Online Machine Learning Algorithms. So now you can play with text summarization modules online and select the best summary generator. This service is a free tool that allows you to run some algorithms without coding or installing software modules.

    Below are the steps for using the online text summarizer models of the Machine Learning Algorithms tool.

    How to use online text summarizer algorithms

    1. Open the Online Machine Learning Algorithms tool.
    Select the text summarization algorithm that you want to run. There is one available with gensim and 3 with the sumy Python module. We will use the Luhn text summarizer algorithm. The algorithms from the gensim and sumy Python modules are still widely used in automatic text summarization, which is part of the field of natural language processing.

    Running online text summarization step1

    2. Input the data that you want to process, or click Load Default Values. Note that you need to enter at least about 10 sentences. It will not work if you enter just a few words or a single sentence.

    Running online text summarization step2

    3. Click Run now.

    4. Click View Run Results link.

    Running online text summarization -  example of output

    5. Click the Refresh Page button on this new page. You may need to click it a few times until the output shows up. Usually it takes less than 1 minute, but it depends on how much data you need to process.
    Scroll to the bottom of the page to see the results.

    If you try the other text summarizers from this online tool, you will see that there are some differences in the generated summaries.
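
    For readers curious about what an extractive summarizer such as Luhn does behind the scenes, here is a heavily simplified sketch in plain Python. It is not the sumy implementation: the stop word list, the min_count threshold, and the scoring (squared number of significant words divided by the span they cover in the sentence) are simplifying assumptions made for illustration.

```python
import re
from collections import Counter

# Tiny stop word list, for illustration only.
STOPWORDS = {"the", "a", "an", "is", "are", "of", "to", "and", "in",
             "it", "that", "this", "for", "on", "with"}

def split_sentences(text):
    """Split on sentence-ending punctuation followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def words_of(sentence):
    return re.findall(r"[a-z']+", sentence.lower())

def summarize(text, n_sentences=2, min_count=2):
    sentences = split_sentences(text)
    freq = Counter(w for s in sentences for w in words_of(s) if w not in STOPWORDS)
    # words that occur often enough are treated as significant
    significant = {w for w, c in freq.items() if c >= min_count}

    def score(sentence):
        words = words_of(sentence)
        positions = [i for i, w in enumerate(words) if w in significant]
        if not positions:
            return 0.0
        span = positions[-1] - positions[0] + 1
        return len(positions) ** 2 / span

    ranked = sorted(range(len(sentences)), key=lambda i: score(sentences[i]), reverse=True)
    keep = sorted(ranked[:n_sentences])  # restore the original sentence order
    return [sentences[i] for i in keep]

text = (
    "Text summarization shortens a document. "
    "Extractive summarization selects sentences from the document. "
    "The weather was nice yesterday. "
    "Selected sentences form the summary of the document."
)
summary = summarize(text)
print(summary)
```

    On this toy text the off-topic weather sentence gets a score of zero and is left out of the summary, because it contains none of the frequent words.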

    End Notes

    In this post we covered how to use the online text summarizer models of the Machine Learning Algorithms tool, available here. You can run online algorithms from the gensim and sumy Python modules.
    Feel free to provide comments or suggestions.

    Document Similarity, Tokenization and Word Vectors in Python with spaCy

    Calculating document similarity is a very frequent task in information retrieval and text mining. Years ago we would need to build a document-term matrix or term-document matrix that describes the frequency of terms occurring in a collection of documents, and then do word vector math to find similarity. Now, with spaCy, it can be done in just a few lines. Below you will find how to get document similarity, tokenization, and word vectors with spaCy.

    spaCy is an open-source library designed to help you build NLP applications. It has a lot of features; in this post we will look at only a few, but very useful ones.

    Document Similarity

    Here is how to get document similarity:

    import spacy
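    # 'en' is the model shortcut from older spaCy releases; spaCy 3 and later need a full model name such as 'en_core_web_md'
    nlp = spacy.load('en')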
    doc1 = nlp(u'Hello this is document similarity calculation')
    doc2 = nlp(u'Hello this is python similarity calculation')
    doc3 = nlp(u'Hi there')
    print (doc1.similarity(doc2)) 
    print (doc2.similarity(doc3)) 
    print (doc1.similarity(doc3))  
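
    For comparison, the older document-term approach mentioned at the start of this post can be sketched in plain Python as below. It counts words per document and compares the count vectors with cosine similarity. This is only an illustration on the same three toy sentences; the numbers will differ from spaCy's, which compares word vectors rather than raw counts.

```python
import math
from collections import Counter

docs = [
    "Hello this is document similarity calculation",
    "Hello this is python similarity calculation",
    "Hi there",
]

# Document-term counts: one Counter per document.
counts = [Counter(d.lower().split()) for d in docs]

def cosine(c1, c2):
    """Cosine similarity between two term-count vectors."""
    dot = sum(c1[t] * c2[t] for t in c1)
    norm1 = math.sqrt(sum(v * v for v in c1.values()))
    norm2 = math.sqrt(sum(v * v for v in c2.values()))
    return dot / (norm1 * norm2)

sim_12 = cosine(counts[0], counts[1])  # 5 of 6 words shared
sim_13 = cosine(counts[0], counts[2])  # no shared words
print(sim_12, sim_13)
```

    With raw counts, two documents that use different words for the same idea get a similarity of zero, which is one reason why word vectors are attractive.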

    In more realistic situations we would load documents from files and would have longer text. Here is an experiment that I performed. I saved 3 articles from different random sites: two about deep learning and one about feature engineering.

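    def get_file_contents(filename):
      with open(filename, 'r') as filehandle:
        filecontent = filehandle.read()
        return (filecontent)
    # hypothetical file names; point these at the saved articles
    fn1_doc = get_file_contents("deep_learning_1.txt")
    fn2_doc = get_file_contents("feature_engineering.txt")
    fn3_doc = get_file_contents("deep_learning_2.txt")
    print (fn1_doc)
    print (fn2_doc)
    print (fn3_doc)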
    doc1 = nlp(fn1_doc)
    doc2 = nlp(fn2_doc)
    doc3 = nlp(fn3_doc)
    print ("dl1 - features")
    print (doc1.similarity(doc2)) 
    print ("feature - dl")
    print (doc2.similarity(doc3)) 
    print ("dl1 - dl")
    print (doc1.similarity(doc3)) 
    dl1 - features
    feature - dl
    dl1 - dl

    It was able to assign a higher similarity score to documents with similar topics!


    Another very useful and simple feature of spaCy is tokenization. Here is how easy it is to convert text into tokens (words):

    for token in doc1:
        print (token.text)

    Word Vectors

    spaCy has integrated word vector support, while other libraries like NLTK do not. The lines below print word embeddings: an array of 768 numbers in my environment.

    print (token.vector)   #-  prints word vector form of token. 
    print (doc1[0].vector) #- prints word vector form of first token of document.
    print (doc1.vector)    #- prints mean vector form for doc1

    So we looked at how to use a few features (similarity, tokenization, and word embeddings) that are very easy to use with spaCy. I hope you enjoyed this post. If you have any tips or anything else to add, please leave a comment below.
