7+ Best Online Resources for Text Preprocessing for Machine Learning Algorithms

With advances in machine learning and natural language processing, and with the growing amount of information available on the web, text data is increasingly used in machine learning algorithms. An important step in using text data is preprocessing the raw text. The data preparation steps may include the following:

  • Tokenization
  • Removing punctuation
  • Removing stop words
  • Stemming
  • Word Embedding
  • Named-entity recognition (NER)
  • Coreference resolution – finding all expressions that refer to the same entity in a text
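
The first three steps in the list above can be sketched with nothing but the Python standard library. This is only a minimal illustration: the stop-word list `STOP_WORDS` and the function names are made up for this example, and a real project would more likely rely on a library such as NLTK or spaCy.

```python
import re

# Hypothetical, deliberately tiny stop-word list for illustration only.
STOP_WORDS = {"the", "is", "a", "an", "and", "in", "that", "was", "i", "very"}

def tokenize(text):
    """Lowercase the text and split it into word and punctuation tokens."""
    return re.findall(r"\w+|[^\w\s]", text.lower())

def remove_punctuation(tokens):
    """Keep only tokens that contain at least one letter or digit."""
    return [t for t in tokens if any(ch.isalnum() for ch in t)]

def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Drop tokens that appear in the stop-word list."""
    return [t for t in tokens if t not in stop_words]

tokens = tokenize("The city is nice, and the house is very small!")
print(remove_stop_words(remove_punctuation(tokens)))
# ['city', 'nice', 'house', 'small']
```

Keeping each step in its own function makes it easy to switch a step off or to swap in a library implementation later.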

Recently published articles on this topic have greatly expanded the available examples of text preprocessing operations. In this post we collect and review online articles that describe text preprocessing techniques with Python code examples.

1. textcleaner

Text-Cleaner is a utility library for text-data pre-processing. It can be used before passing the text data to a model. textcleaner builds on open source tools such as NLTK for advanced cleaning and regular expressions (regex) for pattern-based cleaning.

Features:

  • main_cleaner does all the below in one call
  • remove unnecessary blank lines
  • convert all characters to lowercase if needed
  • remove numbers, particular characters (if needed), symbols and stop-words from the whole text
  • tokenize the text-data in one call
  • stemming & lemmatization powered by NLTK

textcleaner saves time by providing basic cleaning functionality, which lets the developer focus on building the machine learning model. The nice thing is that it can do many text processing steps in one call.

    Here is an example of how to use it:

    import textcleaner as tc

    # Path to the raw input text file
    f = "C:\\textinputdata.txt"
    # main_cleaner runs all the cleaning steps in one call
    out = tc.main_cleaner(f)
    print(out)
    
    """
    input text:
    The house235 is very small!!
    the city is nice.
    I was in that city 10 days ago.
    The city2 is big.
    
    
    output text:
    [['hous', 'small'], ['citi', 'nice'], ['citi', 'day', 'ago'], ['citi', 'big']]
    """
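
To get a feeling for what the line-level cleaning features involve, here is a rough standard-library sketch. It is not textcleaner's actual implementation, and the function name `clean_lines` is made up for this example. It only drops blank lines, lowercases, and strips digits and symbols. It does no stop-word removal or stemming, so its output differs from the library output shown above.

```python
import re

def clean_lines(raw_text):
    """Return lowercased lines with blank lines, digits and symbols removed."""
    cleaned = []
    for line in raw_text.splitlines():
        line = line.lower()
        # Keep ASCII letters and whitespace only (this also drops accented letters).
        line = re.sub(r"[^a-z\s]", "", line)
        # Collapse runs of whitespace left behind by the removals.
        line = re.sub(r"\s+", " ", line).strip()
        if line:  # skip lines that ended up empty
            cleaned.append(line)
    return cleaned

sample = "The house235 is very small!!\n\n\nthe city is nice.\nThe city2 is big."
print(clean_lines(sample))
# ['the house is very small', 'the city is nice', 'the city is big']
```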
    

    2. Guide for Text Preprocessing from Analytics Vidhya

    Analytics Vidhya regularly publishes great practical resources about AI, ML and analytics. In this ‘Ultimate guide to deal with Text Data’ you can find descriptions of text preprocessing steps with Python code. Different Python libraries are used for the text preprocessing tasks:
    NLTK – for stop list, stemming

    TextBlob – for spelling correction, tokenization, lemmatization. TextBlob is a Python library for processing textual data. It provides a simple API for diving into common natural language processing (NLP) tasks such as part-of-speech tagging, noun phrase extraction, sentiment analysis, classification, translation, and more.

    gensim – for word embeddings

    sklearn – for feature_extraction with TF-IDF

    The guide covers text processing steps from basic to advanced.
    Basic steps:

    • Lower casing
    • Punctuation, stopwords, frequent and rare words removal
    • Spelling correction
    • Tokenization
    • Stemming
    • Lemmatization

    Advanced text processing:

    • N-grams
    • Term Frequency (TF), Inverse Document Frequency (IDF)
    • Term Frequency-Inverse Document Frequency (TF-IDF)
    • Bag of Words
    • Sentiment Analysis
    • Word Embedding
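
Bag of words and TF-IDF from the list above are easy to illustrate by hand. The sketch below uses one common textbook variant, tf = count / document length and idf = log(N / document frequency). The function names are made up for this example. Note that sklearn's TfidfVectorizer applies smoothing and normalization by default, so its numbers will not match these.

```python
import math
from collections import Counter

def bag_of_words(tokens):
    """Count how often each token occurs in one document."""
    return Counter(tokens)

def tf_idf(documents):
    """documents: list of token lists. Returns one {term: weight} dict per document."""
    n_docs = len(documents)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for doc in documents for term in set(doc))
    weights = []
    for doc in documents:
        counts = bag_of_words(doc)
        weights.append({
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights

docs = [["city", "nice"], ["city", "big"], ["house", "small"]]
for doc_weights in tf_idf(docs):
    print(doc_weights)
```

In this variant a term that appears in every document gets a weight of zero, which is exactly the point of the IDF factor: such a term does not help to tell documents apart.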

    3. Guide to Natural Language Processing  

    Often we extract text data from the web and need to strip out the HTML before feeding it to ML algorithms.
    Dipanjan (DJ) Sarkar shows how to do this in his post ‘A Practitioner’s Guide to Natural Language Processing (Part I) — Processing & Understanding Text’.

    Here we can find a project that downloads HTML with the BeautifulSoup Python library, extracts the useful text from it, and performs part-of-speech analysis, sentiment analysis and NER.
    The post uses the following text processing Python libraries for machine learning:
    spacy – spaCy now features new neural models for tagging, parsing and entity recognition (in v2.0)
    nltk – leading platform for building Python programs for natural language processing.

    Basic text preprocessing steps covered:

    • Removing HTML tags
    • Removing accented characters, Special Characters, Stopwords
    • Expanding Contractions
    • Stemming
    • Lemmatization
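
Two of the steps above, removing HTML tags and removing accented characters, can be sketched with the standard library alone. The guide itself uses BeautifulSoup for HTML. The version below swaps in Python's built-in html.parser instead, and the class and function names are made up for this example.

```python
import unicodedata
from html.parser import HTMLParser

class _TextExtractor(HTMLParser):
    """Collects text nodes, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.parts.append(data)

def strip_html(html):
    """Return the visible text of an HTML fragment with whitespace collapsed."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    # Text nodes are joined with spaces, so a tag inside a word splits that word.
    return " ".join(" ".join(parser.parts).split())

def remove_accents(text):
    """Decompose characters and drop the combining accent marks."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

print(strip_html("<p>Hello <b>world</b></p><script>var x = 1;</script>"))
# Hello world
print(remove_accents("Café crème"))
# Cafe creme
```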

    In addition to the basic steps above, the guide also covers parsing techniques for understanding the structure and syntax of language, including:

    • Parts of Speech (POS) Tagging
    • Shallow Parsing or Chunking
    • Constituency Parsing
    • Dependency Parsing
    • Named Entity Recognition

    4. Natural Language Processing

    In the article ‘Natural Language Processing is Fun’ you will find descriptions of the following text pre-processing steps:

    • Sentence Segmentation
    • Word Tokenization
    • Predicting Parts of Speech for Each Token
    • Text Lemmatization
    • Identifying Stop Words
    • Dependency Parsing
    • Named Entity Recognition (NER)
    • Coreference Resolution

    The article explains thoroughly how computers understand textual data by dividing text processing into the above steps. Diagrams make the concepts easy to understand. The steps above constitute a natural language processing pipeline, and it turns out that with spaCy you can do most of them in only a few lines.

    Here is an example of using spaCy:

    import spacy

    # Load the large English NLP model
    nlp = spacy.load('en_core_web_lg')

    f = "C:\\Users\\pythonrunfiles\\textinputdata.txt"

    # Read the whole input file into one string
    with open(f) as ftxt:
        text = ftxt.read()

    print(text)

    # Parse the text with spaCy.
    doc = nlp(text)

    # Print each token on its own line
    for token in doc:
        print(token.text)

    # Print the linguistic attributes of each token
    for token in doc:
        print(token.text, token.lemma_, token.pos_, token.tag_, token.dep_,
              token.shape_, token.is_alpha, token.is_stop)

    # Print the named entities with their character offsets and labels
    for ent in doc.ents:
        print(ent.text, ent.start_char, ent.end_char, ent.label_)
    
    
    
    Partial output of the above program:
    ....
    I
    was
    in
    that
    city
    10
    days
    ago
    .
    ....
    I -PRON- PRON PRP nsubj X True False
    was be VERB VBD ROOT xxx True False
    in in ADP IN prep xx True False
    that that DET DT det xxxx True False
    city city NOUN NN pobj xxxx True False
    10 10 NUM CD nummod dd False False
    days day NOUN NNS npadvmod xxxx True False
    ago ago ADV RB advmod xxx True False
    . . PUNCT . punct . False False
    ....
    10 days ago 66 77 DATE
    

    5. Learning from Text Summarization Project

    The project ‘Text Summarization with Amazon Reviews’ works with reviews about food, and its first part contains text preprocessing steps. These include converting text to lowercase, replacing contractions with their longer forms, and removing unwanted characters.

    To expand contractions, the author uses a list of contractions from Stack Overflow:
    http://stackoverflow.com/questions/19790188/expanding-english-language-contractions-in-python
    Using the list and the code from this link, we can replace, for example:
    you’ve with you have
    she’s with she is
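
A lookup-based replacement like this could look roughly as follows. The mapping `CONTRACTIONS` is a hypothetical, heavily shortened stand-in for the full list behind the link, and `expand_contractions` is a name made up for this example.

```python
import re

# Hypothetical, abbreviated mapping; a real list would be much longer.
CONTRACTIONS = {
    "you've": "you have",
    "she's": "she is",  # could also mean "she has"; a lookup table cannot tell
    "can't": "cannot",
    "won't": "will not",
}

_PATTERN = re.compile(
    r"\b(" + "|".join(re.escape(c) for c in CONTRACTIONS) + r")\b",
    flags=re.IGNORECASE,
)

def expand_contractions(text):
    """Replace known contractions with their longer forms."""
    text = text.replace("\u2019", "'")  # normalize typographic apostrophes
    # The replacement is always lowercase, so "She's" becomes "she is".
    return _PATTERN.sub(lambda m: CONTRACTIONS[m.group(0).lower()], text)

print(expand_contractions("She’s sure you’ve seen it."))
# she is sure you have seen it.
```

Since the preprocessing in this project lowercases the text anyway, losing the capital letter is usually not a problem here.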

    6. Text Preprocessing Methods for Deep Learning

    This is a primer on word2vec embeddings, but it also covers basic preprocessing techniques for text data, such as:

    • Cleaning Special Characters and Removing Punctuations
    • Cleaning Numbers
    • Removing Misspells
    • Removing Contractions
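
The first three items can be sketched as small regular-expression helpers. This is one possible approach and not necessarily the one used in the article. The lookup table `MISSPELLINGS` and all function names are made up for this example, and replacing digits with `#` is an assumption about how numbers are best normalized.

```python
import re

# Hypothetical lookup of misspellings; a real table is built from the corpus.
MISSPELLINGS = {"recieve": "receive", "teh": "the", "definately": "definitely"}

def strip_special_characters(text):
    """Replace every character that is not a letter, digit or space with a space."""
    return re.sub(r"[^\w\s]", " ", text)

def mask_numbers(text):
    """Replace every digit with '#', so 2019 and 1987 both become ####."""
    return re.sub(r"\d", "#", text)

def fix_misspellings(text, table=MISSPELLINGS):
    """Replace whole words found in the misspelling table."""
    pattern = re.compile(r"\b(" + "|".join(map(re.escape, table)) + r")\b")
    return pattern.sub(lambda m: table[m.group(0)], text)

def clean(text):
    text = fix_misspellings(text.lower())
    # Strip symbols before masking, otherwise the '#' marks would be removed too.
    text = strip_special_characters(text)
    text = mask_numbers(text)
    return " ".join(text.split())

print(clean("I definately paid $25 in 2019!"))
# i definitely paid ## in ####
```

Masking digits keeps the information that a number was present while collapsing the many different numbers into a handful of tokens.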

    7. Text Preprocessing in Python

    This is another great resource about text preprocessing steps with Python. In addition to the basic steps, it shows how to do collocation extraction, relationship extraction and NER. The article has many links to other articles on text preprocessing techniques.

    The article also compares many natural language processing toolkits, such as NLTK and spaCy, by features, programming language and license. The comparison table links to each toolkit's project page. This makes it a handy place to find descriptions of text processing steps, the tools used, usage examples and links to many other resources.
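
As a taste of collocation extraction, the simplest possible approach is to count adjacent word pairs and keep the frequent ones. This is only a frequency-based sketch with made-up function names. Toolkits such as NLTK rank candidates with association measures like pointwise mutual information instead of raw counts.

```python
from collections import Counter

def ngrams(tokens, n=2):
    """Return all runs of n consecutive tokens as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def frequent_bigrams(tokens, min_count=2):
    """Return (bigram, count) pairs seen at least min_count times, most frequent first."""
    counts = Counter(ngrams(tokens, 2))
    return [(bigram, c) for bigram, c in counts.most_common() if c >= min_count]

tokens = "new york is big and new york is busy".split()
print(frequent_bigrams(tokens))
# [(('new', 'york'), 2), (('york', 'is'), 2)]
```

The second result shows the weakness of raw counts: "york is" is frequent but not a real collocation, which is why stop-word filtering or an association measure is normally applied on top.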

    Conclusion

    The above resources show how to perform text preprocessing, from basic steps to advanced ones, with different Python libraries. Below you can find the links to the above resources and a few more links on the same topic.
    Feel free to provide feedback, comments, or links to resources that are not mentioned here.

    References

    1. textcleaner
    2. Ultimate guide to deal with Text Data (using Python) – for Data Scientists & Engineers
    3. A Practitioner’s Guide to Natural Language Processing (Part I) — Processing & Understanding Text
    4. Natural Language Processing is Fun
    5. Text Summarization with Amazon Reviews
    6. NLP Learning Series: Text Preprocessing Methods for Deep Learning
    7. Text Preprocessing in Python: Steps, Tools, and Examples
    8. Text Data Preprocessing: A Walkthrough in Python
    9. Text Preprocessing, Keras Documentation
    10. What is the best way to remove accents in a Python unicode string?
    11. PREPROCESSING DATA FOR NLP
    12. Processing Raw Text
    13. TextBlob: Simplified Text Processing