7+ Best Online Resources for Text Preprocessing for Machine Learning Algorithms

With advances in machine learning and natural language processing, and with more and more information available on the web, text data is increasingly used in machine learning algorithms. An important step in using text data is preprocessing the raw text. The data preparation steps may include the following:

Tokenization
Removing punctuation
Removing stop words
Stemming
Word Embedding
Named-entity recognition (NER)
Coreference resolution – finding all expressions that refer to the same entity in a text
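
The first four steps in this list can be sketched with the Python standard library alone. This is a minimal illustration, not a replacement for NLTK or spaCy: the STOP_WORDS set is a tiny made-up list, and toy_stem is a hypothetical suffix stripper rather than a real stemming algorithm such as Porter's.

```python
import string

# Tiny illustrative stop list (an assumption; real projects use a fuller list).
STOP_WORDS = {"the", "is", "in", "that", "was", "i", "a", "an", "very"}


def tokenize(text):
    """Lowercase the text and split it on whitespace."""
    return text.lower().split()


def strip_punctuation(tokens):
    """Remove ASCII punctuation from each token and drop tokens left empty."""
    table = str.maketrans("", "", string.punctuation)
    cleaned = (token.translate(table) for token in tokens)
    return [token for token in cleaned if token]


def remove_stop_words(tokens):
    """Keep only tokens that are not in the stop list."""
    return [token for token in tokens if token not in STOP_WORDS]


def toy_stem(token):
    """Strip a few common suffixes. This is NOT the Porter algorithm."""
    for suffix in ("ing", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(text):
    """Run tokenization, punctuation removal, stop-word removal and stemming."""
    tokens = remove_stop_words(strip_punctuation(tokenize(text)))
    return [toy_stem(token) for token in tokens]


print(preprocess("The city is nice. I was in that city 10 days ago!"))
# ['city', 'nice', 'city', '10', 'day', 'ago']
```

In practice the libraries reviewed below handle the many edge cases this sketch ignores, such as non-ASCII punctuation and irregular word forms.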

Recently published articles on this topic have greatly expanded the available examples of text preprocessing operations. In this post we collect and review online articles that describe text preprocessing techniques with Python code examples.

1. textcleaner

Text-Cleaner is a utility library for text data preprocessing. It can be used before passing the text data to a model. textcleaner builds on open source tools such as NLTK (for advanced cleaning) and regex (for regular expressions).

Features:

main_cleaner does all of the following in one call:

remove unnecessary blank lines

convert all characters to lowercase if needed

remove numbers, particular characters (if needed), symbols and stop-words from the whole text

tokenize the text data in one call

stemming & lemmatization powered by NLTK

textcleaner saves time by providing basic cleaning functionality, which lets the developer focus on building the machine learning model. The nice thing is that it can do many text processing steps in one call.

Here is an example of how to use it:

import textcleaner as tc

# Path to the raw input text file
f = "C:\\textinputdata.txt"
# Run all cleaning steps in one call
out = tc.main_cleaner(f)
print(out)

"""
input text:
The house235 is very small!!
the city is nice.
I was in that city 10 days ago.
The city2 is big.


output text:
[['hous', 'small'], ['citi', 'nice'], ['citi', 'day', 'ago'], ['citi', 'big']]
"""

2. Guide for Text Preprocessing from Analytics Vidhya

Analytics Vidhya regularly publishes practical resources about AI, ML and analytics. In its ‘Ultimate guide to deal with Text Data’ you can find descriptions of text preprocessing steps with Python code. Different Python libraries are used for the text preprocessing tasks:
NLTK – for stop list, stemming

TextBlob – for spelling correction, tokenization, lemmatization. TextBlob is a Python library for processing textual data. It provides a simple API for diving into common natural language processing (NLP) tasks such as part-of-speech tagging, noun phrase extraction, sentiment analysis, classification, translation, and more.

gensim – for word embeddings

sklearn – for feature_extraction with TF-IDF

The guide covers text processing steps from basic to advanced.
Basic steps:

Lower casing
Punctuation, stopwords, frequent and rare words removal
Spelling correction
Tokenization
Stemming
Lemmatization
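
As an illustration of the frequent and rare words removal step, here is a minimal sketch using only collections.Counter. The function name and the top_n and min_count thresholds are assumptions chosen for this example; they are not taken from the guide.

```python
from collections import Counter


def remove_frequent_and_rare(docs, top_n=1, min_count=2):
    """Drop the top_n most frequent tokens and tokens seen fewer than min_count times.

    docs is a list of token lists; top_n and min_count are illustrative defaults.
    """
    counts = Counter(token for doc in docs for token in doc)
    frequent = {token for token, _ in counts.most_common(top_n)}
    rare = {token for token, count in counts.items() if count < min_count}
    drop = frequent | rare
    return [[token for token in doc if token not in drop] for doc in docs]


docs = [
    ["city", "nice", "food"],
    ["city", "big", "food"],
    ["city", "old"],
]
print(remove_frequent_and_rare(docs))
# [['food'], ['food'], []]
```

The idea is that words appearing almost everywhere, or almost nowhere, usually carry little signal for a model, but the right thresholds depend on the dataset.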

Advanced text processing:

N-grams
Term Frequency, Inverse Document Frequency
Term Frequency-Inverse Document Frequency (TF-IDF)
Bag of Words
Sentiment Analysis
Word Embedding
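
To make N-grams, Bag of Words and TF-IDF from the list above more concrete, here is a small standard-library sketch. It assumes the plain textbook definitions, tf = count / document length and idf = log(N / df). Libraries such as sklearn use smoothed variants, so their numbers will differ from these.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return the list of n-grams (as tuples) in a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def tf_idf(docs):
    """Compute TF-IDF per document with tf = count / len(doc), idf = log(N / df)."""
    n_docs = len(docs)
    # Document frequency: in how many documents each token occurs
    doc_freq = Counter(token for doc in docs for token in set(doc))
    scores = []
    for doc in docs:
        bag = Counter(doc)  # bag of words: token -> raw count
        scores.append({
            token: (count / len(doc)) * math.log(n_docs / doc_freq[token])
            for token, count in bag.items()
        })
    return scores


docs = [["city", "is", "nice"], ["city", "is", "big"]]
print(ngrams(docs[0], 2))  # [('city', 'is'), ('is', 'nice')]
print(tf_idf(docs)[0])     # 'city' and 'is' score 0.0; 'nice' scores log(2) / 3
```

Words that occur in every document get a score of zero, which is why TF-IDF highlights the terms that distinguish one document from the others.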

3. Guide to Natural Language Processing

Often we extract text data from the web and need to strip out the HTML before feeding it to ML algorithms.
Dipanjan (DJ) Sarkar shows how to do this in his post ‘A Practitioner’s Guide to Natural Language Processing (Part I) — Processing & Understanding Text’.

Here we can find a project that downloads HTML with the BeautifulSoup Python library, extracts the useful text from it, and performs part-of-speech analysis, sentiment analysis and NER.
The post uses the following Python text processing libraries for machine learning:
spaCy – features neural models for tagging, parsing and entity recognition (since v2.0)
NLTK – a leading platform for building Python programs for natural language processing.

Basic text preprocessing steps covered:

Removing HTML tags
Removing accented characters, Special Characters, Stopwords
Expanding Contractions
Stemming
Lemmatization
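
The first two steps in this list can be approximated with the standard library. The sketch below is not the code from the guide, which uses BeautifulSoup: it swaps in html.parser to collect text nodes and unicodedata to drop accents. The names _TextExtractor, strip_html and remove_accents are made up for this example.

```python
import unicodedata
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collect the text nodes of an HTML document and ignore the tags."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def strip_html(html):
    """Return the text content of an HTML string with whitespace collapsed."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(" ".join(parser.parts).split())


def remove_accents(text):
    """Decompose characters (NFKD) and drop the combining marks."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


print(remove_accents(strip_html("<p>Sómě <b>Áccěntěd</b> text</p>")))
# Some Accented text
```

Note that this simple extractor also keeps the contents of script and style elements, so real web pages usually need the extra filtering that a library like BeautifulSoup makes easy.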

In addition to the basic steps above, the guide also covers parsing techniques for understanding the structure and syntax of language, including:

Parts of Speech (POS) Tagging
Shallow Parsing or Chunking
Constituency Parsing
Dependency Parsing
Named Entity Recognition

4. Natural Language Processing

In the article ‘Natural Language Processing is Fun’ you will find descriptions of the following text preprocessing steps:

Sentence Segmentation
Word Tokenization
Predicting Parts of Speech for Each Token
Text Lemmatization
Identifying Stop Words
Dependency Parsing
Named Entity Recognition (NER)
Coreference Resolution
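
The first two steps, sentence segmentation and word tokenization, can be roughly approximated with regular expressions. This is a naive sketch with made-up function names, not the approach from the article: it will split incorrectly on abbreviations such as "Dr." and on decimal numbers followed by a space, which is exactly why trained models are used in practice.

```python
import re


def split_sentences(text):
    """Split after '.', '!' or '?' followed by whitespace (naive rule)."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def word_tokenize(sentence):
    """Return runs of word characters and single punctuation marks as tokens."""
    return re.findall(r"\w+|[^\w\s]", sentence)


text = "The city is nice. I was in that city 10 days ago."
for sentence in split_sentences(text):
    print(word_tokenize(sentence))
# ['The', 'city', 'is', 'nice', '.']
# ['I', 'was', 'in', 'that', 'city', '10', 'days', 'ago', '.']
```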

The article explains thoroughly how computers understand textual data by dividing text processing into the steps above. Diagrams make the concepts easy to understand. These steps constitute a natural language processing text pipeline, and it turns out that with spaCy you can do most of them in only a few lines.

Here is an example of using spaCy:

import spacy

# Load the large English NLP model
nlp = spacy.load('en_core_web_lg')

f = "C:\\Users\\pythonrunfiles\\textinputdata.txt"

# Read the whole input file into a string
with open(f) as ftxt:
    text = ftxt.read()

print(text)

# Parse the text with spaCy.
doc = nlp(text)

# Print each token
for token in doc:
    print(token.text)

# Print the linguistic attributes of each token
for token in doc:
    print(token.text, token.lemma_, token.pos_, token.tag_, token.dep_,
          token.shape_, token.is_alpha, token.is_stop)

# Print the named entities with their character offsets and labels
for ent in doc.ents:
    print(ent.text, ent.start_char, ent.end_char, ent.label_)



Partial output of the above program:
....
I
was
in
that
city
10
days
ago
.
....
I -PRON- PRON PRP nsubj X True False
was be VERB VBD ROOT xxx True False
in in ADP IN prep xx True False
that that DET DT det xxxx True False
city city NOUN NN pobj xxxx True False
10 10 NUM CD nummod dd False False
days day NOUN NNS npadvmod xxxx True False
ago ago ADV RB advmod xxx True False
. . PUNCT . punct . False False
....
10 days ago 66 77 DATE

5. Learning from Text Summarization Project

This is the project ‘Text Summarization with Amazon Reviews’, in which the reviews are about food. Its first part contains text preprocessing steps: converting to lowercase, replacing contractions with their longer forms, and removing unwanted characters.

For expanding contractions the author uses a list of contractions from Stack Overflow:
http://stackoverflow.com/questions/19790188/expanding-english-language-contractions-in-python
Using the list and the code from this link, we can replace, for example:
you’ve with you have
she’s with she is
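
A minimal sketch of this replacement, assuming a small hand-written dictionary in place of the full Stack Overflow list. The CONTRACTIONS entries and the expand_contractions name are illustrative only, and the longer forms are always inserted in lowercase.

```python
import re

# A few entries only, for illustration; the Stack Overflow list is much longer.
CONTRACTIONS = {
    "you've": "you have",
    "she's": "she is",
    "can't": "cannot",
    "won't": "will not",
}

# One alternation pattern that matches any known contraction as a whole word
_PATTERN = re.compile(
    r"\b(" + "|".join(re.escape(key) for key in CONTRACTIONS) + r")\b",
    flags=re.IGNORECASE,
)


def expand_contractions(text):
    """Replace known contractions with their longer (lowercase) forms."""
    text = text.replace("\u2019", "'")  # normalize curly apostrophes
    return _PATTERN.sub(lambda match: CONTRACTIONS[match.group(0).lower()], text)


print(expand_contractions("You’ve said she's late and can't come."))
# you have said she is late and cannot come.
```

A dictionary lookup cannot resolve ambiguous cases: "she's" can also mean "she has", so the mapping is a simplification.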

6. Text Preprocessing Methods for Deep Learning

This is a primer on word2vec embeddings, but it also includes basic preprocessing techniques for text data, such as:

Cleaning Special Characters and Removing Punctuations
Cleaning Numbers
Removing Misspells
Removing Contractions
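
The first two techniques could look roughly like the sketch below. It is written from the step names alone and is not the article's code: the function names, and the choice of "#" as a placeholder for digits, are assumptions for this example.

```python
import re


def space_out_punctuation(text):
    """Put spaces around each punctuation mark so it becomes its own token."""
    spaced = re.sub(r"([^\w\s])", r" \1 ", text)
    return " ".join(spaced.split())


def mask_numbers(text, placeholder="#"):
    """Replace every digit with a placeholder, keeping the number's length."""
    return re.sub(r"\d", placeholder, text)


print(mask_numbers(space_out_punctuation("It cost $1500, in 2019!")))
# It cost $ #### , in #### !
```

Because the punctuation step also separates apostrophes, contractions should be expanded before it is applied.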

7. Text Preprocessing in Python

This is another great resource about text preprocessing steps with Python. In addition to the basic steps, it shows how to do collocation extraction, relationship extraction and NER. The article has many links to other articles on text preprocessing techniques.

The article also compares many natural language processing toolkits, such as NLTK and spaCy, by features, programming language and license. The comparison table links to each toolkit's project. This makes it a handy place to find descriptions of text processing steps, the tools used, usage examples and links to many other resources.

Conclusion

The above resources show how to perform text data preprocessing, from basic steps to advanced ones, with different Python libraries. Below you can find the links to these resources and a few more on the same topic.
Feel free to provide feedback, comments and links to resources that are not mentioned here.

References

1. textcleaner
2. Ultimate guide to deal with Text Data (using Python) – for Data Scientists & Engineers
3. A Practitioner’s Guide to Natural Language Processing (Part I) — Processing & Understanding Text
4. Natural Language Processing is Fun
5. Text Summarization with Amazon Reviews
6. NLP Learning Series: Text Preprocessing Methods for Deep Learning
7. Text Preprocessing in Python: Steps, Tools, and Examples
8. Text Data Preprocessing: A Walkthrough in Python
9. Text Preprocessing, Keras Documentation
10. What is the best way to remove accents in a Python unicode string?
11. Preprocessing Data for NLP
12. Processing Raw Text
13. TextBlob: Simplified Text Processing