With advances in machine learning and natural language processing, and the growing amount of information available on the web, the use of text data in machine learning algorithms is growing. An important step in using text data is preprocessing the original raw text. The data preparation steps may include the following:
- Removing punctuation
- Removing stop words
- Word Embedding
- Named-entity recognition (NER)
- Coreference resolution – finding all expressions that refer to the same entity in a text
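The first two steps above can be sketched with the standard library alone. This is a minimal illustration, not a production pipeline: the tiny stopword set here is made up for the example, while real pipelines use NLTK's full stopword list.

```python
import string

# A tiny illustrative stopword list; real projects use NLTK's full list.
STOP_WORDS = {"the", "is", "in", "a", "an", "of", "to", "and"}

def preprocess(text):
    """Lowercase, strip punctuation, and drop stopwords."""
    # Remove punctuation characters.
    text = text.translate(str.maketrans("", "", string.punctuation))
    # Tokenize on whitespace and filter stopwords.
    return [w for w in text.lower().split() if w not in STOP_WORDS]

print(preprocess("The city is nice, and the house is small!"))
# → ['city', 'nice', 'house', 'small']
```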
Recently published articles on this topic have greatly expanded the available examples of text preprocessing operations. In this post we collect and review online articles that describe text preprocessing techniques with Python code examples.
Text-Cleaner (textcleaner) is a utility library for text-data preprocessing. It can be used before passing the text data to a model. textcleaner builds on open source projects such as NLTK (for advanced cleaning) and regular expressions (for pattern-based cleaning).
- main_cleaner – does all of the library's cleaning steps in one call
textcleaner saves time by providing basic cleaning functionality, allowing the developer to focus on building the machine learning model. The nice thing is that it can perform many text processing steps in one call.
Here is an example of how to use it:
```python
import textcleaner as tc

f = "C:\\textinputdata.txt"
out = tc.main_cleaner(f)
print(out)

"""
input text:
The house235 is very small!! the city is nice.
I was in that city 10 days ago. The city2 is big.

output text:
[['hous', 'small'], ['citi', 'nice'], ['citi', 'day', 'ago'], ['citi', 'big']]
"""
```
Analytics Vidhya regularly provides great practical resources about AI, ML, and analytics. In this ‘Ultimate guide to deal with Text Data’ you can find descriptions of text preprocessing steps with Python code. Different Python libraries are utilized for the text preprocessing tasks:
NLTK – for stop word lists and stemming
TextBlob – for spelling correction, tokenization, lemmatization. TextBlob is a Python library for processing textual data. It provides a simple API for diving into common natural language processing (NLP) tasks such as part-of-speech tagging, noun phrase extraction, sentiment analysis, classification, translation, and more.
gensim – for word embeddings
sklearn – for feature_extraction with TF-IDF
The guide covers text processing steps from basic to advanced.
Basic steps:
- Lower casing
- Punctuation, stopwords, frequent and rare words removal
- Spelling correction
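The guide uses TextBlob's `.correct()` for spelling correction. As a dependency-free sketch of the underlying idea (pick the dictionary word closest to the misspelled one), the standard library's difflib can approximate it; the vocabulary below is a made-up toy list, not a real dictionary.

```python
import difflib

# A small illustrative vocabulary; a real corrector uses a full dictionary.
VOCAB = ["spelling", "correction", "language", "processing", "text"]

def correct(word):
    """Return the closest vocabulary word, or the word itself if no match."""
    matches = difflib.get_close_matches(word.lower(), VOCAB, n=1, cutoff=0.7)
    return matches[0] if matches else word

print(correct("speling"))   # → 'spelling'
print(correct("langauge"))  # → 'language'
```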
Advanced Text Processing
- Term Frequency, Inverse Document Frequency
- Term Frequency-Inverse Document Frequency (TF-IDF)
- Bag of Words
- Sentiment Analysis
- Word Embedding
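As a from-scratch illustration of TF-IDF (the textbook formula; note that sklearn's TfidfVectorizer smooths the idf term slightly differently), here it is applied to the tokenized sentences from the textcleaner example above:

```python
import math

# Tokenized documents from the textcleaner example above.
docs = [
    ["citi", "nice"],
    ["citi", "day", "ago"],
    ["citi", "big"],
]

def tf_idf(term, doc, corpus):
    """Term frequency scaled by inverse document frequency."""
    tf = doc.count(term) / len(doc)
    df = sum(1 for d in corpus if term in d)   # documents containing the term
    idf = math.log(len(corpus) / df)
    return tf * idf

# "citi" appears in every document, so its idf (and tf-idf) is 0.
print(tf_idf("citi", docs[0], docs))             # → 0.0
# "nice" is distinctive for the first document.
print(round(tf_idf("nice", docs[0], docs), 3))   # → 0.549
```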
Often we extract text data from the web, and we need to strip out HTML before feeding it to ML algorithms.
Dipanjan (DJ) Sarkar, in his post ‘A Practitioner’s Guide to Natural Language Processing (Part I) — Processing & Understanding Text’, shows how to do this.
Here we can find a project for downloading HTML text with the BeautifulSoup Python library, extracting useful text from HTML, and doing part-of-speech analysis, sentiment analysis, and NER.
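BeautifulSoup's `get_text()` handles the extraction step. As a dependency-free sketch of the same idea, the standard library's html.parser can pull visible text out of a page while skipping script and style bodies:

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect text content, skipping <script> and <style> bodies."""
    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip = False

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip = True

    def handle_endtag(self, tag):
        if tag in ("script", "style"):
            self._skip = False

    def handle_data(self, data):
        if not self._skip and data.strip():
            self.parts.append(data.strip())

def strip_html(html):
    parser = TextExtractor()
    parser.feed(html)
    return " ".join(parser.parts)

html = "<html><body><h1>Title</h1><p>Some text.</p><script>var x=1;</script></body></html>"
print(strip_html(html))  # → 'Title Some text.'
```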
In this post we can find the following text processing Python libraries for machine learning:
spaCy – now features new neural models for tagging, parsing and entity recognition (as of v2.0)
nltk – a leading platform for building Python programs for natural language processing.
Basic text preprocessing steps covered:
- Removing HTML tags
- Removing accented characters, special characters, stopwords
- Expanding Contractions
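Accent removal is commonly done with Unicode normalization (the approach behind the Stack Overflow answer on removing accents linked in the resources below): decompose each character, then drop the combining marks.

```python
import unicodedata

def remove_accents(text):
    """Decompose characters (NFKD) and drop combining accent marks."""
    nfkd = unicodedata.normalize("NFKD", text)
    return "".join(c for c in nfkd if not unicodedata.combining(c))

print(remove_accents("Sómě Áccěntěd těxt"))  # → 'Some Accented text'
```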
In addition to the above basic steps, the guide also covers parsing techniques for understanding the structure and syntax of language, including:
- Parts of Speech (POS) Tagging
- Shallow Parsing or Chunking
- Constituency Parsing
- Dependency Parsing
- Named Entity Recognition
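Shallow parsing (chunking) groups part-of-speech-tagged tokens into phrases. Below is a toy noun-phrase chunker over a hand-tagged sentence; real chunking would use NLTK's RegexpParser or spaCy, and the tags here are supplied manually purely for illustration.

```python
# Hypothetical pre-tagged sentence (tagging itself would come from NLTK/spaCy).
tagged = [("the", "DT"), ("big", "JJ"), ("city", "NN"),
          ("is", "VBZ"), ("nice", "JJ")]

def chunk_nps(tagged):
    """Greedy shallow parse: optional DT, any JJs, then a NN form an NP."""
    chunks, current = [], []
    for word, tag in tagged:
        if tag in ("DT", "JJ"):
            current.append(word)              # possible NP start/continuation
        elif tag.startswith("NN"):
            current.append(word)
            chunks.append(" ".join(current))  # a noun closes the chunk
            current = []
        else:
            current = []                      # anything else breaks the chunk
    return chunks

print(chunk_nps(tagged))  # → ['the big city']
```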
In the article ‘Natural Language Processing is Fun’ you will find descriptions of the text preprocessing steps:
- Sentence Segmentation
- Word Tokenization
- Predicting Parts of Speech for Each Token
- Text Lemmatization
- Identifying Stop Words
- Dependency Parsing
- Named Entity Recognition (NER)
- Coreference Resolution
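Sentence segmentation, the first pipeline step above, can be roughly approximated with a regex split. This is a naive sketch: spaCy's segmenter handles abbreviations and other edge cases that this pattern ignores.

```python
import re

def segment_sentences(text):
    """Naive segmentation: split on whitespace that follows ., ! or ?."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

text = "The city is nice. I was there 10 days ago! Was it big?"
print(segment_sentences(text))
# → ['The city is nice.', 'I was there 10 days ago!', 'Was it big?']
```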
The article explains thoroughly how computers understand textual data by dividing text processing into the above steps. Diagrams make the concepts easy to understand. The steps above constitute a natural language processing text pipeline, and it turns out that with spaCy you can do most of them with only a few lines of code.
Here is an example of using spaCy:
```python
import spacy

# Load the large English NLP model
nlp = spacy.load('en_core_web_lg')

f = "C:\\Users\\pythonrunfiles\\textinputdata.txt"
with open(f) as ftxt:
    text = ftxt.read()
print(text)

# Parse the text with spaCy.
doc = nlp(text)

for token in doc:
    print(token.text)

for token in doc:
    print(token.text, token.lemma_, token.pos_, token.tag_, token.dep_,
          token.shape_, token.is_alpha, token.is_stop)

for ent in doc.ents:
    print(ent.text, ent.start_char, ent.end_char, ent.label_)
```

Partial output of the above program:

```
....
I
was
in
that
city
10
days
ago
.
....
I -PRON- PRON PRP nsubj X True False
was be VERB VBD ROOT xxx True False
in in ADP IN prep xx True False
that that DET DT det xxxx True False
city city NOUN NN pobj xxxx True False
10 10 NUM CD nummod dd False False
days day NOUN NNS npadvmod xxxx True False
ago ago ADV RB advmod xxx True False
. . PUNCT . punct . False False
....
10 days ago 66 77 DATE
```
This is the project ‘Text Summarization with Amazon Reviews’, where the reviews are about food, but the first part contains text preprocessing steps. These include converting to lowercase, replacing contractions with their longer forms, and removing unwanted characters.
For removing contractions, the author uses a list of contractions from Stack Overflow.
Using the list and the code from this link, we can replace, for example:
you’ve with you have
she’s with she is
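A minimal sketch of that replacement (the mapping below contains only a few illustrative entries, not the full Stack Overflow list):

```python
import re

# A few entries of the kind the linked contraction list contains (illustrative).
CONTRACTIONS = {
    "you've": "you have",
    "she's": "she is",
    "can't": "cannot",
    "won't": "will not",
}

def expand_contractions(text):
    """Replace each known contraction with its longer form."""
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(k) for k in CONTRACTIONS) + r")\b",
        flags=re.IGNORECASE,
    )
    return pattern.sub(lambda m: CONTRACTIONS[m.group(0).lower()], text)

print(expand_contractions("You've seen it, she's happy."))
# → 'you have seen it, she is happy.'
```

Note that this simple version lowercases expanded words; handling capitalization is left to the full implementation.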
This is a primer on word2vec embeddings, but it includes basic preprocessing techniques for text data such as:
- Cleaning Special Characters and Removing Punctuation
- Cleaning Numbers
- Correcting Misspellings
- Removing Contractions
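A sketch of the special-character and number cleaning steps with plain regular expressions (the exact patterns in the article may differ):

```python
import re

def clean_text(text):
    """Drop special characters, then remove standalone numbers."""
    # Keep letters, digits, and whitespace only.
    text = re.sub(r"[^a-zA-Z0-9\s]", "", text)
    # Remove tokens that are purely numeric (leaves 'house235' intact).
    text = re.sub(r"\b\d+\b", "", text)
    # Collapse the whitespace left behind.
    return re.sub(r"\s+", " ", text).strip()

print(clean_text("The house235 is very small!! I was there 10 days ago."))
# → 'The house235 is very small I was there days ago'
```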
This is another great resource about text preprocessing steps with Python. In addition to the basic steps, we can find here how to do collocation extraction, relationship extraction, and NER. The paper has many links to other articles on text preprocessing techniques.
This paper also compares many different natural language processing toolkits, such as NLTK and spaCy, by features, programming language, and license. The table has links to the project page for each text processing toolkit. So it is very handy: in one place you can find descriptions of text processing steps, the tools used, usage examples, and links to many other resources.
The above resources show how to perform textual data preprocessing from basic steps to advanced, with different Python libraries. Below you can find the above links and a few more links to resources on the same topic.
Feel free to provide feedback, comments, or links to resources that are not mentioned here.
2. Ultimate guide to deal with Text Data (using Python) – for Data Scientists & Engineers
3. A Practitioner’s Guide to Natural Language Processing (Part I) — Processing & Understanding Text
4. Natural Language Processing is Fun
5. Text Summarization with Amazon Reviews
6. NLP Learning Series: Text Preprocessing Methods for Deep Learning
7. Text Preprocessing in Python: Steps, Tools, and Examples
8. Text Data Preprocessing: A Walkthrough in Python
9. Text Preprocessing, Keras Documentation
10. What is the best way to remove accents in a Python unicode string?
11. PREPROCESSING DATA FOR NLP
12. Processing Raw Text
13. TextBlob: Simplified Text Processing