Text analytics techniques apply natural language processing (NLP), text mining, and machine learning methods such as text classification, clustering, summarization, information extraction, and sentiment analysis.
We can view text analytics as the process of extracting meaningful information from unstructured text. For example, from online discussions we may want to extract users' opinions about a product.
Bag of Words
Computers do not understand text directly, so text mining programs map text data to vectors of real numbers. The traditional approach counts the words in each document to convert the text into vectors. The best-known and most widely used model of this kind is the bag of words, which has one dimension per unique word.
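As a minimal sketch of the idea, the following Python code builds a vocabulary from two toy documents and counts words per document. The function names build_vocabulary and bag_of_words, the whitespace tokenization, and the sample documents are assumptions made for illustration; they are not part of any particular library.

```python
from collections import Counter


def build_vocabulary(documents):
    """Assign one vector dimension (column index) to each unique word."""
    words = sorted({word for doc in documents for word in doc.lower().split()})
    return {word: index for index, word in enumerate(words)}


def bag_of_words(document, vocabulary):
    """Return the word-count vector of a document; word order is discarded."""
    vector = [0] * len(vocabulary)
    for word, count in Counter(document.lower().split()).items():
        if word in vocabulary:  # words outside the vocabulary are ignored
            vector[vocabulary[word]] = count
    return vector


# Toy documents, made up for this sketch.
documents = ["the cat sat on the mat", "the dog sat"]
vocabulary = build_vocabulary(documents)
vectors = [bag_of_words(doc, vocabulary) for doc in documents]

print(sorted(vocabulary))  # ['cat', 'dog', 'mat', 'on', 'sat', 'the']
print(vectors)             # [[1, 0, 1, 1, 1, 2], [0, 1, 0, 0, 1, 1]]
```

Each vector has as many entries as there are unique words, so with a realistic vocabulary the vectors become very long and mostly zero.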
Word Embeddings
Many research papers report that, despite its simplicity, the bag of words model is very effective. However, it ignores the position of a word in the text relative to other words. This information can help capture the semantic meaning of a word, because words that appear in similar contexts tend to have similar meanings. [4]
The famous 1957 quotation “You shall know a word by the company it keeps” underlines the importance of word context. It comes from the English linguist J. R. Firth, a leading figure in British linguistics during the 1950s. [1]
To capture the context-dependent nature of meaning, word embedding techniques were created. Word embedding is the collective name for a set of language modeling and feature learning techniques that map words or phrases from the vocabulary to vectors of real numbers.
Word embedding involves a mathematical embedding from a space with one dimension per word to a continuous vector space of much lower dimension. Methods for generating this mapping include neural networks, dimensionality reduction on the word co-occurrence matrix, and probabilistic models. [2]
Such distributed representations of words in a vector space help learning algorithms achieve better performance in natural language processing tasks by grouping similar words. [3]
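One of the approaches listed above, dimensionality reduction on the word co-occurrence matrix, can be sketched with NumPy. This is an illustration under stated assumptions, not a production method: the toy sentences, the window size of 2, the target dimension of 2, and the helper names cooccurrence_matrix, svd_embeddings and cosine are all chosen for the example.

```python
import numpy as np


def cooccurrence_matrix(sentences, window=2):
    """Count how often each pair of words appears within `window` positions."""
    vocab = sorted({word for sentence in sentences for word in sentence})
    index = {word: i for i, word in enumerate(vocab)}
    counts = np.zeros((len(vocab), len(vocab)))
    for sentence in sentences:
        for pos, word in enumerate(sentence):
            start = max(0, pos - window)
            stop = min(len(sentence), pos + window + 1)
            for ctx_pos in range(start, stop):
                if ctx_pos != pos:
                    counts[index[word], index[sentence[ctx_pos]]] += 1
    return vocab, counts


def svd_embeddings(counts, dim=2):
    """Keep the `dim` strongest directions of the co-occurrence matrix."""
    u, s, _ = np.linalg.svd(counts, full_matrices=False)
    return u[:, :dim] * s[:dim]


def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


# Toy corpus, made up for this sketch.
sentences = [
    ["the", "cat", "drinks", "milk"],
    ["the", "dog", "drinks", "water"],
    ["the", "cat", "chases", "the", "dog"],
]
vocab, counts = cooccurrence_matrix(sentences)
embeddings = svd_embeddings(counts, dim=2)

print(counts.shape, "->", embeddings.shape)
cat, dog = embeddings[vocab.index("cat")], embeddings[vocab.index("dog")]
print("cosine(cat, dog) =", cosine(cat, dog))
```

Each word goes from one dimension per vocabulary word to only two dimensions. On a corpus this small the similarity values are not meaningful; the sketch only shows the mechanics of the mapping.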
How Text Analytics Techniques Can Use Word Embeddings
Once we have word embeddings, we can feed the vector representations of words into the algorithms used by text analytics techniques.
For example, K Means Clustering Example with Word2Vec is a very basic example in which a sequence of words is embedded with gensim word2vec and the results are then fed into a machine learning clustering algorithm.
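The general pattern of feeding embeddings into a clustering algorithm can be sketched as follows. This is not the code from the linked example: instead of vectors trained with gensim word2vec it uses hand-made two-dimensional toy vectors, and instead of a library clusterer it uses a small k-means written in NumPy. The names toy_vectors, document_vector and kmeans are assumptions for illustration.

```python
import numpy as np

# Hypothetical 2-dimensional embeddings, hand-made for illustration only.
# In practice these would come from a trained model such as word2vec.
toy_vectors = {
    "cat": [0.9, 0.1], "dog": [0.8, 0.2], "pet": [0.85, 0.15],
    "stock": [0.1, 0.9], "market": [0.2, 0.8], "price": [0.15, 0.85],
}


def document_vector(tokens, vectors):
    """Represent a document as the average of its known word vectors.

    Assumes the document contains at least one word with a known vector.
    """
    known = [vectors[token] for token in tokens if token in vectors]
    return np.mean(known, axis=0)


def kmeans(points, k, iterations=10):
    """Plain k-means; centroids start at the first k points (deterministic)."""
    centroids = points[:k].copy()
    for _ in range(iterations):
        # Distance from every point to every centroid.
        distances = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        for cluster in range(k):
            members = points[labels == cluster]
            if len(members):  # keep the old centroid if a cluster is empty
                centroids[cluster] = members.mean(axis=0)
    return labels, centroids


documents = [
    ["cat", "dog"],
    ["stock", "market"],
    ["dog", "pet"],
    ["market", "price"],
]
points = np.array([document_vector(doc, toy_vectors) for doc in documents])
labels, centroids = kmeans(points, k=2)
print(labels)  # documents about animals and about finance get different labels
```

Starting the centroids at the first k points keeps the sketch reproducible; real implementations usually use random or k-means++ initialization.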
Word embeddings can be saved after they are learned. We can also use word embeddings that were obtained on a different vocabulary. Using Pretrained Word Embeddings in Machine Learning is an example of how to load word embeddings provided by Google.
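To illustrate saving and reloading, the sketch below writes vectors in the plain-text word2vec layout (a header line with the vocabulary size and the dimension, then one word per line followed by its values) and reads them back. The helpers save_vectors and load_vectors, the file name and the toy vectors are assumptions for this example. Libraries such as gensim read this layout, and its binary variant, through KeyedVectors.load_word2vec_format.

```python
import os
import tempfile


def save_vectors(path, vectors):
    """Write vectors as text: a 'count dim' header, then 'word v1 v2 ...'."""
    dim = len(next(iter(vectors.values())))
    with open(path, "w", encoding="utf-8") as handle:
        handle.write(f"{len(vectors)} {dim}\n")
        for word, vector in vectors.items():
            values = " ".join(repr(float(x)) for x in vector)
            handle.write(f"{word} {values}\n")


def load_vectors(path):
    """Read vectors written by save_vectors back into a dictionary."""
    vectors = {}
    with open(path, encoding="utf-8") as handle:
        count, dim = map(int, handle.readline().split())
        for line in handle:
            parts = line.rstrip("\n").split(" ")
            vectors[parts[0]] = [float(x) for x in parts[1:]]
    if len(vectors) != count or any(len(v) != dim for v in vectors.values()):
        raise ValueError("file contents do not match the header")
    return vectors


# Toy vectors, made up for this sketch.
toy = {"cat": [0.9, 0.1], "dog": [0.8, 0.2]}
with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "toy_vectors.txt")
    save_vectors(path, toy)
    loaded = load_vectors(path)

print(loaded == toy)
```

A text file is easy to inspect but large; pretrained vector sets are often distributed in a binary form for that reason.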
References
1. John Rupert Firth
2. Word Embedding
3. Distributed Representations of Words and Phrases and their Compositionality
4. Distributed Representations of Sentences and Documents