FastText Word Embeddings for Text Classification with MLP and Python

Word embeddings are widely used now in many text applications or natural language processing moddels. In the previous posts I showed examples how to use word embeddings from word2vec Google, glove models for different tasks including machine learning clustering:

GloVe – How to Convert Word to Vector with GloVe and Python

word2vec – Vector Representation of Text – Word Embeddings with word2vec

word2vec application – K Means Clustering Example with Word2Vec in Data Mining or Machine Learning

In this post we will look at fastText word embeddings in machine learning. You will learn how to load pretrained fastText, get text embeddings and do text classification. As stated on fastText site – text classification is a core problem to many applications, like spam detection, sentiment analysis or smart replies. [1]

What is fastText

fastText is an open-source, free, lightweight library that allows users to learn text representations and text classifiers. [1]

fastText, is created by Facebook’s AI Research (FAIR) lab. The model is an unsupervised learning algorithm for obtaining vector representations for words. Facebook makes available pretrained models for 294 languages.[2]

As per Quora [6], Fasttext treats each word as composed of character ngrams. So the vector for a word is made of the sum of this character n grams. Word2vec (and glove) treat words as the smallest unit to train on. This means that fastText can generate better word embeddings for rare words. Also fastText can generate word embeddings for out of vocabulary word but word2vec and glove can not do this.

Word Embeddings File

I downloaded wiki file wiki-news-300d-1M.vec from here [4], but there are some other links where you can download different data files. I found this one has smaller size so it is easy to work with it.

Basic Operations with fastText Word Embeddings

To get most similar words to some word:

from gensim.models import KeyedVectors
model = KeyedVectors.load_word2vec_format('wiki-news-300d-1M.vec')
print (model.most_similar('desk'))

"""
[('desks', 0.7923153638839722), ('Desk', 0.6869951486587524), ('desk.', 0.6602819561958313), ('desk-', 0.6187258958816528), ('credenza', 0.5955315828323364), ('roll-top', 0.5875717401504517), ('rolltop', 0.5837830305099487), ('bookshelf', 0.5758029222488403), ('Desks', 0.5755287408828735), ('sofa', 0.5617446899414062)]
"""

Load words in vocabulary:

words = []
for word in model.vocab:
    words.append(word)

To see embeddings:

print("Vector components of a word: {}".format(
    model[words[0]]
))

"""
Vector components of a word: [-0.0451  0.0052  0.0776 -0.028   0.0289  0.0449  0.0117 -0.0333  0.1055
 .......................................
 -0.1368 -0.0058 -0.0713]
"""

The Problem

So here we will use fastText word embeddings for text classification of sentences. For this classification we will use sklean Multi-layer Perceptron classifier (MLP).
The sentences are prepared and inserted into script:

sentences = [['this', 'is', 'the', 'good', 'machine', 'learning', 'book'],
			['this', 'is',  'another', 'machine', 'learning', 'book'],
			['one', 'more', 'new', 'book'],
		
          ['this', 'is', 'about', 'machine', 'learning', 'post'],
          ['orange', 'juice', 'is', 'the', 'liquid', 'extract', 'of', 'fruit'],
          ['orange', 'juice', 'comes', 'in', 'several', 'different', 'varieties'],
          ['this', 'is', 'the', 'last', 'machine', 'learning', 'book'],
          ['orange', 'juice', 'comes', 'in', 'several', 'different', 'packages'],
          ['orange', 'juice', 'is', 'liquid', 'extract', 'from', 'fruit', 'on', 'orange', 'tree']]

The sentences belong to two classes, the labels for classes will be assigned later as 0,1. So our problem is to classify above sentences. Below is the flowchart of the program that we will use for perceptron learning algorithm example.

Text classification using word embeddings
Text classification using word embeddings

Data Preparation

I converted this text input into digital using the following code. Basically I got word embedidings and averaged all words in the sentences. The resulting vector sentence representations were saved to array V.

import numpy as np

def sent_vectorizer(sent, model):
    sent_vec =[]
    numw = 0
    for w in sent:
        try:
            if numw == 0:
                sent_vec = model[w]
            else:
                sent_vec = np.add(sent_vec, model[w])
            numw+=1
        except:
            pass
   
    return np.asarray(sent_vec) / numw


V=[]
for sentence in sentences:
    V.append(sent_vectorizer(sentence, model))   

After converting text into vectors we can divide data into training and testing datasets and attach class labels.

X_train = V[0:6]
X_test = V[6:9] 
          
Y_train = [0, 0, 0, 0, 1,1]
Y_test =  [0,1,1]   

Text Classification

Now it is time to load data to MLP Classifier to do text classification.

from sklearn.neural_network import MLPClassifier
classifier = MLPClassifier(alpha = 0.7, max_iter=400) 
classifier.fit(X_train, Y_train)

df_results = pd.DataFrame(data=np.zeros(shape=(1,3)), columns = ['classifier', 'train_score', 'test_score'] )
train_score = classifier.score(X_train, Y_train)
test_score = classifier.score(X_test, Y_test)

print  (classifier.predict_proba(X_test))
print  (classifier.predict(X_test))

df_results.loc[1,'classifier'] = "MLP"
df_results.loc[1,'train_score'] = train_score
df_results.loc[1,'test_score'] = test_score

print(df_results)
     
"""
Output
  classifier  train_score  test_score
         MLP          1.0         1.0
"""

In this post we learned how to use pretrained fastText word embeddings for converting text data into vector model. We also looked how to load word embeddings into machine learning algorithm. And in the end of post we looked at machine learning text classification using MLP Classifier with our fastText word embeddings. You can find full python source code and references below.

from gensim.models import KeyedVectors
import pandas as pd

model = KeyedVectors.load_word2vec_format('wiki-news-300d-1M.vec')
print (model.most_similar('desk'))

words = []
for word in model.vocab:
    words.append(word)

print("Vector components of a word: {}".format(
    model[words[0]]
))
sentences = [['this', 'is', 'the', 'good', 'machine', 'learning', 'book'],
			['this', 'is',  'another', 'machine', 'learning', 'book'],
			['one', 'more', 'new', 'book'],
	    ['this', 'is', 'about', 'machine', 'learning', 'post'],
          ['orange', 'juice', 'is', 'the', 'liquid', 'extract', 'of', 'fruit'],
          ['orange', 'juice', 'comes', 'in', 'several', 'different', 'varieties'],
          ['this', 'is', 'the', 'last', 'machine', 'learning', 'book'],
          ['orange', 'juice', 'comes', 'in', 'several', 'different', 'packages'],
          ['orange', 'juice', 'is', 'liquid', 'extract', 'from', 'fruit', 'on', 'orange', 'tree']]
         
import numpy as np

def sent_vectorizer(sent, model):
    sent_vec =[]
    numw = 0
    for w in sent:
        try:
            if numw == 0:
                sent_vec = model[w]
            else:
                sent_vec = np.add(sent_vec, model[w])
            numw+=1
        except:
            pass
   
    return np.asarray(sent_vec) / numw

V=[]
for sentence in sentences:
    V.append(sent_vectorizer(sentence, model))   
         
    
X_train = V[0:6]
X_test = V[6:9] 
Y_train = [0, 0, 0, 0, 1,1]
Y_test =  [0,1,1]    
    
    
from sklearn.neural_network import MLPClassifier
classifier = MLPClassifier(alpha = 0.7, max_iter=400) 
classifier.fit(X_train, Y_train)

df_results = pd.DataFrame(data=np.zeros(shape=(1,3)), columns = ['classifier', 'train_score', 'test_score'] )
train_score = classifier.score(X_train, Y_train)
test_score = classifier.score(X_test, Y_test)

print  (classifier.predict_proba(X_test))
print  (classifier.predict(X_test))

df_results.loc[1,'classifier'] = "MLP"
df_results.loc[1,'train_score'] = train_score
df_results.loc[1,'test_score'] = test_score
print(df_results)

References
1. fasttext.cc
2. fastText
3.
Classification with scikit learn
4. english-vectors
5. How to use pre-trained word vectors from Facebook’s fastText
6. What is the main difference between word2vec and fastText?

How to Convert Word to Vector with GloVe and Python

In the previous post we looked at Vector Representation of Text with word embeddings using word2vec. Another approach that can be used to convert word to vector is to use GloVe – Global Vectors for Word Representation. Per documentation from home page of GloVe [1] “GloVe is an unsupervised learning algorithm for obtaining vector representations for words. Training is performed on aggregated global word-word co-occurrence statistics from a corpus”. Thus we can convert word to vector using GloVe.

At this post we will look how to use pretrained GloVe data file that can be downloaded from [1].
word embeddings GloVe We will look how to get word vector representation from this downloaded datafile. We will also look how to get nearest words. Why do we need vector representation of text? Because this is what we input to machine learning or data science algorithms – we feed numerical vectors to algorithms such as text classification, machine learning clustering or other text analytics algorithms.

Loading Glove Datafile

The code that I put here is based on some examples that I found on StackOverflow [2].

So first you need to open the file and load data into the model. Then you can get the vector representation and other things.

Below is the full source code for glove python script:

file = "C:\\Users\\glove\\glove.6B.50d.txt"
import numpy as np
def loadGloveModel(gloveFile):
    print ("Loading Glove Model")
   
    
    with open(gloveFile, encoding="utf8" ) as f:
       content = f.readlines()
    model = {}
    for line in content:
        splitLine = line.split()
        word = splitLine[0]
        embedding = np.array([float(val) for val in splitLine[1:]])
        model[word] = embedding
    print ("Done.",len(model)," words loaded!")
    return model
    
    
model= loadGloveModel(file)   

print (model['hello'])

"""
Below is the output of the above code
Loading Glove Model
Done. 400000  words loaded!
[-0.38497   0.80092   0.064106 -0.28355  -0.026759 -0.34532  -0.64253
 -0.11729  -0.33257   0.55243  -0.087813  0.9035    0.47102   0.56657
  0.6985   -0.35229  -0.86542   0.90573   0.03576  -0.071705 -0.12327
  0.54923   0.47005   0.35572   1.2611   -0.67581  -0.94983   0.68666
  0.3871   -1.3492    0.63512   0.46416  -0.48814   0.83827  -0.9246
 -0.33722   0.53741  -1.0616   -0.081403 -0.67111   0.30923  -0.3923
 -0.55002  -0.68827   0.58049  -0.11626   0.013139 -0.57654   0.048833
  0.67204 ]
"""  

So we got numerical representation of word ‘hello’.
We can use also pandas to load GloVe file. Below are functions for loading with pandas and getting vector information.

import pandas as pd
import csv

words = pd.read_table(file, sep=" ", index_col=0, header=None, quoting=csv.QUOTE_NONE)


def vec(w):
  return words.loc[w].as_matrix()
 

print (vec('hello'))    #this will print same as print (model['hello'])  before
 

Finding Closest Word or Words

Now how do we find closest word to word “table”? We iterate through pandas dataframe, find deltas and then use numpy argmin function.
The closest word to some word will be always this word itself (as delta = 0) so I needed to drop the word ‘table’ and also next closest word ‘tables’. The final output for the closest word was “place”

words = words.drop("table", axis=0)  
words = words.drop("tables", axis=0)  

words_matrix = words.as_matrix()

def find_closest_word(v):
  diff = words_matrix - v
  delta = np.sum(diff * diff, axis=1)
  i = np.argmin(delta)
  return words.iloc[i].name 


print (find_closest_word(model['table']))
#output:  place

#If we want retrieve more than one closest words here is the function:

def find_N_closest_word(v, N, words):
  Nwords=[]  
  for w in range(N):  
     diff = words.as_matrix() - v
     delta = np.sum(diff * diff, axis=1)
     i = np.argmin(delta)
     Nwords.append(words.iloc[i].name)
     words = words.drop(words.iloc[i].name, axis=0)
    
  return Nwords
  
  
print (find_N_closest_word(model['table'], 10, words)) 

#Output:
#['table', 'tables', 'place', 'sit', 'set', 'hold', 'setting', 'here', 'placing', 'bottom']

We can also use gensim word2vec library functionalities after we load GloVe file.

from gensim.scripts.glove2word2vec import glove2word2vec
glove2word2vec(glove_input_file=file, word2vec_output_file="gensim_glove_vectors.txt")

###Finally, read the word2vec txt to a gensim model using KeyedVectors:

from gensim.models.keyedvectors import KeyedVectors
glove_model = KeyedVectors.load_word2vec_format("gensim_glove_vectors.txt", binary=False)

Difference between word2vec and GloVe

Both models learn geometrical encodings (vectors) of words from their co-occurrence information. They differ in the way how they learn this information. word2vec is using a “predictive” model (feed-forward neural network), whereas GloVe is using a “count-based” model (dimensionality reduction on the co-occurrence counts matrix). [3]

I hope you enjoyed reading this post about how to convert word to vector with GloVe and python. If you have any tips or anything else to add, please leave a comment below.

References
1. GloVe: Global Vectors for Word Representation
2. Load pretrained glove vectors in python
3. How is GloVe different from word2vec
4. Don’t count, predict! A systematic comparison of
context-counting vs. context-predicting semantic vectors

5. Words Embeddings