Vector Representation of Text – Word Embeddings with word2vec

Computers cannot understand raw text. We need to convert text into numerical vectors before any kind of text analysis, such as text clustering or classification. The classical, well-known model is bag of words (BOW). With this model we have one dimension per unique word in the vocabulary, and we represent each document as a vector of 0s and 1s: 1 if the word from the vocabulary occurs in the document, 0 otherwise.
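For example, here is a quick sketch of the BOW idea using scikit-learn (CountVectorizer with binary=True produces exactly the 0/1 vectors described above; the two example documents are taken from the input sentences used later in this post):

from sklearn.feature_extraction.text import CountVectorizer

docs = ["this is the good machine learning book",
        "this is the new post"]

vectorizer = CountVectorizer(binary=True)   # 1 if the word occurs in the document, 0 otherwise
X = vectorizer.fit_transform(docs)
print (vectorizer.vocabulary_)              # word -> column index
print (X.toarray())                         # one row per document, one column per vocabulary word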

Recently, word embedding models have gained popularity in machine learning because they preserve semantic information. With word embeddings we can also get lower dimensionality than with the BOW model. There are several such models, for example GloVe and word2vec, that are used in machine learning text analysis.

Many examples on the web show how to work at the word level with word embedding methods, but in most cases we work at the document level (sentence, paragraph or full document). To understand how word embeddings can be used for text analytics, I decided to take word2vec and create a small practical example.

In this post you will learn how to use the word2vec word embedding method to convert a sentence into a numerical vector. The same technique can be used for text with more than one sentence. We will create a Python script that converts sentences into numerical vectors.

Input

As input for this script we will use sentences hard coded in the script. The sentences are already tokenized. Below you can find the sentences for our input. Note that sentences 6 and 7 are more distinct from the other sentences.

1 [['this', 'is', 'the', 'good', 'machine', 'learning', 'book'],
2 ['this', 'is',  'another', 'book'],
3 ['one', 'more', 'book'],
4 ['this', 'is', 'the', 'new', 'post'],
5 ['this', 'is', 'about', 'machine', 'learning', 'post'], 
6 ['orange', 'juice', 'is', 'the', 'liquid', 'extract', 'of', 'the', 'fruit'],
7 ['orange', 'juice', 'comes', 'in', 'several', 'different', 'varieties'],
8 ['and', 'this', 'is', 'the', 'last', 'post']]

With word2vec you have two options:
1. Create your own word2vec
2. Use pretrained data from Google

From word to sentence

Each word in a word embedding model is represented by a vector. But suppose we are working with tweets from Twitter and need to know how similar or dissimilar the tweets are. Then we need a vector representation of the whole text of a tweet. To achieve this we can average the word embeddings of all the words in a sentence (or tweet, or paragraph). The idea comes from paper [1], where the authors averaged word embeddings to get a paragraph vector.
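As a quick sketch of the idea (using the same older gensim-style API as the listings below, and assuming at least one word of the sentence is in the vocabulary), the sentence vector is simply the element-wise average of its word vectors; Listings A and B give complete versions:

import numpy as np

def average_vector(sentence, model):
    # average the vectors of the words that the model knows about
    # (gensim < 4.0 style access; newer versions use model.wv)
    vectors = [model[w] for w in sentence if w in model.vocab]
    return np.mean(vectors, axis=0)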

Source code for conversion


Below, in Listing A and Listing B, you can find how we can average word embeddings and get numerical vectors.
Listing A has the Python source code for using your own word embeddings.
Listing B has the Python source code for using pretrained word embeddings from Google.
The script in Listing B takes the embeddings from a local file that was downloaded from Google beforehand. You can find more details on downloading word embeddings from Google in the post Using Pretrained Word Embeddings in Machine Learning.

When averaging embeddings I used only the first 50 dimensions. This is the minimal number that was used in one of the papers; the usual recommendation is to use between 100 and 400 dimensions.

Analysis of Results

How do we know that our results are good? We will do a quick check as follows: we will calculate the distance (similarity measure) between vectors and compare it with our expectations. If two sentences belong to different contexts, we expect the distance between them to be larger; if they are close in meaning, the distance should be smaller. Because the context of sentences 6 and 7 is different from the other sentences, we would expect to see this difference in the results.

For calculating distance we use the cosine measure in the script. With the cosine measure, the most similar pair is the one with the highest cosine value. The results are below.
Note that 0 values mean the cosine value was not calculated because there was no need to: the value was either already calculated (for example doc21 = doc12) or lies on the diagonal.
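For reference, here is a small helper equivalent to the dot/norm computation used in the listings (cosine similarity of two sentence vectors):

import numpy as np

def cosine_similarity(u, v):
    # cos(theta) = (u . v) / (|u| * |v|); higher means more similar
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))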

Results from Listing A (using our own word embeddings):
 1   2    3    4    5    6    7    8
1[0, 0.5, 0.1, 0.5, 0.6, 0.4, 0.2, 0.4],
2[0, 0,   0.2, 0.6, 0.5, 0.2, 0.1, 0.5],
3[0, 0,   0,   0.0, 0.0, 0.0, 0.1, 0.0],
4[0, 0,   0,   0,   0.6, 0.5, 0.3, 0.7],
5[0, 0,   0,   0,   0,   0.2, 0.2, 0.6],
6[0, 0,   0,   0,   0,   0,   0.4, 0.4], 
7[0, 0,   0,   0,   0,   0,   0,   0.3], 
8[0, 0,   0,   0,   0,   0,   0,   0]


Results from Listing B (using pretrained dataset):
  1  2     3     4     5     6     7     8
1[0, 0.77, 0.33, 0.57, 0.78, 0.35, 0.37, 0.55],
2[0, 0,    0.60, 0.62, 0.51, 0.31, 0.29, 0.59],
3[0, 0,    0,    0.16, 0.12, 0.18, 0.25, 0.11], 
4[0, 0,    0,    0,    0.62, 0.41, 0.37, 0.89],
5[0, 0,    0,    0,    0,    0.35, 0.27, 0.61], 
6[0, 0,    0,    0,    0,    0,    0.81, 0.37], 
7[0, 0,    0,    0,    0,    0,    0,    0.32],
8[0, 0,    0,    0,    0,    0,    0,    0]]

Looking at the results we can see that our expectations are confirmed, especially in the results where pretrained word embeddings were used. Sentences 6 and 7 have low similarity with the other sentences but high similarity (0.81) with each other.

Conclusion

In this post we considered how to represent a document (sentence, paragraph) as a vector of numbers using the word2vec word embedding model. We looked at two possible ways: using our own embeddings and using pretrained embeddings from Google. We got results for our small example and were able to evaluate them.

Now we can feed vector representation of text into machine learning text analysis algorithms.

Here are a few posts where you can find how to feed word2vec word embeddings into text clustering algorithms such as k-means from the NLTK and scikit-learn libraries, and how to plot data with t-SNE:
K Means Clustering Example with Word2Vec in Data Mining or Machine Learning
Text Clustering with Word Embedding in Machine Learning

Below are a few links for other word embedding models that are also widely used:
GloVe –
How to Convert Word to Vector with GloVe and Python
fastText –
FastText Word Embeddings

I hope you enjoyed this post about representing text as a vector using word2vec. If you have any tips or anything else to add, please leave a comment in the reply box.

Listing A. Here is the Python source code for using your own word embeddings.

from gensim.models import Word2Vec
import numpy as np

sentences = [['this', 'is', 'the', 'good', 'machine', 'learning', 'book'],
             ['this', 'is', 'another', 'book'],
             ['one', 'more', 'book'],
             ['this', 'is', 'the', 'new', 'post'],
             ['this', 'is', 'about', 'machine', 'learning', 'post'],
             ['orange', 'juice', 'is', 'the', 'liquid', 'extract', 'of', 'the', 'fruit'],
             ['orange', 'juice', 'comes', 'in', 'several', 'different', 'varieties'],
             ['and', 'this', 'is', 'the', 'last', 'post']]

# train our own word2vec model on the sentences
# (gensim < 4.0 API; newer versions use vector_size= and model.wv)
model = Word2Vec(sentences, min_count=1, size=100)
vocab = model.vocab.keys()
wordsInVocab = len(vocab)
print (model.similarity('post', 'book'))

def sent_vectorizer(sent, model):
    # sum the word vectors of the sentence and normalize to unit length
    # (same direction as the average of the word vectors)
    sent_vec = np.zeros(100)
    numw = 0
    for w in sent:
        try:
            sent_vec = np.add(sent_vec, model[w])
            numw += 1
        except:
            pass
    return sent_vec / np.sqrt(sent_vec.dot(sent_vec))

V = []
for sentence in sentences:
    V.append(sent_vectorizer(sentence, model))

from numpy import dot
from numpy.linalg import norm

# cosine similarity for each pair of sentence vectors (upper triangle only)
results = [[0 for i in range(len(V))] for j in range(len(V))]

for i in range(len(V) - 1):
    for j in range(i + 1, len(V)):
        results[i][j] = dot(V[i], V[j]) / (norm(V[i]) * norm(V[j]))

print (results)

Listing B. Here is the Python source code for using pretrained word embeddings from Google.

import gensim
import numpy as np

# load the pretrained Google News vectors from a local file
# (older gensim API; in newer versions use gensim.models.KeyedVectors.load_word2vec_format)
model = gensim.models.Word2Vec.load_word2vec_format('C:\\Users\\Downloads\\GoogleNews-vectors-negative300.bin', binary=True)

sentences = [['this', 'is', 'the', 'good', 'machine', 'learning', 'book'],
             ['this', 'is', 'another', 'book'],
             ['one', 'more', 'book'],
             ['this', 'is', 'the', 'new', 'post'],
             ['this', 'is', 'about', 'machine', 'learning', 'post'],
             ['orange', 'juice', 'is', 'the', 'liquid', 'extract', 'of', 'the', 'fruit'],
             ['orange', 'juice', 'comes', 'in', 'several', 'different', 'varieties'],
             ['and', 'this', 'is', 'the', 'last', 'post']]

vocab = model.vocab.keys()
wordsInVocab = len(vocab)

def sent_vectorizer(sent, model):
    # sum the first 50 dimensions of each word vector and normalize to unit length
    sent_vec = np.zeros(50)
    numw = 0
    for w in sent:
        try:
            vc = model[w]
            vc = vc[0:50]        # keep only the first 50 dimensions
            sent_vec = np.add(sent_vec, vc)
            numw += 1
        except:
            pass
    return sent_vec / np.sqrt(sent_vec.dot(sent_vec))

V = []
for sentence in sentences:
    V.append(sent_vectorizer(sentence, model))

from numpy import dot
from numpy.linalg import norm

# cosine similarity for each pair of sentence vectors (upper triangle only)
results = [[0 for i in range(len(V))] for j in range(len(V))]

for i in range(len(V) - 1):
    for j in range(i + 1, len(V)):
        results[i][j] = dot(V[i], V[j]) / (norm(V[i]) * norm(V[j]))

print (results)

References
1. Document Embedding with Paragraph Vectors

Sentiment Analysis of Twitter Data

Sentiment analysis of text (or opinion mining) allows us to extract opinions from user comments on the web. Applications of sentiment analysis include understanding what customers think about a product or product features and discovering user reactions to certain events.

A basic task in sentiment analysis of text is classifying the polarity of a given text. Polarity can be classified as positive, negative, or neutral.

Advanced, “beyond polarity” sentiment classification looks at emotional states such as “angry”, “sad”, and “happy”. [1]

In this post you will find an example of how to calculate polarity in sentiment analysis for Twitter data using Python. Polarity in this example will have two labels: positive or negative.
At the end of this post you will also find links to several of the most comprehensive posts from other websites on the topic of Twitter sentiment analysis.

Dataset for Sentiment Analysis of Twitter Data

We will use a dataset from Twitter that can be downloaded from this link [3] from CrowdFlower [4]. The dataset contains labels for the emotional content (such as happiness, sadness, and anger) of texts: about 40,000 rows of examples across 13 labels. A subset of this data was used in an experiment for Microsoft’s Cortana Intelligence Gallery.
The dataset has 4 columns (a quick way to peek at the label distribution is sketched after this list):
tweet_id
sentiment (for example happy, sad)
author
content
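Here is a hedged sketch of such a peek with pandas (the file name matches the script further down; adjust the path to your own download location):

import pandas as pd

df = pd.read_csv("C:\\Users\\Downloads\\text_emotion.csv")   # path is an assumption
print (df.shape)                         # roughly 40000 rows, 4 columns
print (df['sentiment'].value_counts())   # number of tweets per emotion label (13 labels)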

Preprocessing of Twitter Data

We will remove some special characters and links using the function below, which is based on an example found on the Internet.

import re
# below function is based on example from 
# http://www.geeksforgeeks.org/twitter-sentiment-analysis-using-python/
def clean_tweet( tweet):
        '''
        Utility function to clean tweet text by removing links, special characters
        using simple regex statements.
        '''
        tweet = tweet.lower() 
        return ' '.join(re.sub("(@[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+:\/\/\S+)", " ", tweet).split())
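A quick illustration with a made-up tweet (the handle and URL are placeholders):

print (clean_tweet("@SomeUser loving this new phone!! http://example.com #happy"))
# -> "loving this new phone happy"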

We also remove stop words as below:

from many_stop_words import get_stop_words
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from itertools import chain

from nltk.classify import NaiveBayesClassifier, accuracy

stop_words = list(get_stop_words('en'))         # about 900 stop words
nltk_words = list(stopwords.words('english'))   # about 150 stop words
stop_words.extend(nltk_words)

def remove_stopwords(word_list):
        # keep only the words that are not in the combined stop word list
        filtered_tweet = ""
        for word in word_list:
            word = word.lower()
            if word not in stop_words:
                filtered_tweet = filtered_tweet + " " + word
        return filtered_tweet.lstrip()
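For example (the exact output depends on the combined stop word list):

# stop words such as "this", "is", "about" are dropped;
# content words like "machine", "learning", "post" are kept
print (remove_stopwords("this is about machine learning post".split()))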

Approach for Tweet Sentiment Analysis

We will divide the tweet data into training and testing datasets. To train a classifier to detect the polarity of the content column, we will use the training dataset with the content (X) and sentiment (Y) fields.

As we already have an emotion column for the tweets, we do not need to do feature selection for classification.

However, we will map the 13 emotion categories to positive, negative and neutral, and skip the neutral ones.

Here is how we do the mapping in the script:

polarity = {'empty' : 'N',
                'sadness' : 'N',
                'enthusiasm' : 'P',
                'neutral' : 'neutral',
                'worry' : 'N',
                'surprise' : 'P',
                'love' : 'P',
                'fun' : 'P',
                'hate' : 'N',
                'happiness' : 'P',
                'boredom' : 'N',
                'relief' : 'P',
                'anger' : 'N'
         }  

Text Classification – Using NLTK for Sentiment Analysis

There are different classification techniques that can be utilized in sentiment analysis; a detailed survey of methods was published in paper [2]. That paper also contains an accuracy comparison and a description of the sentiment analysis process.

Our task is to train a classifier to detect the polarity (negative or positive) of unseen tweets.
We will use the NLTK NaiveBayesClassifier algorithm.

For NLTK we do not need to convert the text to numeric vectors as we do for scikit-learn. We just need to tokenize the text and then feed it into the machine learning classification algorithm.

Our training data consists of the tweet words and a polarity label (P or N) for each tweet. Here is how it looks:

[Figure: vocabulary for sentiment analysis of Twitter data with NLTK]

From the vocabulary we build the feature set for the Naive Bayes classifier that we are going to use. In our model each word of the vocabulary is treated as a feature. Each tweet is “projected” onto the vocabulary: a vocabulary word gets the value True if it occurs in the given tweet, and False otherwise. At the end of each feature dictionary we keep the polarity label of the tweet.
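As a toy illustration (made-up tweet and a tiny vocabulary), a single training example in this format looks like this:

toy_vocabulary = ['book', 'headache', 'love', 'post']   # tiny vocabulary, for illustration only
tweet, tag = 'love this new book', 'P'
tokens = tweet.lower().split()
feature_dict = {word: (word in tokens) for word in toy_vocabulary}
print ((feature_dict, tag))
# ({'book': True, 'headache': False, 'love': True, 'post': False}, 'P')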

Below is a screenshot of the feature set; the polarity label (N or P) is highlighted, and the vocabulary was reduced to just 10 tweets for this picture.

[Figure: sentiment analysis of Twitter data – feature set]

# build the vocabulary from all training tweets
vocabulary = set(chain(*[word_tokenize(i[0].lower()) for i in training_data]))
# for each tweet: True/False for every vocabulary word, plus the polarity tag
feature_set = [({i:(i in word_tokenize(sentence.lower())) for i in vocabulary},tag) for sentence, tag in training_data]
# hold out 20% of the data for testing
size = int(len(feature_set) * 0.2)
train_set, test_set = feature_set[size:], feature_set[:size]

classifier = NaiveBayesClassifier.train(train_set)
print(accuracy(classifier, test_set))
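Once trained, the classifier can be asked to label a new, unseen tweet. Here is a hedged example (the tweet text is made up; it reuses clean_tweet, word_tokenize, vocabulary and classifier defined above):

new_tweet = "i love this book so much"                # made-up tweet, for illustration
new_tokens = word_tokenize(clean_tweet(new_tweet))
features = {word: (word in new_tokens) for word in vocabulary}
print (classifier.classify(features))                 # prints 'P' or 'N'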

Results of Tweet Sentiment Analysis

Here are the results of running the Python source code described above:
Accuracy: 73%
The run time was as long as 50 minutes, even though the data sample was limited to 1000 rows; this may be because the laptop has only 6 GB of memory.

So we learned how to detect negative or positive polarity for sentiment analysis of Twitter data. The results show that some improvements are still needed. For example, we could better preprocess the Twitter data by transforming Twitter slang words and short forms into regular words, as sketched below.
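Here is a minimal sketch of such a normalization step (the slang dictionary entries are illustrative; a real one would be much larger):

# map a few common Twitter short forms to regular words before cleaning
SLANG = {"u": "you", "r": "are", "gr8": "great", "2moro": "tomorrow"}

def normalize_slang(tweet):
    return ' '.join(SLANG.get(word, word) for word in tweet.lower().split())

print (normalize_slang("u r gr8"))   # -> "you are great"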

Below you can find the full Python source code.

# sentiment analysis of text twitter data
import re


# below function is based on http://www.geeksforgeeks.org/twitter-sentiment-analysis-using-python/
def clean_tweet( tweet):
        '''
        Utility function to clean tweet text by removing links, special characters
        using simple regex statements.
        '''
        tweet = tweet.lower() 
        return ' '.join(re.sub("(@[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+:\/\/\S+)", " ", tweet).split())
    

# below few lines are from https://stackoverflow.com/questions/5486337/how-to-remove-stop-words-using-nltk-or-python   
from many_stop_words import get_stop_words
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from itertools import chain

from nltk.classify import NaiveBayesClassifier, accuracy
stop_words = list(get_stop_words('en'))         #About 900 stopwords
nltk_words = list(stopwords.words('english'))   #About 150 stopwords
stop_words.extend(nltk_words)

def remove_stopwords(word_list):
        # keep only the words that are not in the combined stop word list
        filtered_tweet = ""
        for word in word_list:
            word = word.lower()   # in case they are not all lower cased
            if word not in stop_words:
                filtered_tweet = filtered_tweet + " " + word
        return filtered_tweet.lstrip()
    

filefolder="C:\\Users\\Downloads"
filename=filefolder + "\\text_emotion.csv"
   
polarity = {'empty' : 'N',
                'sadness' : 'N',
                'enthusiasm' : 'P',
                'neutral' : 'neutral',
                'worry' : 'N',
                'surprise' : 'P',
                'love' : 'P',
                'fun' : 'P',
                'hate' : 'N',
                'happiness' : 'P',
                'boredom' : 'N',
                'relief' : 'P',
                'anger' : 'N'
         }  
   
tweets = []
training_data = []
import csv
with open(filename) as csvDataFile:
    csvReader = csv.reader(csvDataFile)
    count = 0
    for row in csvReader:
        # skip the header row and the tweets labeled as neutral
        if (row[1] == 'neutral' or row[1] == 'sentiment'):
            continue
        tweet = clean_tweet(row[3])
        tweet = remove_stopwords(tweet.split())
        tweets.append(tweet)
        training_data.append([tweet, polarity[row[1]]])
        count = count + 1
        # limit the sample to 1000 tweets to keep the run time manageable
        if (count > 1000):
            break
        
print (training_data)
vocabulary = set(chain(*[word_tokenize(i[0].lower()) for i in training_data]))

feature_set = [({i:(i in word_tokenize(sentence.lower())) for i in vocabulary},tag) for sentence, tag in training_data]

size = int(len(feature_set) * 0.2)
train_set, test_set = feature_set[size:], feature_set[:size]

classifier = NaiveBayesClassifier.train(train_set)
print(accuracy(classifier, test_set))

External Resources for Twitter Sentiment Analysis Tutorial

Comprehensive Hands on Guide to Twitter Sentiment Analysis with dataset and code
The author of this article shows how to solve the Twitter Sentiment Analysis Practice Problem.

Another Twitter sentiment analysis with Python — Part 1. This is post 1 of a series of 11 posts all about Twitter sentiment analysis with Python and related concepts. The posts cover topics such as word embeddings and neural networks. Below are just 2 posts from this series.

Another Twitter sentiment analysis with Python — Part 10 (Neural Network with Doc2Vec/Word2Vec/GloVe)

Another Twitter sentiment analysis with Python — Part 11 (CNN + Word2Vec)

Yet Another Twitter Sentiment Analysis Part 1 — tackling class imbalance

Basic data analysis on Twitter with Python – Here you will find a simple data analysis program that takes a given number of tweets, analyzes them, and displays the data in a scatter plot. The data represent how Twitter users perceived the bot created by the author, and their sentiment.

References
1. Sentiment Analysis
2. Analysis of Various Sentiment Classification Techniques
3. Emotion Dataset
4. Data for Everyone

K Means Clustering Example with Word2Vec in Data Mining or Machine Learning

In this post you will find a k-means clustering example with word2vec in Python code. Word2Vec is one of the popular methods for language modeling and feature learning in natural language processing (NLP). It is used to create word embeddings in machine learning whenever we need a vector representation of data.

For example, in data clustering algorithms we can use Word2Vec instead of the bag of words (BOW) model. The advantage of using Word2Vec is that it can capture the distance between individual words.

The example in this post will demonstrate how to use the results of Word2Vec word embeddings in clustering algorithms. For this, the Word2Vec model will be fed into several k-means clustering algorithms from the NLTK and scikit-learn libraries.

Here we will do clustering at the word level, so our clusters will be groups of words. In case we need to cluster at the sentence or paragraph level, here is a link showing how to move from the word level to the sentence/paragraph level:

Text Clustering with Word Embedding in Machine Learning

There is also the doc2vec embedding model, which is based on word2vec and was created for embedding sentences/paragraphs/documents. Here is a link on how to use the doc2vec embedding in machine learning:
Text Clustering with doc2vec Word Embedding Machine Learning Model

Getting Word2vec

Using word2vec from the Python library gensim is simple and well described in tutorials and on the web [3], [4], [5]. Here we just look at a basic example. For the input we use a sequence of sentences hard-coded in the script.

from gensim.models import Word2Vec

sentences = [['this', 'is', 'the', 'good', 'machine', 'learning', 'book'],
             ['this', 'is', 'another', 'book'],
             ['one', 'more', 'book'],
             ['this', 'is', 'the', 'new', 'post'],
             ['this', 'is', 'about', 'machine', 'learning', 'post'],
             ['and', 'this', 'is', 'the', 'last', 'post']]

# train the model (gensim < 4.0 API)
model = Word2Vec(sentences, min_count=1)

Now we have a model with the words embedded. We can query the model for similar words as below, or ask it to represent a word as a vector:

print (model.similarity('this', 'is'))
print (model.similarity('post', 'book'))
#output -0.0198180344218
#output -0.079446731287
print (model.most_similar(positive=['machine'], negative=[], topn=2))
#output: [('new', 0.24608060717582703), ('is', 0.06899910420179367)]
print (model['the'])
#output [-0.00217354 -0.00237131  0.00296396 ...,  0.00138597  0.00291924  0.00409528]

To get the vocabulary or the number of words in the vocabulary:

print (list(model.vocab))
print (len(list(model.vocab)))

This will produce: [‘good’, ‘this’, ‘post’, ‘another’, ‘learning’, ‘last’, ‘the’, ‘and’, ‘more’, ‘new’, ‘is’, ‘one’, ‘about’, ‘machine’, ‘book’]

Now we will feed the word embeddings into a clustering algorithm such as k-means, which is one of the most popular unsupervised learning algorithms for finding interesting segments in data. It can be used for separating customers into groups, combining documents into topics, and many other applications.

You will find below two k means clustering examples.

K Means Clustering with NLTK Library
Our first example uses the k-means algorithm from the NLTK library.
To use word2vec word embeddings in the clustering algorithms we initialize X as below:

X = model[model.vocab]

Now we can plug our X data into clustering algorithms.

from nltk.cluster import KMeansClusterer
import nltk
NUM_CLUSTERS=3
kclusterer = KMeansClusterer(NUM_CLUSTERS, distance=nltk.cluster.util.cosine_distance, repeats=25)
assigned_clusters = kclusterer.cluster(X, assign_clusters=True)
print (assigned_clusters)
# output: [0, 2, 1, 2, 2, 1, 2, 2, 0, 1, 0, 1, 2, 1, 2]

In the Python code above there are several options for the distance function:

nltk.cluster.util.cosine_distance(u, v)
Returns 1 minus the cosine of the angle between vectors v and u. This is equal to 1 – (u.v / |u||v|).

nltk.cluster.util.euclidean_distance(u, v)
Returns the euclidean distance between vectors u and v. This is equivalent to the length of the vector (u – v).

Here we use the cosine distance to cluster our data.
After we get the cluster results we can associate each word with the cluster it was assigned to:

words = list(model.vocab)
for i, word in enumerate(words):  
    print (word + ":" + str(assigned_clusters[i]))

Here is the output for the above:
good:0
this:2
post:1
another:2
learning:2
last:1
the:2
and:2
more:0
new:1
is:0
one:1
about:2
machine:1
book:2

K Means Clustering with Scikit-learn Library

This example is based on the k-means implementation from the scikit-learn library.

from sklearn import cluster
from sklearn import metrics
kmeans = cluster.KMeans(n_clusters=NUM_CLUSTERS)
kmeans.fit(X)

labels = kmeans.labels_
centroids = kmeans.cluster_centers_

print ("Cluster id labels for inputted data")
print (labels)
print ("Centroids data")
print (centroids)

print ("Score (Opposite of the value of X on the K-means objective which is Sum of distances of samples to their closest cluster center):")
print (kmeans.score(X))

silhouette_score = metrics.silhouette_score(X, labels, metric='euclidean')

print ("Silhouette_score: ")
print (silhouette_score)

In this example we also get some useful metrics for estimating clustering performance.
Output:

Cluster id labels for inputted data
[0 1 1 ..., 1 2 2]
Centroids data
[[ -3.82586889e-04   1.39791325e-03  -2.13839358e-03 ...,  -8.68172920e-04
   -1.23599875e-03   1.80053393e-03]
 [ -3.11774168e-04  -1.63297475e-03   1.76715955e-03 ...,  -1.43826099e-03
    1.22940990e-03   1.06353679e-03]
 [  1.91571176e-04   6.40696089e-04   1.38173658e-03 ...,  -3.26442620e-03
   -1.08828480e-03  -9.43636987e-05]]

Score (Opposite of the value of X on the K-means objective which is Sum of distances of samples to their closest cluster center):
-0.00894730946094
Silhouette_score: 
0.0427737
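To see which word landed in which scikit-learn cluster, the labels can be zipped with the vocabulary in the same way as for the NLTK clusterer above (a small sketch, using the same older gensim API as the rest of the post):

words = list(model.vocab)
for word, label in zip(words, labels):
    print (word + ":" + str(label))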

Here is the full python code of the script.

# -*- coding: utf-8 -*-



from gensim.models import Word2Vec

from nltk.cluster import KMeansClusterer
import nltk


from sklearn import cluster
from sklearn import metrics

# training data

sentences = [['this', 'is', 'the', 'good', 'machine', 'learning', 'book'],
             ['this', 'is', 'another', 'book'],
             ['one', 'more', 'book'],
             ['this', 'is', 'the', 'new', 'post'],
             ['this', 'is', 'about', 'machine', 'learning', 'post'],
             ['and', 'this', 'is', 'the', 'last', 'post']]


# training model
model = Word2Vec(sentences, min_count=1)

# get vector data
X = model[model.vocab]
print (X)

print (model.similarity('this', 'is'))

print (model.similarity('post', 'book'))

print (model.most_similar(positive=['machine'], negative=[], topn=2))

print (model['the'])

print (list(model.vocab))

print (len(list(model.vocab)))




NUM_CLUSTERS=3
kclusterer = KMeansClusterer(NUM_CLUSTERS, distance=nltk.cluster.util.cosine_distance, repeats=25)
assigned_clusters = kclusterer.cluster(X, assign_clusters=True)
print (assigned_clusters)

words = list(model.vocab)
for i, word in enumerate(words):  
    print (word + ":" + str(assigned_clusters[i]))



kmeans = cluster.KMeans(n_clusters=NUM_CLUSTERS)
kmeans.fit(X)

labels = kmeans.labels_
centroids = kmeans.cluster_centers_

print ("Cluster id labels for inputted data")
print (labels)
print ("Centroids data")
print (centroids)

print ("Score (Opposite of the value of X on the K-means objective which is Sum of distances of samples to their closest cluster center):")
print (kmeans.score(X))

silhouette_score = metrics.silhouette_score(X, labels, metric='euclidean')

print ("Silhouette_score: ")
print (silhouette_score)

References
1. Word embedding
2. Comparative study of word embedding methods in topic segmentation
3. models.word2vec – Deep learning with word2vec
4. Word2vec Tutorial
5. How to Develop Word Embeddings in Python with Gensim
6. nltk.cluster package

Using Pretrained Word Embeddings in Machine Learning

In this post you will learn how to use pretrained word embeddings in machine learning. Google provides a News corpus (3 billion running words) word vector model (3 million 300-dimensional English word vectors).

Download the file from this link word2vec-GoogleNews-vectors and save it in some local folder. Open it with a zip program and extract the .bin file, so that instead of the file GoogleNews-vectors-negative300.bin.gz you have the file GoogleNews-vectors-negative300.bin.

Now you can use the snippet below to load this file using gensim. Change the file path to the actual folder where you saved the file in the previous step.

Gensim
Gensim is a Python library for topic modelling, document indexing and similarity retrieval with large corpora. It is a Python framework for fast vector space modelling.

The Python code snippet below demonstrates how to load the pretrained Google file into the model and then query the model, for example for the similarity between two words.

# -*- coding: utf-8 -*-

import gensim

model = gensim.models.Word2Vec.load_word2vec_format('C:\\Users\\GoogleNews-vectors-negative300.bin', binary=True)  

vocab = model.vocab.keys()
wordsInVocab = len(vocab)
print (wordsInVocab)
print (model.similarity('this', 'is'))
print (model.similarity('post', 'book'))

Output from the above code:
3000000
0.407970363878
0.0572043891977

You can do everything else in the same way as if you were using your own trained word embeddings. The Google file, however, is big: about 1.5 GB compressed and 3.3 GB unzipped. On my 6 GB RAM laptop it took a while to run the code above, but it did run. However, some other commands I was not able to run.
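If memory allows, the pretrained model can also be queried for nearest neighbours (a hedged example; the actual neighbours and scores depend on the Google News vectors):

print (model.most_similar(positive=['book'], topn=3))
# prints a list of (word, similarity) pairs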

See the post K Means Clustering Example with Word2Vec, which shows these embeddings used in a machine learning algorithm. There the Word2Vec model is fed into several k-means clustering algorithms from the NLTK and scikit-learn libraries.

GloVe and fastText Word Embedding in Machine Learning

Word2vec is not the only word embedding available for use. Below are a few links to other word embeddings.
Here, How to Convert Word to Vector with GloVe and Python, you will find how to convert a word to a vector with GloVe – Global Vectors for Word Representation. A detailed example shows how to use a pretrained GloVe data file that can be downloaded.

And one more link is here: FastText Word Embeddings for Text Classification with MLP and Python. In this post you will discover fastText word embeddings – how to load pretrained fastText vectors, get text embeddings and use them in a document classification example.

References
1. Google’s trained Word2Vec model in Python
2. word2vec-GoogleNews-vectors
3. gensim 3.1.0