## Document Similarity in Machine Learning Text Analysis with ELMo

In this post we will look at using ELMo to compute similarity between text documents. ELMo is one of the word embedding techniques that are widely used now. In the previous post we used TF-IDF to calculate similarity between text documents. TF-IDF is based on counting word frequencies. Both techniques can be used for converting text to numbers in information retrieval and machine learning algorithms. A good tutorial that explains how ELMo works and how it is built is Deep Contextualized Word Representations with ELMo.
Another resource is at ELMo.

We will, however, focus on the practical side of computing similarity between text documents with ELMo. Below is the code to accomplish this task. To compute the ELMo embeddings I used a function from the Analytics Vidhya machine learning post at learn-to-use-elmo-to-extract-features-from-text/

We will use the cosine_similarity function from sklearn to calculate similarity between numeric vectors. It computes cosine similarity between samples in X and Y as the normalized dot product of X and Y.
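To make the "normalized dot product" concrete, here is a minimal plain-Python sketch of the same calculation for two vectors. The function name `cosine_sim` and the toy vectors are made up for illustration; sklearn does the same thing for whole matrices at once.

```python
import math

def cosine_sim(u, v):
    """Cosine similarity: dot product divided by the product of the norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# toy vectors, made up for illustration
same_direction = cosine_sim([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
orthogonal = cosine_sim([1.0, 0.0], [0.0, 1.0])
print(same_direction, orthogonal)
```

Vectors pointing in the same direction score close to 1 regardless of their length, and orthogonal vectors score 0.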

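```
# -*- coding: utf-8 -*-

from sklearn.metrics.pairwise import cosine_similarity

import tensorflow_hub as hub
import tensorflow as tf

# Load the pretrained ELMo module (TensorFlow 1.x API).
# Assumption: the module URL below is the public TF Hub ELMo module.
elmo = hub.Module("https://tfhub.dev/google/elmo/2", trainable=True)

def elmo_vectors(x):
    # "elmo" holds one embedding per token of each input sentence
    embeddings = elmo(x, signature="default", as_dict=True)["elmo"]

    with tf.Session() as sess:
        sess.run(tf.global_variables_initializer())
        sess.run(tf.tables_initializer())
        # return average of ELMo features over the token axis
        return sess.run(tf.reduce_mean(embeddings, 1))
```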
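The `tf.reduce_mean(embeddings, 1)` call averages the token embeddings of a sentence into a single vector. As a rough illustration of that averaging step only, here is a plain-Python sketch with made-up 2-dimensional embeddings (`mean_pool` is a hypothetical helper, not part of TensorFlow):

```python
def mean_pool(token_vectors):
    """Average a list of equal-length token vectors into one vector."""
    n_tokens = len(token_vectors)
    dim = len(token_vectors[0])
    return [sum(vec[d] for vec in token_vectors) / n_tokens for d in range(dim)]

# one "sentence" with 3 tokens and 2-dimensional toy embeddings
sentence = [[1.0, 0.0], [3.0, 2.0], [2.0, 4.0]]
sentence_vector = mean_pool(sentence)
print(sentence_vector)  # [2.0, 2.0]
```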

Our input data will be the same as in the previous post for TF-IDF: a collection of sentences as an array. So each document here is represented by just one sentence.

```
corpus = ["I'd like an apple juice",
          "An apple a day keeps the doctor away",
          "Eat apple every day",
          "We use machine learning for text classification",
          "Text classification is subfield of machine learning"]

```

Below we compute the ELMo embedding for each document and collect the results into a matrix for the whole collection. If we print elmo_embeddings for i=0 we get the vector [ 0.02739557 -0.1004054 0.12195794 … -0.06023929 0.19663551 0.3809018 ], which is the numeric representation of the first document.

```
elmo_embeddings = []
print(len(corpus))
for i in range(len(corpus)):
    print(corpus[i])
    # elmo_vectors returns one row per input sentence; keep that single row
    elmo_embeddings.append(elmo_vectors([corpus[i]])[0])

```

Finally we can print the embeddings and the similarity matrix:

```
print(elmo_embeddings)
print(cosine_similarity(elmo_embeddings, elmo_embeddings))

"""
output:
[array([ 0.02739557, -0.1004054 ,  0.12195794, ..., -0.06023929,
0.19663551,  0.3809018 ], dtype=float32), array([ 0.08833811, -0.21392687, -0.0938901 , ..., -0.04924499,
0.08270906,  0.25595033], dtype=float32), array([ 0.45237526, -0.00928468,  0.5245862 , ...,  0.00988374,
-0.03330074,  0.25460464], dtype=float32), array([-0.14745474, -0.25623208,  0.20231596, ..., -0.11443609,
-0.03759   ,  0.18829307], dtype=float32), array([-0.44559947, -0.1429281 , -0.32497618, ...,  0.01917108,
-0.29726124, -0.02022664], dtype=float32), array([-0.2502797 ,  0.09800234, -0.1026585 , ..., -0.22239089,
0.2981896 ,  0.00978719], dtype=float32)]

The similarity matrix is computed as:
[[0.9999998  0.609864   0.574287   0.53863835 0.39638174 0.35737067]
 [0.609864   0.99999976 0.6036072  0.5824003  0.39648792 0.39825168]
 [0.574287   0.6036072  0.9999998  0.7760986  0.3858403  0.33461633]
 [0.53863835 0.5824003  0.7760986  0.9999995  0.4922789  0.35490626]
 [0.39638174 0.39648792 0.3858403  0.4922789  0.99999976 0.73076516]
 [0.35737067 0.39825168 0.33461633 0.35490626 0.73076516 1.0000002 ]]
"""
```
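One way to read such a matrix is to look, for each document, for the other document with the highest score. The sketch below does that in plain Python; `nearest_neighbors` is a hypothetical helper and the 3x3 matrix uses values rounded from the top-left corner of the matrix above.

```python
def nearest_neighbors(sim):
    """For each row, return (row, best_other_row, score), skipping the diagonal."""
    result = []
    for i, row in enumerate(sim):
        candidates = [(score, j) for j, score in enumerate(row) if j != i]
        best_score, best_j = max(candidates)
        result.append((i, best_j, best_score))
    return result

# rounded top-left 3x3 corner of the similarity matrix
sim = [[1.00, 0.61, 0.57],
       [0.61, 1.00, 0.60],
       [0.57, 0.60, 1.00]]
neighbors = nearest_neighbors(sim)
print(neighbors)
```

The diagonal is skipped because every document is always most similar to itself.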

Now we can compare this similarity matrix with the matrix obtained with TF-IDF in the previous post. As expected, they are different.

Thus, we calculated similarity between text documents using ELMo. This post and the previous post about using TF-IDF for the same task are good machine learning exercises, because converting text to numbers and measuring document similarity are used in many algorithms in information retrieval, data science and machine learning.

## Text Clustering with doc2vec Word Embedding Machine Learning Model

In this post we will look at the doc2vec word embedding model and at how to build it or use a pretrained embedding file. As a practical example we will explore how to do text clustering with a doc2vec model.

## Doc2vec

Doc2vec is an unsupervised algorithm that generates vectors for sentences, paragraphs or documents. The algorithm is an adaptation of word2vec, which generates vectors for words. Below you can see frameworks for learning word vector word2vec (left side) and paragraph vector doc2vec (right side). For learning doc2vec, the paragraph vector was added to represent the missing information from the current context and to act as a memory of the topic of the paragraph.

If you need information about word2vec here are some posts:
word2vec –
Vector Representation of Text – Word Embeddings with word2vec
word2vec application –
Text Analytics Techniques with Embeddings
Using Pretrained Word Embeddings in Machine Learning
K Means Clustering Example with Word2Vec in Data Mining or Machine Learning

The vectors generated by doc2vec can be used for tasks like finding similarity between sentences, paragraphs or documents. With doc2vec you can get a vector for a sentence or paragraph directly from the model, without the additional computation you would need with word2vec. For example, here we used a function to go from word level to sentence level:
Text Clustering with Word Embedding in Machine Learning

word2vec was very successful and it inspired the idea of converting many other specific kinds of text to vectors. It can be called “anything to vector”. So there are many different embedding models that, like doc2vec, can convert more than one word to a numeric vector. Here are a few examples:

• tweet2vec: Tweet2Vec: Character-Based Distributed Representations for Social Media
• lda2vec: Mixing Dirichlet Topic Models and Word Embeddings to Make lda2vec. This work proposes a model that learns dense word vectors jointly with Dirichlet-distributed latent document-level mixtures of topic vectors.
• Topic2Vec: Learning Distributed Representations of Topics
• Med2vec: Multi-layer Representation Learning for Medical Concepts

The list can go on. In the next section we will look at how to load doc2vec and use it for text clustering.

## Building doc2vec Model

Here is an example of converting a paragraph of words to a vector using a doc2vec model that we build ourselves. The example is taken from .

The script consists of the following main steps:

• build model using own text
• save model to file
• load model from this file
• infer vector representation
```
from gensim.test.utils import common_texts
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

print (common_texts)

"""
output:
[['human', 'interface', 'computer'], ['survey', 'user', 'computer', 'system', 'response', 'time'], ['eps', 'user', 'interface', 'system'], ['system', 'human', 'system', 'eps'], ['user', 'response', 'time'], ['trees'], ['graph', 'trees'], ['graph', 'minors', 'trees'], ['graph', 'minors', 'survey']]
"""

documents = [TaggedDocument(doc, [i]) for i, doc in enumerate(common_texts)]

print (documents)
"""
output
[TaggedDocument(words=['human', 'interface', 'computer'], tags=[0]), TaggedDocument(words=['survey', 'user', 'computer', 'system', 'response', 'time'], tags=[1]), TaggedDocument(words=['eps', 'user', 'interface', 'system'], tags=[2]), TaggedDocument(words=['system', 'human', 'system', 'eps'], tags=[3]), TaggedDocument(words=['user', 'response', 'time'], tags=[4]), TaggedDocument(words=['trees'], tags=[5]), TaggedDocument(words=['graph', 'trees'], tags=[6]), TaggedDocument(words=['graph', 'minors', 'trees'], tags=[7]), TaggedDocument(words=['graph', 'minors', 'survey'], tags=[8])]

"""

model = Doc2Vec(documents, size=5, window=2, min_count=1, workers=4)
#Persist a model to disk:

from gensim.test.utils import get_tmpfile
fname = get_tmpfile("my_doc2vec_model")

print (fname)
#output: C:\Users\userABC\AppData\Local\Temp\my_doc2vec_model

model.save(fname)
# you can continue training with the loaded model!
#If you’re finished training a model (=no more updates, only querying, reduce memory usage), you can do:

model.delete_temporary_training_data(keep_doctags_vectors=True, keep_inference=True)

#Infer vector for a new document:
#Here our text paragraph just 2 words
vector = model.infer_vector(["system", "response"])
print (vector)

"""
output

[-0.08390492  0.01629403 -0.08274432  0.06739668 -0.07021132]

"""
```

## Using Pretrained doc2vec Model

We can skip the step of building the embedding file and use an already built one. Here is an example of how to use a pretrained embedding file to represent test documents as vectors. The script is based on .

The script below uses a doc2vec model pretrained on Wikipedia data from this location

Here is the link where you can find links to different pre-trained doc2vec and word2vec models and additional information.

You need to download the zip file, unzip it, put the 3 files in a folder and provide the path in the script. In this example it is “doc2vec/doc2vec.bin”

The main steps of the script below are just loading the doc2vec model and inferring vectors.

```
import gensim.models as g
import codecs

model="doc2vec/doc2vec.bin"
test_docs="data/test_docs.txt"
output_file="data/test_vectors.txt"

#inference hyper-parameters
start_alpha=0.01
infer_epoch=1000

#load model
m = g.Doc2Vec.load(model)
test_docs = [ x.strip().split() for x in codecs.open(test_docs, "r", "utf-8").readlines() ]

#infer test vectors
output = open(output_file, "w")
for d in test_docs:
    output.write( " ".join([str(x) for x in m.infer_vector(d, alpha=start_alpha, steps=infer_epoch)]) + "\n" )
output.flush()
output.close()

"""
output file
0.03772797 0.07995503 -0.1598981 0.04817521 0.033129826 -0.06923918 0.12705861 -0.06330753 .........
"""
```

So we got an output file with vectors (one per paragraph). That means we successfully converted our text to vectors. Now we can use them in different machine learning algorithms such as text classification, text clustering and many others. The next section shows an example of the Birch clustering algorithm with word embeddings.
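Since the output file holds one vector per line with values separated by spaces, reading it back is straightforward. Below is a small sketch that parses this layout; `read_vectors` is a hypothetical helper, and an in-memory string with made-up values stands in for data/test_vectors.txt.

```python
import io

def read_vectors(handle):
    """Parse one vector per line, with values separated by spaces."""
    vectors = []
    for line in handle:
        line = line.strip()
        if line:  # skip blank lines
            vectors.append([float(x) for x in line.split()])
    return vectors

# in-memory stand-in for data/test_vectors.txt with made-up values
sample = io.StringIO("0.1 0.2 -0.3\n0.4 -0.5 0.6\n")
vectors = read_vectors(sample)
print(len(vectors), len(vectors[0]))
```

With a real file you would pass `open("data/test_vectors.txt")` instead of the `StringIO` object.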

## Using Pretrained doc2vec Model for Text Clustering (Birch Algorithm)

In this example we use the Birch clustering algorithm for clustering the text data file from
Birch is an unsupervised algorithm that is used for hierarchical clustering. An advantage of this algorithm is its ability to incrementally and dynamically cluster incoming data.

We use the following steps here:

• Load text docs that will be clustered
• Convert docs to vectors (infer_vector)
• Do clustering
```
from sklearn import metrics

import gensim.models as g
import codecs

model="doc2vec/doc2vec.bin"
test_docs="data/test_docs.txt"

#inference hyper-parameters
start_alpha=0.01
infer_epoch=1000

#load model
m = g.Doc2Vec.load(model)
test_docs = [ x.strip().split() for x in codecs.open(test_docs, "r", "utf-8").readlines() ]

print (test_docs)
"""
[['the', 'cardigan', 'welsh', 'corgi'........
"""

#infer one vector per document
X=[]
for d in test_docs:
    X.append( m.infer_vector(d, alpha=start_alpha, steps=infer_epoch) )

k=3

from sklearn.cluster import Birch

brc = Birch(branching_factor=50, n_clusters=k, threshold=0.1, compute_labels=True)
brc.fit(X)

clusters = brc.predict(X)

labels = brc.labels_

print ("Clusters: ")
print (clusters)

silhouette_score = metrics.silhouette_score(X, labels, metric='euclidean')

print ("Silhouette_score: ")
print (silhouette_score)

"""
Clusters:
[1 0 0 1 1 2 1 0 1 1]
Silhouette_score:
0.17644188
"""

```
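To show what the silhouette score measures, here is a plain-Python sketch of the usual definition: for each point, a is the mean distance to the other points of its own cluster, b is the mean distance to the nearest other cluster, and the point's score is (b - a) / max(a, b). The function and the toy points are made up for illustration and are not the sklearn implementation.

```python
import math

def silhouette(points, labels):
    """Mean silhouette coefficient with Euclidean distance.

    Assumes at least two clusters; a point alone in its cluster scores 0.
    """
    def dist(p, q):
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

    scores = []
    for i, p in enumerate(points):
        # collect distances from p to the points of every cluster
        by_cluster = {}
        for j, q in enumerate(points):
            if i != j:
                by_cluster.setdefault(labels[j], []).append(dist(p, q))
        own = by_cluster.pop(labels[i], [])
        if not own:
            scores.append(0.0)
            continue
        a = sum(own) / len(own)                                # mean intra-cluster distance
        b = min(sum(d) / len(d) for d in by_cluster.values())  # nearest other cluster
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)

# two well separated toy clusters
points = [(0, 0), (0, 1), (10, 0), (10, 1)]
labels = [0, 0, 1, 1]
score = silhouette(points, labels)
print(round(score, 3))
```

Well separated clusters give a score close to 1, while a value near 0, like the 0.176 above, means the clusters overlap considerably.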

If you want to experiment with text clustering and word embeddings, here is the online demo. Currently it uses word2vec and GloVe models and the k-means clustering algorithm. Select the ‘Text Clustering’ option and scroll down to the input data.

## Conclusion

We looked at what doc2vec is and investigated 2 ways to load this model: we can create an embedding model file from our own text or use a pretrained embedding file. We applied doc2vec to text clustering with the Birch algorithm. When we need to work with paragraphs, sentences or documents, doc2vec can simplify converting text to vectors.

## Document Similarity, Tokenization and Word Vectors in Python with spaCy

Calculating document similarity is a very frequent task in Information Retrieval or Text Mining. Years ago we would need to build a document-term matrix or term-document matrix that describes the frequency of terms that occur in a collection of documents and then do word vector math to find similarity. Now with spaCy it can be done in just a few lines. Below you will find how to get document similarity, tokenization and word vectors with spaCy.

spaCy is an open-source library designed to help you build NLP applications. It has a lot of features; in this post we will look at only a few, but very useful ones.

## Document Similarity

Here is how to get document similarity:

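```
import spacy

# Assumption: any spaCy model that ships with word vectors works here;
# "en_core_web_md" is used only as an example.
nlp = spacy.load("en_core_web_md")

doc1 = nlp(u'Hello this is document similarity calculation')
doc2 = nlp(u'Hello this is python similarity calculation')
doc3 = nlp(u'Hi there')

print (doc1.similarity(doc2))
print (doc2.similarity(doc3))
print (doc1.similarity(doc3))

"""
Output:
0.94
0.33
0.30
"""
```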

In more realistic situations we would load documents from files and have longer texts. Here is the experiment that I performed. I saved 3 articles from different random sites, two about deep learning and one about feature engineering.

```
def get_file_contents(filename):
    with open(filename, 'r') as filehandle:
        filecontent = filehandle.read()
        return (filecontent)

fn1="deep_learning1.txt"
fn2="feature_eng.txt"
fn3="deep_learning.txt"

fn1_doc=get_file_contents(fn1)
print (fn1_doc)

fn2_doc=get_file_contents(fn2)
print (fn2_doc)

fn3_doc=get_file_contents(fn3)
print (fn3_doc)

doc1 = nlp(fn1_doc)
doc2 = nlp(fn2_doc)
doc3 = nlp(fn3_doc)

print ("dl1 - features")
print (doc1.similarity(doc2))
print ("feature - dl")
print (doc2.similarity(doc3))
print ("dl1 - dl")
print (doc1.similarity(doc3))

"""
output:
dl1 - features
0.9700237040142454
feature - dl
0.9656364096761337
dl1 - dl
0.9547075478662724
"""

```

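All three scores are very close (between 0.95 and 0.97), and in this run the pair of deep learning articles (dl1 - dl) did not get the highest score, so for long documents the scores should be interpreted with care.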

## Tokenization

Another very useful and simple feature of spaCy is tokenization. Here is how easy it is to convert text into tokens (words):

```
for token in doc1:
    print(token.text)
    print (token.vector)
```

## Word Vectors

spaCy has integrated word vector support, while other libraries like NLTK do not. The lines below print word embeddings, an array of 768 numbers in my environment.

```
print (token.vector)   #-  prints word vector form of token.
print (doc1[0].vector) #- prints word vector form of first token of document.
print (doc1.vector)    #- prints mean vector form for doc1
```

So we looked at how to use a few features (similarity, tokenization and word embeddings) that are very easy to implement with spaCy. I hope you enjoyed this post. If you have any tips or anything else to add, please leave a comment below.

References
1. spaCy
2. Word Embeddings in Python with Spacy and Gensim

## How to Convert Word to Vector with GloVe and Python

In the previous post we looked at Vector Representation of Text with word embeddings using word2vec. Another approach to converting a word to a vector is GloVe – Global Vectors for Word Representation. According to the GloVe home page, “GloVe is an unsupervised learning algorithm for obtaining vector representations for words. Training is performed on aggregated global word-word co-occurrence statistics from a corpus”. Thus we can convert a word to a vector using GloVe.

In this post we will look at how to use the pretrained GloVe data file that can be downloaded from . We will look at how to get a word vector representation from this downloaded data file and how to get the nearest words. Why do we need a vector representation of text? Because this is what we input to machine learning or data science algorithms: we feed numerical vectors to algorithms such as text classification, clustering or other text analytics algorithms.

The code that I put here is based on some examples that I found on StackOverflow .

So first you need to open the file and load the data into the model. Then you can get the vector representation and other things.

Below is the full source code for glove python script:

```
file = "C:\\Users\\glove\\glove.6B.50d.txt"
import numpy as np

def loadGloveModel(gloveFile):
    # each line holds a word followed by its vector components
    with open(gloveFile, encoding="utf8" ) as f:
        content = f.readlines()
    model = {}
    for line in content:
        splitLine = line.split()
        word = splitLine[0]
        embedding = np.array([float(val) for val in splitLine[1:]])
        model[word] = embedding
    return model

model = loadGloveModel(file)

print (model['hello'])

"""
Below is the output of the above code
[-0.38497   0.80092   0.064106 -0.28355  -0.026759 -0.34532  -0.64253
-0.11729  -0.33257   0.55243  -0.087813  0.9035    0.47102   0.56657
0.6985   -0.35229  -0.86542   0.90573   0.03576  -0.071705 -0.12327
0.54923   0.47005   0.35572   1.2611   -0.67581  -0.94983   0.68666
0.3871   -1.3492    0.63512   0.46416  -0.48814   0.83827  -0.9246
-0.33722   0.53741  -1.0616   -0.081403 -0.67111   0.30923  -0.3923
-0.55002  -0.68827   0.58049  -0.11626   0.013139 -0.57654   0.048833
0.67204 ]
"""
```

So we got the numerical representation of the word ‘hello’.
We can also use pandas to load the GloVe file. Below are functions for loading with pandas and getting vector information.

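```
import pandas as pd
import csv

# load the GloVe file into a DataFrame indexed by word
words = pd.read_table(file, sep=" ", index_col=0, header=None, quoting=csv.QUOTE_NONE)

def vec(w):
    return words.loc[w].to_numpy()

print (vec('hello'))    #this will print same as print (model['hello'])  before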

```

### Finding Closest Word or Words

Now how do we find the closest word to the word “table”? We compute the squared differences between its vector and every row of the pandas dataframe and then use the numpy argmin function.
The closest word to any word is always the word itself (as delta = 0), so I needed to drop the word ‘table’ and also the next closest word, ‘tables’. The final output for the closest word was “place”.

```
words = words.drop("table", axis=0)
words = words.drop("tables", axis=0)

words_matrix = words.to_numpy()

def find_closest_word(v):
    # squared Euclidean distance from v to every word vector
    diff = words_matrix - v
    delta = np.sum(diff * diff, axis=1)
    i = np.argmin(delta)
    return words.iloc[i].name

print (find_closest_word(model['table']))
#output:  place

#If we want to retrieve more than one closest word, here is the function:

def find_N_closest_word(v, N, words):
    Nwords=[]
    for w in range(N):
        diff = words.to_numpy() - v
        delta = np.sum(diff * diff, axis=1)
        i = np.argmin(delta)
        Nwords.append(words.iloc[i].name)
        # drop the word just found so the next pass finds the next closest
        words = words.drop(words.iloc[i].name, axis=0)

    return Nwords

print (find_N_closest_word(model['table'], 10, words))

#Output:
#['table', 'tables', 'place', 'sit', 'set', 'hold', 'setting', 'here', 'placing', 'bottom']
```
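Dropping a row on every pass works but recomputes all distances N times. As an alternative sketch (not the approach used above), the N smallest distances can be picked in a single pass with `heapq.nsmallest` from the standard library. The tiny vocabulary and its vectors below are made up for illustration.

```python
import heapq

def n_closest_words(target, vectors, n):
    """Return the n words whose vectors have the smallest squared distance to target."""
    def sq_dist(word):
        return sum((a - b) ** 2 for a, b in zip(vectors[word], target))
    return heapq.nsmallest(n, vectors, key=sq_dist)

# toy 2-dimensional vectors, made up for illustration
toy_vectors = {
    "table": [1.0, 1.0],
    "tables": [1.1, 1.0],
    "place": [1.5, 1.2],
    "juice": [9.0, -3.0],
}
closest = n_closest_words(toy_vectors["table"], toy_vectors, 3)
print(closest)
```

As in the pandas version, the query word itself comes first because its distance is 0.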

We can also use the gensim word2vec functionality once the GloVe file is converted to the word2vec format.

```
from gensim.scripts.glove2word2vec import glove2word2vec
glove2word2vec(glove_input_file=file, word2vec_output_file="gensim_glove_vectors.txt")

###Finally, read the word2vec txt to a gensim model using KeyedVectors:

from gensim.models.keyedvectors import KeyedVectors
glove_model = KeyedVectors.load_word2vec_format("gensim_glove_vectors.txt", binary=False)

```
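As far as I can tell, the only difference between the GloVe text format and the word2vec text format is a first line with the vocabulary size and the number of dimensions. The sketch below shows that conversion on an in-memory string with made-up values; `glove_to_word2vec_text` is a hypothetical helper, not the gensim function.

```python
def glove_to_word2vec_text(glove_text):
    """Prepend the '<vocab_size> <dimensions>' header used by the word2vec text format."""
    lines = [ln for ln in glove_text.splitlines() if ln.strip()]
    dims = len(lines[0].split()) - 1  # the first field is the word itself
    return "{} {}\n".format(len(lines), dims) + "\n".join(lines) + "\n"

# two made-up words with 3-dimensional vectors
glove_sample = "hello 0.1 0.2 0.3\nworld 0.4 0.5 0.6\n"
converted = glove_to_word2vec_text(glove_sample)
print(converted)
```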

### Difference between word2vec and GloVe

Both models learn geometrical encodings (vectors) of words from their co-occurrence information. They differ in how they learn this information: word2vec uses a “predictive” model (a feed-forward neural network), whereas GloVe uses a “count-based” model (dimensionality reduction on the co-occurrence counts matrix).
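To give a feeling for what "co-occurrence counts" means, here is a simplified sketch that counts how often two words appear within a small window of each other. It uses raw counts only and is not the actual GloVe training code; the function name and the toy sentences are chosen for illustration.

```python
from collections import Counter

def cooccurrence_counts(sentences, window=2):
    """Count how often each ordered word pair appears within `window` positions."""
    counts = Counter()
    for tokens in sentences:
        for i, word in enumerate(tokens):
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    counts[(word, tokens[j])] += 1
    return counts

toy_sentences = [["this", "is", "the", "new", "post"],
                 ["this", "is", "another", "book"]]
counts = cooccurrence_counts(toy_sentences, window=1)
print(counts[("this", "is")], counts[("is", "the")])
```

A count-based model starts from a table like this, while a predictive model never builds it explicitly and instead learns by predicting neighboring words.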

## Vector Representation of Text – Word Embeddings with word2vec

Computers cannot understand text directly. We need to convert text into numerical vectors before any kind of text analysis like text clustering or classification. The classical well-known model is bag of words (BOW). With this model we have one dimension for each unique word in the vocabulary. We represent the document as a vector of 0s and 1s, using 1 if the word from the vocabulary exists in the document and 0 otherwise.
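A minimal sketch of this binary bag of words representation in plain Python follows. The helper name `bow_vector` and the two tiny documents are chosen for illustration.

```python
def bow_vector(tokens, vocabulary):
    """Return 1 for every vocabulary word present in tokens, else 0."""
    present = set(tokens)
    return [1 if word in present else 0 for word in vocabulary]

docs = [["this", "is", "another", "book"], ["one", "more", "book"]]
# one dimension per unique word, in alphabetical order
vocabulary = sorted({word for doc in docs for word in doc})
vectors = [bow_vector(doc, vocabulary) for doc in docs]
print(vocabulary)
print(vectors)
```

Note that the vector length equals the vocabulary size, which is why BOW vectors become very long on real corpora.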

Recently, new models with word embeddings have gained popularity in machine learning since they keep semantic information. With word embeddings we can get lower dimensionality than with the BOW model. There are several such models, for example GloVe and word2vec, that are used in machine learning text analysis.

Many examples on the web show how to operate at the word level with word embedding methods, but in most cases we are working at the document level (sentence, paragraph or document). To understand how it can be used for text analytics I decided to take word2vec and create a small practical example.

In this post you will learn how to use the word2vec word embedding method for converting a sentence into a numerical vector. The same technique can be used for text with more than one sentence. We will create a python script that converts sentences into numerical vectors.

### Input

As input for this script we will use sentences hard coded in the script. The sentences are already tokenized. Below you can find the sentences for our input. Note that sentences 6 and 7 are more distinct from the other sentences.

```1 [['this', 'is', 'the', 'good', 'machine', 'learning', 'book'],
2 ['this', 'is',  'another', 'book'],
3 ['one', 'more', 'book'],
4 ['this', 'is', 'the', 'new', 'post'],
5 ['this', 'is', 'about', 'machine', 'learning', 'post'],
6 ['orange', 'juice', 'is', 'the', 'liquid', 'extract', 'of', 'the', 'fruit'],
7 ['orange', 'juice', 'comes', 'in', 'several', 'different', 'varieties'],
8 ['and', 'this', 'is', 'the', 'last', 'post']]
```

With word2vec you have two options:
1. Train your own word embeddings on your text
2. Use pretrained data from Google

### From word to sentence

Each word in word embeddings is represented by a vector. But let’s say we are working with tweets from Twitter and need to know how similar or dissimilar the tweets are. Then we need a vector representation of the whole text of a tweet. To achieve this we can average the word embeddings of the words in the sentence (or tweet or paragraph). The idea comes from paper . In this paper the authors averaged word embeddings to get a paragraph vector.

### Source code for conversion

Below in Listing A and Listing B you can find how we can average word embeddings and get numerical vectors.
Listing A has the python source code for using our own word embeddings.
Listing B has the python source code for using word embeddings from Google.

When averaging embeddings I used the first 50 dimensions. This is the minimal number that was used in one of the papers. The recommendation is to use between 100 and 400 dimensions.

### Analysis of Results

How do we know that our results are good? We will do a quick check as follows. We will calculate the similarity between vectors and compare it with our expectation. If sentences belong to different contexts, we expect a low similarity between them, and if sentences are close in meaning, we expect a high similarity. Because the context of sentences 6 and 7 is different from the other sentences, we would expect to see this difference in the results.

For calculating similarity we use the cosine measure in the script. With the cosine measure, the most similar pair is the one with the highest cosine value. Below are the results.
Note that 0 values mean that the cosine value was not calculated because there is no need to do this (the value was already calculated, for example doc21 = doc12, or the value is on the diagonal).

```
Results from Listing A (using own word embeddings):
    1  2    3    4    5    6    7    8
1 [[0, 0.5, 0.1, 0.5, 0.6, 0.4, 0.2, 0.4],
2  [0, 0,   0.2, 0.6, 0.5, 0.2, 0.1, 0.5],
3  [0, 0,   0,   0.0, 0.0, 0.0, 0.1, 0.0],
4  [0, 0,   0,   0,   0.6, 0.5, 0.3, 0.7],
5  [0, 0,   0,   0,   0,   0.2, 0.2, 0.6],
6  [0, 0,   0,   0,   0,   0,   0.4, 0.4],
7  [0, 0,   0,   0,   0,   0,   0,   0.3],
8  [0, 0,   0,   0,   0,   0,   0,   0]]

Results from Listing B (using pretrained dataset):
    1  2     3     4     5     6     7     8
1 [[0, 0.77, 0.33, 0.57, 0.78, 0.35, 0.37, 0.55],
2  [0, 0,    0.60, 0.62, 0.51, 0.31, 0.29, 0.59],
3  [0, 0,    0,    0.16, 0.12, 0.18, 0.25, 0.11],
4  [0, 0,    0,    0,    0.62, 0.41, 0.37, 0.89],
5  [0, 0,    0,    0,    0,    0.35, 0.27, 0.61],
6  [0, 0,    0,    0,    0,    0,    0.81, 0.37],
7  [0, 0,    0,    0,    0,    0,    0,    0.32],
8  [0, 0,    0,    0,    0,    0,    0,    0]]
```
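Because cosine similarity is symmetric, the upper triangle above holds all the information. If a full matrix is needed, for example as input to another tool, the values can be mirrored as in this sketch. `mirror_upper_triangle` is a hypothetical helper, and the input is the first three rows and columns of the Listing B results.

```python
def mirror_upper_triangle(upper, diagonal=1.0):
    """Fill the lower triangle and the diagonal of an upper-triangular similarity matrix."""
    n = len(upper)
    full = [[0.0] * n for _ in range(n)]
    for i in range(n):
        full[i][i] = diagonal  # a vector has cosine similarity 1 with itself
        for j in range(i + 1, n):
            full[i][j] = full[j][i] = upper[i][j]
    return full

# first three rows and columns of the Listing B results
upper = [[0, 0.77, 0.33],
         [0, 0,    0.60],
         [0, 0,    0]]
full = mirror_upper_triangle(upper)
print(full)
```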

Looking at the results we can see that our expectations are confirmed, especially in the results where pretrained word embeddings were used. Sentences 6 and 7 have low similarity with the other sentences but a high similarity of 0.81 with each other.

### Conclusion

In this post we considered how to represent a document (sentence, paragraph) as a vector of numbers using the word2vec word embedding model. We looked at 2 possible ways: using our own embeddings and using embeddings from Google. We got results for our small example and were able to evaluate them.

Now we can feed the vector representation of text into machine learning text analysis algorithms.

Here are a few posts where you can find how to feed word2vec word embeddings into text clustering algorithms such as k-means from the NLTK and sklearn libraries and how to plot data with TSNE:
K Means Clustering Example with Word2Vec in Data Mining or Machine Learning
Text Clustering with Word Embedding in Machine Learning

Below are a few links for different word embedding models that are also widely used:
GloVe –
How to Convert Word to Vector with GloVe and Python
fastText –
FastText Word Embeddings

Listing A. Here is the python source code for using own word embeddings

```
from gensim.models import Word2Vec
sentences = [['this', 'is', 'the', 'good', 'machine', 'learning', 'book'],
['this', 'is',  'another', 'book'],
['one', 'more', 'book'],
['this', 'is', 'the', 'new', 'post'],
['this', 'is', 'about', 'machine', 'learning', 'post'],
['orange', 'juice', 'is', 'the', 'liquid', 'extract', 'of', 'the', 'fruit'],
['orange', 'juice', 'comes', 'in', 'several', 'different', 'varieties'],
['and', 'this', 'is', 'the', 'last', 'post']]

model = Word2Vec(sentences, min_count=1, size=100)
vocab = model.vocab.keys()
wordsInVocab = len(vocab)
print (model.similarity('post', 'book'))

import numpy as np

def sent_vectorizer(sent, model):
    sent_vec = np.zeros(100)
    numw = 0
    for w in sent:
        try:
            # add up the embeddings of the words found in the model
            sent_vec = np.add(sent_vec, model[w])
            numw+=1
        except KeyError:
            pass
    # scale the summed vector to unit length
    return sent_vec / np.sqrt(sent_vec.dot(sent_vec))

V=[]
for sentence in sentences:
    V.append(sent_vectorizer(sentence, model))

from numpy import dot
from numpy.linalg import norm
results = [[0 for i in range(len(V))] for j in range(len(V))]

# fill only the upper triangle: cosine similarity is symmetric
for i in range (len(V) - 1):
    for j in range(i+1, len(V)):
        results[i][j] = dot(V[i],V[j])/norm(V[i])/norm(V[j])

print (results)

```

Listing B. Here is the python source code for using word embeddings from Google.

```
import gensim

sentences = [['this', 'is', 'the', 'good', 'machine', 'learning', 'book'],
['this', 'is',  'another', 'book'],
['one', 'more', 'book'],
['this', 'is', 'the', 'new', 'post'],
['this', 'is', 'about', 'machine', 'learning', 'post'],
['orange', 'juice', 'is', 'the', 'liquid', 'extract', 'of', 'the', 'fruit'],
['orange', 'juice', 'comes', 'in', 'several', 'different', 'varieties'],
['and', 'this', 'is', 'the', 'last', 'post']]

# Load the pretrained Google News vectors. The path is a placeholder:
# point it to your local copy of the downloaded file.
model = gensim.models.KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin', binary=True)
vocab = model.vocab.keys()
wordsInVocab = len(vocab)

import numpy as np

def sent_vectorizer(sent, model):
    sent_vec = np.zeros(50)
    numw = 0
    for w in sent:
        try:
            vc=model[w]
            vc=vc[0:50]
            # add up the first 50 dimensions of each known word
            sent_vec = np.add(sent_vec, vc)
            numw+=1
        except KeyError:
            pass
    # scale the summed vector to unit length
    return sent_vec / np.sqrt(sent_vec.dot(sent_vec))

V=[]
for sentence in sentences:
    V.append(sent_vectorizer(sentence, model))
from numpy.linalg import norm
results = [[0 for i in range(len(V))] for j in range(len(V))]

# fill only the upper triangle: cosine similarity is symmetric
for i in range (len(V) - 1):
    for j in range(i+1, len(V)):
        # cosine similarity: dot product divided by the product of the norms
        results[i][j] = np.dot(V[i], V[j]) / (norm(V[i]) * norm(V[j]))

print (results)
```

References
1. Document Embedding with Paragraph Vectors

## K Means Clustering Example with Word2Vec in Data Mining or Machine Learning

In this post you will find a K means clustering example with word2vec in python code. Word2Vec is one of the popular methods in language modeling and feature learning in natural language processing (NLP). This method is used to create word embeddings in machine learning whenever we need a vector representation of data.

For example, in data clustering algorithms we can use Word2Vec instead of the bag of words (BOW) model. The advantage of using Word2Vec is that it can capture the distance between individual words.

The example in this post will demonstrate how to use the results of Word2Vec word embeddings in clustering algorithms. For this, the Word2Vec model will be fed into several K means clustering algorithms from the NLTK and Scikit-learn libraries. Here we will do clustering at the word level, so our clusters will be groups of words. In case we need to cluster at the sentence or paragraph level, here is the link that shows how to move from word level to sentence/paragraph level:

Text Clustering with Word Embedding in Machine Learning

There is also the doc2vec word embedding model that is based on word2vec. doc2vec is created for embedding sentences, paragraphs or documents. Here is the link on how to use doc2vec word embedding in machine learning:
Text Clustering with doc2vec Word Embedding Machine Learning Model

## Getting Word2vec

Using word2vec from the python library gensim is simple and well described in tutorials and on the web. Here we just look at a basic example. For the input we use a sequence of sentences hard-coded in the script.

```
from gensim.models import Word2Vec
sentences = [['this', 'is', 'the', 'good', 'machine', 'learning', 'book'],
            ['this', 'is',  'another', 'book'],
            ['one', 'more', 'book'],
            ['this', 'is', 'the', 'new', 'post'],
            ['this', 'is', 'about', 'machine', 'learning', 'post'],
            ['and', 'this', 'is', 'the', 'last', 'post']]
model = Word2Vec(sentences, min_count=1)
```

Now we have a model with embedded words. We can query the model for similar words as below or ask it to represent a word as a vector:

```
print (model.similarity('this', 'is'))
print (model.similarity('post', 'book'))
#output -0.0198180344218
#output -0.079446731287
print (model.most_similar(positive=['machine'], negative=[], topn=2))
#output: [('new', 0.24608060717582703), ('is', 0.06899910420179367)]
print (model['the'])
#output [-0.00217354 -0.00237131  0.00296396 ...,  0.00138597  0.00291924  0.00409528]
```

To get the vocabulary or the number of words in the vocabulary:

```
print (list(model.vocab))
print (len(list(model.vocab)))
```

This will produce: [‘good’, ‘this’, ‘post’, ‘another’, ‘learning’, ‘last’, ‘the’, ‘and’, ‘more’, ‘new’, ‘is’, ‘one’, ‘about’, ‘machine’, ‘book’]

Now we will feed the word embeddings into a clustering algorithm such as k-means, which is one of the most popular unsupervised learning algorithms for finding interesting segments in the data. It can be used for separating customers into groups, combining documents into topics and for many other applications.
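Before using the library versions, it may help to see the idea of k-means in a few lines. The sketch below is a bare-bones version under simplifying assumptions: the first k points are used as initial centroids and a fixed number of iterations is run, whereas the NLTK and scikit-learn implementations choose starting points and stopping rules more carefully.

```python
def kmeans(points, k, iterations=10):
    """Plain k-means; the first k points serve as the initial centroids."""
    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # assignment step: attach every point to its nearest centroid
        for i, p in enumerate(points):
            dists = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids]
            labels[i] = dists.index(min(dists))
        # update step: move every centroid to the mean of its points
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids

# toy 2-dimensional points forming two obvious groups
points = [(0.0, 0.0), (10.0, 10.0), (0.0, 1.0), (10.0, 11.0), (1.0, 0.0)]
labels, centroids = kmeans(points, k=2)
print(labels)
print(centroids)
```

With word embeddings, each point would be the vector of one word instead of a 2-dimensional toy point.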

You will find below two k means clustering examples.

## K Means Clustering with NLTK Library

Our first example uses the k means algorithm from the NLTK library.
To use word2vec word embeddings in machine learning clustering algorithms we initialize X as below:

```
X = model[model.vocab]
```

Now we can plug our X data into clustering algorithms.

```
from nltk.cluster import KMeansClusterer
import nltk
NUM_CLUSTERS=3
kclusterer = KMeansClusterer(NUM_CLUSTERS, distance=nltk.cluster.util.cosine_distance, repeats=25)
assigned_clusters = kclusterer.cluster(X, assign_clusters=True)
print (assigned_clusters)
# output: [0, 2, 1, 2, 2, 1, 2, 2, 0, 1, 0, 1, 2, 1, 2]
```

In the python code above there are several options for the distance as below:

nltk.cluster.util.cosine_distance(u, v)
Returns 1 minus the cosine of the angle between vectors v and u. This is equal to 1 – (u.v / |u||v|).

nltk.cluster.util.euclidean_distance(u, v)
Returns the euclidean distance between vectors u and v. This is equivalent to the length of the vector (u – v).

Here we use cosine distance to cluster our data.
After we get the cluster results we can associate each word with the cluster that it was assigned to:

```words = list(model.vocab)
for i, word in enumerate(words):
    print (word + ":" + str(assigned_clusters[i]))
```

Here is the output for the above:
good:0
this:2
post:1
another:2
learning:2
last:1
the:2
and:2
more:0
new:1
is:0
one:1
about:2
machine:1
book:2
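
It is often easier to read the result grouped by cluster. A minimal sketch, reusing the vocabulary list and the cluster assignments printed earlier in this post (the variable name clusters is illustrative):

```python
from collections import defaultdict

# vocabulary and cluster ids copied from the outputs shown above
words = ['good', 'this', 'post', 'another', 'learning', 'last', 'the', 'and',
         'more', 'new', 'is', 'one', 'about', 'machine', 'book']
assigned_clusters = [0, 2, 1, 2, 2, 1, 2, 2, 0, 1, 0, 1, 2, 1, 2]

# collect the words that share a cluster id
clusters = defaultdict(list)
for word, cluster_id in zip(words, assigned_clusters):
    clusters[cluster_id].append(word)

for cluster_id in sorted(clusters):
    print(cluster_id, clusters[cluster_id])
```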

## K Means Clustering with Scikit-learn Library

This example uses the k-means implementation from the scikit-learn library.

```from sklearn import cluster
from sklearn import metrics
kmeans = cluster.KMeans(n_clusters=NUM_CLUSTERS)
kmeans.fit(X)

labels = kmeans.labels_
centroids = kmeans.cluster_centers_

print ("Cluster id labels for inputted data")
print (labels)
print ("Centroids data")
print (centroids)

print ("Score (Opposite of the value of X on the K-means objective which is Sum of distances of samples to their closest cluster center):")
print (kmeans.score(X))

silhouette_score = metrics.silhouette_score(X, labels, metric='euclidean')

print ("Silhouette_score: ")
print (silhouette_score)
```

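In this example we also compute two metrics for estimating clustering quality: the k-means score, which is the negative of the sum of squared distances from each sample to its closest cluster center, and the silhouette score, which ranges from -1 to 1.
Output: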

```Cluster id labels for inputted data
[0 1 1 ..., 1 2 2]
Centroids data
[[ -3.82586889e-04   1.39791325e-03  -2.13839358e-03 ...,  -8.68172920e-04
-1.23599875e-03   1.80053393e-03]
[ -3.11774168e-04  -1.63297475e-03   1.76715955e-03 ...,  -1.43826099e-03
1.22940990e-03   1.06353679e-03]
[  1.91571176e-04   6.40696089e-04   1.38173658e-03 ...,  -3.26442620e-03
-1.08828480e-03  -9.43636987e-05]]

Score (Opposite of the value of X on the K-means objective which is Sum of distances of samples to their closest cluster center):
-0.00894730946094
Silhouette_score:
0.0427737
```
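
To get a feel for these two metrics, the sketch below runs the same scikit-learn calls on synthetic data. The three tight blobs and the name X_demo are made up for illustration, and n_init and random_state are set only to make the run repeatable.

```python
import numpy as np
from sklearn import cluster, metrics

# synthetic data: three tight blobs in 2-D (hypothetical values, fixed seed)
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [5.0, 5.0], [0.0, 5.0]])
X_demo = np.vstack([c + 0.1 * rng.standard_normal((20, 2)) for c in centers])

kmeans = cluster.KMeans(n_clusters=3, n_init=10, random_state=0).fit(X_demo)

# score() is the negative of inertia_, the sum of squared distances
# from each sample to its closest cluster center
print(kmeans.score(X_demo), -kmeans.inertia_)

# silhouette is close to 1 when clusters are compact and well separated
sil = metrics.silhouette_score(X_demo, kmeans.labels_, metric='euclidean')
print(sil)
```

On well separated blobs like these the silhouette comes out close to 1. By comparison, the value of about 0.04 in the output above is close to 0, which suggests that the word vectors trained on six short sentences do not form clearly separated groups.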

Here is the full Python code of the script.

```# -*- coding: utf-8 -*-

from gensim.models import Word2Vec

from nltk.cluster import KMeansClusterer
import nltk

from sklearn import cluster
from sklearn import metrics

# training data

sentences = [['this', 'is', 'the', 'good', 'machine', 'learning', 'book'],
['this', 'is',  'another', 'book'],
['one', 'more', 'book'],
['this', 'is', 'the', 'new', 'post'],
['this', 'is', 'about', 'machine', 'learning', 'post'],
['and', 'this', 'is', 'the', 'last', 'post']]

# training model
model = Word2Vec(sentences, min_count=1)

# get vector data (gensim < 4.0 API; with gensim 4+ use model.wv[model.wv.index_to_key])
X = model[model.vocab]
print (X)

print (model.similarity('this', 'is'))

print (model.similarity('post', 'book'))

print (model.most_similar(positive=['machine'], negative=[], topn=2))

print (model['the'])

print (list(model.vocab))

print (len(list(model.vocab)))

NUM_CLUSTERS=3
kclusterer = KMeansClusterer(NUM_CLUSTERS, distance=nltk.cluster.util.cosine_distance, repeats=25)
assigned_clusters = kclusterer.cluster(X, assign_clusters=True)
print (assigned_clusters)

words = list(model.vocab)
for i, word in enumerate(words):
    print (word + ":" + str(assigned_clusters[i]))

kmeans = cluster.KMeans(n_clusters=NUM_CLUSTERS)
kmeans.fit(X)

labels = kmeans.labels_
centroids = kmeans.cluster_centers_

print ("Cluster id labels for inputted data")
print (labels)
print ("Centroids data")
print (centroids)

print ("Score (Opposite of the value of X on the K-means objective which is Sum of distances of samples to their closest cluster center):")
print (kmeans.score(X))

silhouette_score = metrics.silhouette_score(X, labels, metric='euclidean')

print ("Silhouette_score: ")
print (silhouette_score)
```

## Using Pretrained Word Embeddings in Machine Learning

In this post you will learn how to use pre-trained word embeddings in machine learning. Google provides a word vector model trained on its News corpus (3 billion running words), containing 3 million 300-dimensional English word vectors.

You can use the snippet below to load this file with gensim. Change the file path to the folder where you saved the file in the previous step.

Gensim
Gensim is a Python library for topic modelling, document indexing and similarity retrieval with large corpora. It is a Python framework for fast vector space modelling.

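The Python snippet below demonstrates how to load the pretrained Google file into a model and then query the model, for example for the similarity between words.

```# -*- coding: utf-8 -*-

import gensim

# Load the pretrained vectors (binary word2vec format).
# The path is a placeholder: point it at the unzipped file you downloaded.
model = gensim.models.KeyedVectors.load_word2vec_format('path/to/pretrained-vectors.bin', binary=True)

# model.vocab is the gensim < 4.0 API; gensim 4+ exposes model.key_to_index instead
vocab = model.vocab.keys()
wordsInVocab = len(vocab)
print (wordsInVocab)
print (model.similarity('this', 'is'))
print (model.similarity('post', 'book'))
```

Output from the above code:

```3000000
0.407970363878
0.0572043891977
```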

You can do everything else the same way as with word embeddings you trained yourself. The Google file is big, however: 1.5 GB compressed and 3.3 GB unzipped. On my laptop with 6 GB of RAM the code above took a while to run, but it did finish. Some other commands, however, I was not able to run.

See the post K Means Clustering Example with Word2Vec, which shows how to use embeddings in a machine learning algorithm. There a Word2Vec model is fed into k-means clustering algorithms from the NLTK and Scikit-learn libraries.

## GloVe and fastText Word Embedding in Machine Learning

Word2vec is not the only word embedding available. Below are a few links to posts on other word embeddings.
In How to Convert Word to Vector with GloVe and Python you will find how to convert words to vectors with GloVe (Global Vectors for Word Representation). It gives a detailed example of using a pretrained GloVe data file that can be downloaded.

One more link is FastText Word Embeddings for Text Classification with MLP and Python. In that post you will discover fastText word embeddings: how to load pretrained fastText vectors, get text embeddings, and use them in a document classification example.