In the previous post we looked at vector representation of text with word embeddings using word2vec. Another way to convert a word to a vector is GloVe (Global Vectors for Word Representation). Per the documentation on the GloVe home page [1], “GloVe is an unsupervised learning algorithm for obtaining vector representations for words. Training is performed on aggregated global word-word co-occurrence statistics from a corpus”. So GloVe gives us a second way to turn words into vectors.
In this post we will look at how to use a pretrained GloVe data file, which can be downloaded from [1].
We will see how to get the vector representation of a word from the downloaded data file, and how to find the nearest words to a given word. Why do we need a vector representation of text? Because machine learning and data science algorithms take numerical input: we feed numerical vectors to algorithms for text classification, clustering and other text analytics tasks.
Loading the GloVe Data File
The code that I put here is based on some examples that I found on StackOverflow [2].
First you need to open the file and load its contents into a model, which here is a dictionary that maps each word to its vector. After that you can look up the vector for any word in the vocabulary.
Below is the full source code for the GloVe Python script:
file = "C:\\Users\\glove\\glove.6B.50d.txt"
import numpy as np

def loadGloveModel(gloveFile):
    print ("Loading Glove Model")
    with open(gloveFile, encoding="utf8" ) as f:
        content = f.readlines()
    model = {}
    for line in content:
        splitLine = line.split()
        word = splitLine[0]
        embedding = np.array([float(val) for val in splitLine[1:]])
        model[word] = embedding
    print ("Done.",len(model)," words loaded!")
    return model

model= loadGloveModel(file)
print (model['hello'])

"""
Below is the output of the above code

Loading Glove Model
Done. 400000 words loaded!
[-0.38497 0.80092 0.064106 -0.28355 -0.026759 -0.34532 -0.64253
 -0.11729 -0.33257 0.55243 -0.087813 0.9035 0.47102 0.56657
 0.6985 -0.35229 -0.86542 0.90573 0.03576 -0.071705 -0.12327
 0.54923 0.47005 0.35572 1.2611 -0.67581 -0.94983 0.68666
 0.3871 -1.3492 0.63512 0.46416 -0.48814 0.83827 -0.9246
 -0.33722 0.53741 -1.0616 -0.081403 -0.67111 0.30923 -0.3923
 -0.55002 -0.68827 0.58049 -0.11626 0.013139 -0.57654 0.048833
 0.67204 ]
"""
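So we now have a numerical representation of the word ‘hello’: a vector of 50 numbers, matching the 50d in the file name.
We can also use pandas to load the GloVe file. Below is the code for loading it with pandas, together with a helper function that returns the vector for a word.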
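As a rough illustration of the file format the loader relies on, here is a minimal sketch that parses GloVe-style lines (a word followed by space-separated floats) one line at a time instead of reading the whole file into memory first. The function name parse_glove_lines and the three-word sample are hypothetical and made up for this example; the numbers are not real GloVe values.

```python
import io
import numpy as np

def parse_glove_lines(lines):
    """Build a {word: vector} dict from an iterable of GloVe-format text lines."""
    model = {}
    for line in lines:
        parts = line.rstrip().split(" ")
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        # First token is the word, the rest are the vector components.
        model[parts[0]] = np.asarray(parts[1:], dtype=np.float32)
    return model

# Hypothetical sample with 4-dimensional vectors (not real GloVe values).
sample = io.StringIO(
    "hello 0.1 0.2 0.3 0.4\n"
    "world 0.5 0.6 0.7 0.8\n"
    "table -0.1 0.0 0.1 0.2\n"
)
toy_model = parse_glove_lines(sample)
print(len(toy_model), toy_model["hello"].shape)
```

Because the function accepts any iterable of lines, an open file handle could be passed in place of the in-memory sample, so the file is consumed line by line.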
import pandas as pd
import csv

words = pd.read_table(file, sep=" ", index_col=0, header=None, quoting=csv.QUOTE_NONE)

def vec(w):
    # .to_numpy() replaces .as_matrix(), which was removed in pandas 1.0
    return words.loc[w].to_numpy()

print (vec('hello'))  # this prints the same vector as print (model['hello']) before
Finding Closest Word or Words
Now how do we find the closest word to the word “table”? We subtract the query vector from every row of the pandas DataFrame, sum the squared differences to get a squared Euclidean distance (delta) for each word, and then use the numpy argmin function to pick the row with the smallest delta.
The closest word to any word is always the word itself (as delta = 0), so I needed to drop the word ‘table’ and also the next closest word, ‘tables’. The final output for the closest word was “place”.
words = words.drop("table", axis=0)
words = words.drop("tables", axis=0)

words_matrix = words.to_numpy()  # .as_matrix() was removed in pandas 1.0

def find_closest_word(v):
    diff = words_matrix - v
    delta = np.sum(diff * diff, axis=1)  # squared Euclidean distance to every word
    i = np.argmin(delta)
    return words.iloc[i].name

print (find_closest_word(model['table']))
# output: place

# If we want to retrieve more than one closest word, here is the function:
def find_N_closest_word(v, N, words):
    Nwords=[]
    for w in range(N):
        diff = words.to_numpy() - v
        delta = np.sum(diff * diff, axis=1)
        i = np.argmin(delta)
        Nwords.append(words.iloc[i].name)
        words = words.drop(words.iloc[i].name, axis=0)  # remove the match before the next pass
    return Nwords

print (find_N_closest_word(model['table'], 10, words))
# Output (the list includes 'table' and 'tables', so it corresponds to the full
# vocabulary, before the two drops at the top of this snippet):
# ['table', 'tables', 'place', 'sit', 'set', 'hold', 'setting', 'here', 'placing', 'bottom']
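The functions above rank words by squared Euclidean distance and recompute all distances on every pass. As an alternative sketch (not the method used in this post), the ranking can be done with cosine similarity and a single sort, with an exclude set taking the place of dropping rows. The function name n_closest_by_cosine, its parameters and the 2-dimensional toy vectors are all hypothetical.

```python
import numpy as np

def n_closest_by_cosine(query, vocab, matrix, n=3, exclude=()):
    """Return the n words whose vectors have the highest cosine similarity to query."""
    norms = np.linalg.norm(matrix, axis=1) * np.linalg.norm(query)
    sims = matrix @ query / np.maximum(norms, 1e-12)  # guard against zero vectors
    order = np.argsort(-sims)  # indices from most to least similar
    result = [vocab[i] for i in order if vocab[i] not in exclude]
    return result[:n]

# Hypothetical 2-dimensional toy vectors (not real GloVe values).
vocab = ["table", "tables", "chair", "sky"]
matrix = np.array([[1.0, 0.0], [0.9, 0.1], [0.7, 0.7], [0.0, 1.0]])
closest = n_closest_by_cosine(matrix[0], vocab, matrix, n=2, exclude={"table"})
print(closest)
```

With real GloVe data, vocab would be the list of index labels of the DataFrame and matrix its underlying array. Cosine similarity ignores vector length, so the ranking can differ from the Euclidean one shown earlier.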
We can also use the gensim word2vec functionality once the GloVe file has been converted to the word2vec text format.
from gensim.scripts.glove2word2vec import glove2word2vec
glove2word2vec(glove_input_file=file, word2vec_output_file="gensim_glove_vectors.txt")

# Finally, read the word2vec txt to a gensim model using KeyedVectors:
from gensim.models.keyedvectors import KeyedVectors
glove_model = KeyedVectors.load_word2vec_format("gensim_glove_vectors.txt", binary=False)
Difference between word2vec and GloVe
Both models learn geometric encodings (vectors) of words from their co-occurrence information. They differ in how they learn this information: word2vec uses a “predictive” model (a feed-forward neural network), whereas GloVe uses a “count-based” model (dimensionality reduction on the co-occurrence count matrix). [3]
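To make “co-occurrence counts” a little more concrete, here is a small sketch that counts how often pairs of words appear near each other in a toy sentence. It is only an illustration of the kind of statistic a count-based model starts from, not the GloVe training code: the function name cooccurrence_counts, the window size and the example sentence are assumptions, and the weighting of pairs by distance that a real pipeline may apply is left out.

```python
from collections import Counter

def cooccurrence_counts(tokens, window=2):
    """Count how often each ordered (word, context) pair occurs within `window` positions."""
    counts = Counter()
    for i, word in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # a word is not its own context
                counts[(word, tokens[j])] += 1
    return counts

# Hypothetical toy corpus.
tokens = "the cat sat on the mat".split()
counts = cooccurrence_counts(tokens, window=1)
print(counts[("the", "cat")], counts[("cat", "the")], counts[("cat", "mat")])
```

A count-based model would then learn vectors from a table like this one, while a predictive model would instead be trained to guess context words directly from the running text.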
I hope you enjoyed reading this post about how to convert a word to a vector with GloVe and Python. If you have any tips or anything else to add, please leave a comment below.
References
1. GloVe: Global Vectors for Word Representation
2. Load pretrained glove vectors in python
3. How is GloVe different from word2vec
4. Don’t count, predict! A systematic comparison of context-counting vs. context-predicting semantic vectors
5. Words Embeddings