How to Extract Text from Website

Extracting data from the web with scripts (web scraping) is widely used today for numerous purposes. One part of this process is downloading the actual text from URLs, which is the topic of this post.

We will consider how it can be done using the following case examples:

  • Extracting information from the links visited in the Chrome browser history.

  • Extracting information from a list of links. For example, in the previous post we looked at how to extract links from Twitter search results into a CSV file. That file will now be the source of links.

Below is the Python implementation of the main parts of the script. It uses a few code snippets and posts from the web; references and the full source code are provided at the end.

Switching Between Cases
The script uses the variable USE_LINKS_FROM_CHROME_HISTORY to select the program flow. If USE_LINKS_FROM_CHROME_HISTORY is True, it extracts links from Chrome history; otherwise it uses the file with links.

results=[]
if  USE_LINKS_FROM_CHROME_HISTORY:
        results =  get_links_from_chrome_history() 
        fname="data_from_chrome_history_links.csv"
else:
        results=get_links_from_csv_file()
        fname="data_from_file_links.csv"

Extracting Content From HTML Links
We use the Python library BeautifulSoup for processing HTML and the requests library for downloading it:

from bs4 import BeautifulSoup
from bs4.element import Comment
import requests

def tag_visible(element):
    if element.parent.name in ['style', 'script', 'head',  'meta', '[document]']:
        return False
    if isinstance(element, Comment):
        return False
    return True

def get_text(url):
   print (url) 
   
   try:
      req  = requests.get(url, timeout=5)
   except requests.exceptions.RequestException:
      # the request failed (timeout, connection error, invalid URL, ...)
      return "TIMEOUT ERROR"
  
   data = req.text
   soup = BeautifulSoup(data, "html.parser")
   texts = soup.findAll(text=True)
   visible_texts = filter(tag_visible, texts)  
   return u" ".join(t.strip() for t in visible_texts)

Extracting Content from PDF Format with PDF to Text Python

Not all links return an HTML page. Some lead to content in PDF format, and for these we need a specific process to get the text out of the PDF. Several solutions are possible; here we will use the pdftotext executable. [2] With this method we create the function below and call it when the URL ends with “.pdf”.

To do the actual conversion from PDF to text we use subprocess.call and provide the location of pdftotext.exe, the filename of the PDF file and the filename of the new text file. Note that we first download the PDF page to a PDF file on the local drive.

import subprocess

def get_txt_from_pdf(url):
    myfile = requests.get(url, timeout=8)
    myfile_name = url.split("/")[-1]
    myfile_name_wout_ext = myfile_name[0:-4]
    pdf_path = 'C:\\Users\\username\\Downloads\\' + myfile_name
    txt_path = 'C:\\Users\\username\\Downloads\\' + myfile_name_wout_ext + ".txt"
    # save the downloaded pdf to the local drive
    open(pdf_path, 'wb').write(myfile.content)
    # run pdftotext.exe on the saved pdf, writing the extracted text next to it
    subprocess.call(['C:\\Users\\username\\pythonrun\\pdftotext' + '\\pdftotext', pdf_path, txt_path])
    with open(txt_path, 'r') as content_file:
        content = content_file.read()
    return content

 if url.endswith(".pdf"):
                  txt = get_txt_from_pdf(full_url)

Cleaning Extracted Text
Once text is extracted from PDF or HTML we need to remove text that is not useful.
Below are the processing actions implemented in the script (a minimal sketch of these steps follows the list):

  • remove non-content text such as scripts, HTML tags and comments (HTML pages only)
  • remove non-text characters
  • remove repeating spaces
  • remove documents shorter than a minimum number of characters (MIN_LENGTH_of_document)
  • remove bad request results – for example, when the request for a specific link was not successful but still returned some text
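
As an illustration, here is a minimal sketch of these cleaning steps; it mirrors the clean_txt and remove_min_words functions from the full script at the end of the post, and the sample string is made up for demonstration.

import re

MIN_LENGTH_of_document = 40

def clean_txt_sketch(text):
    # keep only letters, dots and spaces (removes non-text characters)
    text = re.sub('[^A-Za-z. ]', ' ', text)
    # collapse repeating whitespace
    text = ' '.join(text.split())
    # drop one-letter words and lowercase the result
    text = re.sub(r'\W*\b\w{1,1}\b', '', text).lower()
    # drop the whole document if it is shorter than the minimum length
    return text if len(text) >= MIN_LENGTH_of_document else ""

print(clean_txt_sketch("Page not found!!! 404 error --- please try again later or contact support"))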

Getting Links from Chrome History
To get the visited links we query the Chrome web browser's history database with a simple SQL statement. This is well described on other web blogs; you can find a link in the references below [1].

Additionally, when extracting from Chrome history we need to remove links that are out of scope. For example, if you are extracting links you used for reading about data mining, links to your banking site or to friends on Facebook are not relevant.

To filter out unrelated links we can add filtering criteria to the SQL statement with NOT Like or <> as below:
select_statement = "SELECT urls.url FROM urls WHERE urls.url NOT Like '%localhost%' AND urls.url NOT Like '%google%' AND urls.visit_count > 0 AND urls.url <> 'https://www.reddit.com/' ;"
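
Below is a minimal sketch of running this query against the Chrome history database (Windows path, the same default location used in the full script at the end; Chrome should be closed, or the database copied first, so the file is not locked):

import os
import sqlite3

# default location of the Chrome history database on Windows
data_path = os.path.expanduser('~') + "\\AppData\\Local\\Google\\Chrome\\User Data\\Default"
history_db = os.path.join(data_path, 'history')

connection = sqlite3.connect(history_db)
cursor = connection.cursor()
select_statement = ("SELECT urls.url FROM urls "
                    "WHERE urls.url NOT Like '%localhost%' "
                    "AND urls.url NOT Like '%google%' "
                    "AND urls.visit_count > 0 "
                    "AND urls.url <> 'https://www.reddit.com/';")
cursor.execute(select_statement)
links = [row[0] for row in cursor.fetchall()]
print(len(links), "links found")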

Conclusion
We learned how to extract text from a website (PDF or HTML). We built the script for two practical cases: using links from the Chrome web browser history, and using a list of links extracted from somewhere else, for example from Twitter search results. The next step would be to extract insights from the obtained text data using machine learning or text mining. For example, from Chrome history we could identify the questions a developer searches for most often in the web browser and create a faster way to access that information.

# -*- coding: utf-8 -*-

import os
import sqlite3
import operator
from collections import OrderedDict

import time
import csv

from bs4 import BeautifulSoup
from bs4.element import Comment
import requests
import re
import subprocess


MIN_LENGTH_of_document = 40
MIN_LENGTH_of_word = 2
USE_LINKS_FROM_CHROME_HISTORY = False #if false will use from csv file

def remove_min_words(txt):
   
   shortword = re.compile(r'\W*\b\w{1,1}\b')
   return(shortword.sub('', txt))


def clean_txt(text):
   text = re.sub('[^A-Za-z.  ]', ' ', text)
   text=' '.join(text.split())
   text = remove_min_words(text)
   text=text.lower()
   text = text if  len(text) >= MIN_LENGTH_of_document else ""
   return text

def tag_visible(element):
    if element.parent.name in ['style', 'script', 'head',  'meta', '[document]']:
        return False
    if isinstance(element, Comment):
        return False
    return True


  
    
def get_txt_from_pdf(url):
    myfile = requests.get(url, timeout=8)
    myfile_name = url.split("/")[-1]
    myfile_name_wout_ext = myfile_name[0:-4]
    pdf_path = 'C:\\Users\\username\\Downloads\\' + myfile_name
    txt_path = 'C:\\Users\\username\\Downloads\\' + myfile_name_wout_ext + ".txt"
    # save the downloaded pdf to the local drive
    open(pdf_path, 'wb').write(myfile.content)
    # run pdftotext.exe on the saved pdf, writing the extracted text next to it
    subprocess.call(['C:\\Users\\username\\pythonrun\\pdftotext' + '\\pdftotext', pdf_path, txt_path])
    with open(txt_path, 'r') as content_file:
        content = content_file.read()
    return content


def get_text(url):
   print (url) 
   
   try:
      req  = requests.get(url, timeout=5)
   except requests.exceptions.RequestException:
      # the request failed (timeout, connection error, invalid URL, ...)
      return "TIMEOUT ERROR"
  
   data = req.text
   soup = BeautifulSoup(data, "html.parser")
   texts = soup.findAll(text=True)
   visible_texts = filter(tag_visible, texts)  
   return u" ".join(t.strip() for t in visible_texts)


def parse(url):
	try:
		parsed_url_components = url.split('//')
		sublevel_split = parsed_url_components[1].split('/', 1)
		domain = sublevel_split[0].replace("www.", "")
		return domain
	except IndexError:
		print ("URL format error!")


def get_links_from_chrome_history():
   #path to user's history database (Chrome)
   data_path = os.path.expanduser('~')+"\\AppData\\Local\\Google\\Chrome\\User Data\\Default"
 
   history_db = os.path.join(data_path, 'history')

   #querying the db
   c = sqlite3.connect(history_db)
   cursor = c.cursor()
   select_statement = "SELECT urls.url FROM urls WHERE urls.url NOT Like '%localhost%' AND urls.url NOT Like '%google%' AND urls.visit_count > 0 AND urls.url <> 'https://www.reddit.com/' ;"
   cursor.execute(select_statement)

   results_tuples = cursor.fetchall() 
  
   return ([x[0] for x in results_tuples])
   
   
def get_links_from_csv_file():
   links_from_csv = []
   
   filename = 'C:\\Users\\username\\pythonrun\\links.csv'
   col_id=0
   with open(filename, newline='', encoding='utf-8-sig') as f:
      reader = csv.reader(f)
     
      try:
        for row in reader:
            
            links_from_csv.append(row[col_id])
      except csv.Error as e:
        print('file {}, line {}: {}'.format(filename, reader.line_num, e))
   return links_from_csv   
   
 
results=[]
if  USE_LINKS_FROM_CHROME_HISTORY:
        results =  get_links_from_chrome_history() 
        fname="data_from_chrome_history_links.csv"
else:
        results=get_links_from_csv_file()
        fname="data_from_file_links.csv"
        
        

sites_count = {} 
full_sites_count = {}



with open(fname, 'w', encoding="utf8", newline='' ) as csvfile: 
  fieldnames = ['URL', 'URL Base', 'TXT']
  writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
  writer.writeheader()

  
  count_url=0
  for url in results:    
      print (url)
      full_url=url
      url = parse(url)
      
      if full_url in full_sites_count:
            full_sites_count[full_url] += 1
      else:
            full_sites_count[full_url] = 1
          
            if url.endswith(".pdf"):
                  txt = get_txt_from_pdf(full_url)
            else:
                  txt = get_text(full_url)
            txt=clean_txt(txt)
            writer.writerow({'URL': full_url, 'URL Base': url, 'TXT': txt})
            time.sleep(4)
      
      
      
     
      if url in sites_count:
            sites_count[url] += 1
      else:
            sites_count[url] = 1
   
      count_url +=1

References
1. Analyze Chrome’s Browsing History with Python
2. XpdfReader
3. Python: Remove words from a string of length between 1 and a given number
4. BeautifulSoup Grab Visible Webpage Text
5. Web Scraping 101 with Python & Beautiful Soup
6. Downloading Files Using Python (Simple Examples)
7. Introduction to web scraping in Python
8. Ultimate guide to deal with Text Data (using Python) – for Data Scientists and Engineers

Twitter Text Mining with Python

In this post (and a few following posts) we will look at how to get interesting information by extracting links from the results of a Twitter keyword search and applying machine learning text mining. While there are many other posts on the same topic, we will also cover the additional small steps needed to process the data, such as unshortening URLs, setting a date interval, and saving or reading the information.

Below we will focus on extracting links from the results of the Twitter search API in Python.

Getting Login Information for Twitter API

The first step is to set up an application on Twitter and get the login information. This is already described in some posts on the web [1].
Below is the code snippet for this:

import tweepy as tw
    
CONSUMER_KEY ="xxxxx"
CONSUMER_SECRET ="xxxxxxx"
OAUTH_TOKEN = "xxxxx"
OAUTH_TOKEN_SECRET = "xxxxxx"

auth = tw.OAuthHandler(CONSUMER_KEY, CONSUMER_SECRET)
auth.set_access_token(OAUTH_TOKEN, OAUTH_TOKEN_SECRET)
api = tw.API(auth, wait_on_rate_limit=True)

Defining the Search Values

Now you can search by keywords or hashtags and get tweets.
When we search, we might want to specify a start day so that the results are dated on or after that day.

For this we can write code like the following:

from datetime import datetime
from datetime import timedelta

NUMBER_of_TWEETS = 20
SEARCH_BEHIND_DAYS=60
today_date=datetime.today().strftime('%Y-%m-%d')


today_date_datef = datetime.strptime(today_date, '%Y-%m-%d')
start_date = today_date_datef - timedelta(days=SEARCH_BEHIND_DAYS)


for search_term in search_terms:
  tweets = tw.Cursor(api.search,
                   q=search_term,
                   lang="en",
                   # tweepy's since parameter expects a date string (YYYY-MM-DD)
                   since=start_date.strftime('%Y-%m-%d')).items(NUMBER_of_TWEETS)

The above search returns 20 tweets per search term and looks back only 60 days from the day of the search.
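
If we want to use a fixed start date instead of the rolling window, the same call can be written as below (using the same api, search_term and NUMBER_of_TWEETS as above, and the example date mentioned in the text):

# Fixed start date instead of the computed rolling window
tweets = tw.Cursor(api.search,
                   q=search_term,
                   lang="en",
                   since='2019-12-01').items(NUMBER_of_TWEETS)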

Processing Extracted Links

Once we have the tweet text we can extract links. However, we will get different types of links: some are internal Twitter links, some are shortened, and some are regular URLs.

So here is the function to sort out the links. We do not need internal links – the links that belong to Twitter navigation or other functionality.

try:
    import urllib.request as urllib2
except ImportError:
    import urllib2


import http.client
import urllib.parse as urlparse

def unshortenurl(url):
    # send a HEAD request; if the response is a redirect (3xx) return its
    # Location header, otherwise return the url unchanged
    parsed = urlparse.urlparse(url)
    h = http.client.HTTPConnection(parsed.netloc)
    h.request('HEAD', parsed.path)
    response = h.getresponse()
    if response.status >= 300 and response.status < 400 and response.getheader('Location'):
        return response.getheader('Location')
    else:
        return url

Once we have the links we can save the URL information to a CSV file. Together with the link we save the tweet text and date.
Additionally, we count the number of hashtags and links and also save this information to CSV files, so the output of the program is three CSV files.
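
For the hashtag counting, the full script below relies on a small helper that pulls hashtags out of the tweet text; here is how it behaves on a made-up example tweet:

# Helper reused from the full script below; the example tweet is made up
def extract_hash_tags(s):
    return set(part[1:] for part in s.split() if part.startswith('#'))

print(extract_hash_tags("Trying a #chatbot demo with #python today"))
# prints a set such as {'chatbot', 'python'} (set order may vary)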

Conclusion

Looking at the output file we can quickly identify links of interest. For example, just while testing this script I found two interesting links that I was not aware of. In the following post we will look at how to automate finding useful links even further using Twitter text mining.

Below you can find the full source code and the references to web resources that were used for this post or are related to this topic.

# -*- coding: utf-8 -*-

import tweepy as tw
import re
import csv

from datetime import datetime
from datetime import timedelta

NUMBER_of_TWEETS = 20
SEARCH_BEHIND_DAYS=60
today_date=datetime.today().strftime('%Y-%m-%d')


today_date_datef = datetime.strptime(today_date, '%Y-%m-%d')
start_date = today_date_datef - timedelta(days=SEARCH_BEHIND_DAYS)
try:
    import urllib.request as urllib2
except ImportError:
    import urllib2


import http.client
import urllib.parse as urlparse   

def unshortenurl(url):
    parsed = urlparse.urlparse(url) 
    h = http.client.HTTPConnection(parsed.netloc) 
    h.request('HEAD', parsed.path) 
    response = h.getresponse() 
    if response.status >= 300 and response.status < 400 and response.getheader('Location'):
        return response.getheader('Location') 
    else: return url    
    
    
CONSUMER_KEY ="xxxxx"
CONSUMER_SECRET ="xxxxxxx"
OAUTH_TOKEN = "xxxxxxxx"
OAUTH_TOKEN_SECRET = "xxxxxxx"


auth = tw.OAuthHandler(CONSUMER_KEY, CONSUMER_SECRET)
auth.set_access_token(OAUTH_TOKEN, OAUTH_TOKEN_SECRET)
api = tw.API(auth, wait_on_rate_limit=True)
# Create a custom search term 

search_terms=["#chatbot -filter:retweets", 
              "#chatbot+machine_learning -filter:retweets", 
              "#chatbot+python -filter:retweets",
              "text classification -filter:retweets",
              "text classification python -filter:retweets",
              "machine learning applications -filter:retweets",
              "sentiment analysis python  -filter:retweets",
              "sentiment analysis  -filter:retweets"]
              
        
              
def count_urls():
       url_counted = dict() 
       url_count = dict()
       with open('data.csv', 'r', encoding="utf8" ) as csvfile: 
           line = csvfile.readline()
           while line != '':  # The EOF char is an empty string
            
               line = csvfile.readline()
               items=line.split(",")
               if len(items) < 3 :
                          continue
                           
               url=items[1]
               twt=items[2]
               # key =  Tweet and Url
               key=twt[:30] + "___" + url
               
               if key not in url_counted:
                      url_counted[key]=1
                      if url in url_count:
                           url_count[url] += 1
                      else:
                           url_count[url] = 1
       print_count_urls(url_count)             

       
def print_count_urls(url_count_data):
   
         for key, value in url_count_data.items():
              print (key, "=>", value)
              
         with open('data_url_count.csv', 'w', encoding="utf8", newline='' ) as csvfile_link_count: 
            fieldnames = ['URL', 'Count']
            writer = csv.DictWriter(csvfile_link_count, fieldnames=fieldnames)
            writer.writeheader() 
            
            for key, value in url_count_data.items():
                 writer.writerow({'URL': key, 'Count': value })   
            
           
def extract_hash_tags(s):
    return set(part[1:] for part in s.split() if part.startswith('#'))
    

   
def save_tweet_info(tw, twt_dict, htags_dict ):
   
    if tw not in twt_dict:
        htags=extract_hash_tags(tw)
        twt_dict[tw]=1
        for ht in htags:
            if ht in htags_dict:
                htags_dict[ht]=htags_dict[ht]+1
            else:   
                htags_dict[ht]=1


def print_count_hashtags(htags_count_data):
        
         for key, value in htags_count_data.items():
              print (key, "=>", value)
              
         with open('data_htags_count.csv', 'w', encoding="utf8", newline='' ) as csvfile_link_count: 
            fieldnames = ['Hashtag', 'Count']
            writer = csv.DictWriter(csvfile_link_count, fieldnames=fieldnames)
            writer.writeheader() 
            
            for key, value in htags_count_data.items():
                 writer.writerow({'Hashtag': key, 'Count': value })          
        


tweet_dict = dict() 
hashtags_dict = dict()

                 
for search_term in search_terms:
  tweets = tw.Cursor(api.search,
                   q=search_term,
                   lang="en",
                   #since='2019-12-01').items(40)
                   since=start_date.strftime('%Y-%m-%d')).items(NUMBER_of_TWEETS)

  with open('data.csv', 'a', encoding="utf8", newline='' ) as csvfile: 
     fieldnames = ['Search', 'URL', 'Tweet', 'Entered on']
     writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
     writer.writeheader()
     

     for tweet in tweets:
         urls = re.findall('http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+', tweet.text)
   
         save_tweet_info(tweet.text, tweet_dict, hashtags_dict ) 
         for url in urls:
          try:
            res = urllib2.urlopen(url)
            actual_url = res.geturl()
         
            if ( ("https://twitter.com" in actual_url) == False):
                
                if len(actual_url) < 32:
                    actual_url =unshortenurl(actual_url) 
                print (actual_url)
              
                writer.writerow({'Search': search_term, 'URL': actual_url, 'Tweet': tweet.text, 'Entered on': today_date })
              
          except:
              print (url)    

            
print_count_hashtags(hashtags_dict)
count_urls()      

References

1. Text mining: Twitter extraction and stepwise guide to generate a word cloud
2. Analyze Word Frequency Counts Using Twitter Data and Tweepy in Python
3. unshorten-url-in-python-3
4. how-can-i-un-shorten-a-url-using-python
5. extracting-external-links-from-tweets-in-python

Automatic Text Summarization with Python

Automatic text summarization is the process of shortening a text document with software, in order to create a summary with the major points of the original document. The main idea of summarization is to find a subset of data which contains the “information” of the entire set. Such techniques are widely used in industry today. [1]

In this post we will review several methods of implementing text summarization with Python, using several different Python libraries.

Text Summarization with Gensim

1. Our first example uses gensim – a well-known Python library for topic modeling. Below is an example with summarization.summarizer from gensim. This module provides functions for summarizing texts; summarization is based on ranking text sentences using a variation of the TextRank algorithm. [2]

TextRank is a general purpose graph-based ranking algorithm for NLP. Essentially, it runs PageRank on a graph specially designed for a particular NLP task. For keyphrase extraction, it builds a graph using some set of text units as vertices; edges are based on some measure of semantic or lexical similarity between the text unit vertices. [1]

 
from gensim.summarization.summarizer import summarize
from gensim.summarization import keywords

import requests

# getting text document from Internet
text = requests.get('http://rare-technologies.com/the_matrix_synopsis.txt').text


# getting text document from file
fname="C:\\Users\\TextRank-master\\wikipedia_deep_learning.txt"
with open(fname, 'r') as myfile:
      text=myfile.read()
    
    
#getting text document from web, below function based from 3
from bs4 import BeautifulSoup
from urllib.request import urlopen

def get_only_text(url):
 """ 
  return the title and the text of the article
  at the specified url
 """
 page = urlopen(url)
 soup = BeautifulSoup(page, "lxml")
 text = ' '.join(map(lambda p: p.text, soup.find_all('p')))
 return soup.title.text, text    

 
print ('Summary:')
print (summarize(text, ratio=0.01))

print ('\nKeywords:')
print (keywords(text, ratio=0.01))

url="https://en.wikipedia.org/wiki/Deep_learning"
text = get_only_text(url)

print ('Summary:')   
print (summarize(str(text), ratio=0.01))

print ('\nKeywords:')

# higher ratio => more keywords
print (keywords(str(text), ratio=0.01))

Here is the result for link https://en.wikipedia.org/wiki/Deep_learning
Summary:
In 2003, LSTM started to become competitive with traditional speech recognizers on certain tasks.[55] Later it was combined with connectionist temporal classification (CTC)[56] in stacks of LSTM RNNs.[57] In 2015, Google's speech recognition reportedly experienced a dramatic performance jump of 49% through CTC-trained LSTM, which they made available through Google Voice Search.[58] In the early 2000s, CNNs processed an estimated 10% to 20% of all the checks written in the US.[59] In 2006, Hinton and Salakhutdinov showed how a many-layered feedforward neural network could be effectively pre-trained one layer at a time, treating each layer in turn as an unsupervised restricted Boltzmann machine, then fine-tuning it using supervised backpropagation.[60] Deep learning is part of state-of-the-art systems in various disciplines, particularly computer vision and automatic speech recognition (ASR).

Keywords:
deep learning
learned
learn
learns
layer
layered
layers
models
model
modeling
images
image
recognition
data
networks
network
trained
training
train
trains

Text Summarization using NLTK and Frequencies of Words

2. Our second method is word frequency analysis, provided on The Glowing Python blog [3]. Below is an example of how it can be used. Note that you need the FrequencySummarizer code from [3], saved in a separate file named FrequencySummarizer.py in the same folder as this script. The code uses the NLTK library.

 
#note FrequencySummarizer is need to be copied from
# https://glowingpython.blogspot.com/2014/09/text-summarization-with-nltk.html
# and saved as FrequencySummarizer.py in the same folder that this
# script
from FrequencySummarizer import FrequencySummarizer


from bs4 import BeautifulSoup
from urllib.request import urlopen


def get_only_text(url):
 """ 
  return the title and the text of the article
  at the specified url
 """
 
 page = urlopen(url)
 soup = BeautifulSoup(page, "lxml")
 text = ' '.join(map(lambda p: p.text, soup.find_all('p')))
 
 print ("=====================")
 print (text)
 print ("=====================")

 return soup.title.text, text    

    
url="https://en.wikipedia.org/wiki/Deep_learning"
text = get_only_text(url)    

fs = FrequencySummarizer()
s = fs.summarize(str(text), 5)
print (s)

3. Here is the link to another example of building a summarizer with Python and NLTK.
This summarizer is also based on word frequencies: it creates a frequency table of words (how many times each word appears in the text) and assigns a score to each sentence depending on the words it contains and the frequency table.
The summary is then built only from the sentences above a certain score threshold. [6]
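
As an illustration of this approach, below is a minimal frequency-based summarizer sketch in the same spirit; it is not the code from [6], and it assumes the NLTK 'punkt' and 'stopwords' data have already been downloaded.

from collections import defaultdict
from nltk.corpus import stopwords
from nltk.tokenize import sent_tokenize, word_tokenize

def frequency_summarize(text, threshold=1.2):
    # build a frequency table of words, excluding stop words and punctuation
    stop_words = set(stopwords.words("english"))
    freq = defaultdict(int)
    for word in word_tokenize(text.lower()):
        if word.isalpha() and word not in stop_words:
            freq[word] += 1
    # score each sentence by the frequencies of the words it contains
    scores = {}
    for sent in sent_tokenize(text):
        for word in word_tokenize(sent.lower()):
            if word in freq:
                scores[sent] = scores.get(sent, 0) + freq[word]
    if not scores:
        return ""
    average = sum(scores.values()) / len(scores)
    # keep only the sentences scoring above a threshold relative to the average
    return " ".join(s for s in sent_tokenize(text) if scores.get(s, 0) > threshold * average)

print(frequency_summarize("Deep learning is a family of machine learning methods. "
                          "Deep learning models use many layers. "
                          "The weather was nice yesterday."))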

Automatic Summarization Using Different Methods from Sumy

4. Our next example is based on the sumy Python module, a module for automatic summarization of text documents and HTML pages. It is a simple library and command line utility for extracting summaries from HTML pages or plain texts, and the package also contains a simple evaluation framework for text summaries. Implemented summarization methods:

Luhn – heuristic method
Edmundson – heuristic method with previous statistic research
Latent Semantic Analysis (LSA)
LexRank – unsupervised approach inspired by the PageRank and HITS algorithms
TextRank
SumBasic – method that is often used as a baseline in the literature
KL-Sum – method that greedily adds sentences to a summary so long as it decreases the KL divergence. [5]

Below is an example of how to use the different summarizers. The usage of most of them is similar, but for EdmundsonSummarizer we also need to provide bonus_words, stigma_words and null_words. Bonus words are the words that we want to see in the summary – the most informative and significant words. Stigma words are unimportant words. We can use tf-idf values from information retrieval to get the list of key words.

 
from __future__ import absolute_import
from __future__ import division, print_function, unicode_literals

from sumy.parsers.html import HtmlParser
from sumy.parsers.plaintext import PlaintextParser
from sumy.nlp.tokenizers import Tokenizer
from sumy.summarizers.lsa import LsaSummarizer
from sumy.nlp.stemmers import Stemmer
from sumy.utils import get_stop_words

from sumy.summarizers.luhn import LuhnSummarizer
from sumy.summarizers.edmundson import EdmundsonSummarizer   #found this is the best as 
# it is picking from beginning also while other skip


LANGUAGE = "english"
SENTENCES_COUNT = 10


if __name__ == "__main__":
   
    url="https://en.wikipedia.org/wiki/Deep_learning"
  
    parser = HtmlParser.from_url(url, Tokenizer(LANGUAGE))
    # or for plain text files
    # parser = PlaintextParser.from_file("document.txt", Tokenizer(LANGUAGE))
   

       
    print ("--LsaSummarizer--")    
    summarizer = LsaSummarizer()
    summarizer = LsaSummarizer(Stemmer(LANGUAGE))
    summarizer.stop_words = get_stop_words(LANGUAGE)
    for sentence in summarizer(parser.document, SENTENCES_COUNT):
        print(sentence)
        
    print ("--LuhnSummarizer--")     
    summarizer = LuhnSummarizer() 
    summarizer = LsaSummarizer(Stemmer(LANGUAGE))
    summarizer.stop_words = ("I", "am", "the", "you", "are", "me", "is", "than", "that", "this",)
    for sentence in summarizer(parser.document, SENTENCES_COUNT):
        print(sentence)
        
    print ("--EdmundsonSummarizer--")     
    summarizer = EdmundsonSummarizer() 
    words = ("deep", "learning", "neural" )
    summarizer.bonus_words = words
    
    words = ("another", "and", "some", "next",)
    summarizer.stigma_words = words
   
    
    words = ("another", "and", "some", "next",)
    summarizer.null_words = words
    for sentence in summarizer(parser.document, SENTENCES_COUNT):
        print(sentence)     

I hope you enjoyed this review of automatic text summarization methods with Python. If you have any tips or anything else to add, please leave a comment below.

References
1. Automatic_summarization
2. Gensim
3. text-summarization-with-nltk
4. Nullege Python Search Code
5. sumy 0.7.0
6. Build a quick Summarizer with Python and NLTK
7. text-summarization-with-gensim

How to Convert Word to Vector with GloVe and Python

In the previous post we looked at vector representation of text with word embeddings using word2vec. Another approach that can be used to convert a word to a vector is GloVe – Global Vectors for Word Representation. Per the documentation on the GloVe home page [1], “GloVe is an unsupervised learning algorithm for obtaining vector representations for words. Training is performed on aggregated global word-word co-occurrence statistics from a corpus”. Thus we can convert a word to a vector using GloVe.

In this post we will look at how to use the pretrained GloVe data file that can be downloaded from [1].
We will look at how to get the word vector representation from this downloaded data file, and also how to get the nearest words. Why do we need a vector representation of text? Because this is what we input to machine learning or data science algorithms – we feed numerical vectors to algorithms such as text classification, clustering or other text analytics algorithms.

Loading Glove Datafile

The code that I put here is based on some examples that I found on StackOverflow [2].

So first you need to open the file and load data into the model. Then you can get the vector representation and other things.

Below is the full source code for the GloVe Python script:

file = "C:\\Users\\glove\\glove.6B.50d.txt"
import numpy as np
def loadGloveModel(gloveFile):
    print ("Loading Glove Model")
   
    
    with open(gloveFile, encoding="utf8" ) as f:
       content = f.readlines()
    model = {}
    for line in content:
        splitLine = line.split()
        word = splitLine[0]
        embedding = np.array([float(val) for val in splitLine[1:]])
        model[word] = embedding
    print ("Done.",len(model)," words loaded!")
    return model
    
    
model= loadGloveModel(file)   

print (model['hello'])

"""
Below is the output of the above code
Loading Glove Model
Done. 400000  words loaded!
[-0.38497   0.80092   0.064106 -0.28355  -0.026759 -0.34532  -0.64253
 -0.11729  -0.33257   0.55243  -0.087813  0.9035    0.47102   0.56657
  0.6985   -0.35229  -0.86542   0.90573   0.03576  -0.071705 -0.12327
  0.54923   0.47005   0.35572   1.2611   -0.67581  -0.94983   0.68666
  0.3871   -1.3492    0.63512   0.46416  -0.48814   0.83827  -0.9246
 -0.33722   0.53741  -1.0616   -0.081403 -0.67111   0.30923  -0.3923
 -0.55002  -0.68827   0.58049  -0.11626   0.013139 -0.57654   0.048833
  0.67204 ]
"""  

So we got numerical representation of word ‘hello’.
We can use also pandas to load GloVe file. Below are functions for loading with pandas and getting vector information.

import pandas as pd
import csv

words = pd.read_table(file, sep=" ", index_col=0, header=None, quoting=csv.QUOTE_NONE)


def vec(w):
  return words.loc[w].to_numpy()   # to_numpy() replaces the deprecated as_matrix()
 

print (vec('hello'))    #this will print same as print (model['hello'])  before
 

Finding Closest Word or Words

Now, how do we find the closest word to the word “table”? We iterate through the pandas dataframe, compute the deltas and then use the numpy argmin function.
The closest word to any word is always the word itself (delta = 0), so I needed to drop the word ‘table’ and also the next closest word, ‘tables’. The final output for the closest word was “place”.

words = words.drop("table", axis=0)  
words = words.drop("tables", axis=0)  

words_matrix = words.to_numpy()   # to_numpy() replaces the deprecated as_matrix()

def find_closest_word(v):
  diff = words_matrix - v
  delta = np.sum(diff * diff, axis=1)
  i = np.argmin(delta)
  return words.iloc[i].name 


print (find_closest_word(model['table']))
#output:  place

#If we want retrieve more than one closest words here is the function:

def find_N_closest_word(v, N, words):
  Nwords=[]  
  for w in range(N):  
     diff = words.to_numpy() - v
     delta = np.sum(diff * diff, axis=1)
     i = np.argmin(delta)
     Nwords.append(words.iloc[i].name)
     words = words.drop(words.iloc[i].name, axis=0)
    
  return Nwords
  
  
print (find_N_closest_word(model['table'], 10, words)) 

#Output:
#['table', 'tables', 'place', 'sit', 'set', 'hold', 'setting', 'here', 'placing', 'bottom']

We can also use the gensim word2vec library functionality after we convert the GloVe file to word2vec format:

from gensim.scripts.glove2word2vec import glove2word2vec
glove2word2vec(glove_input_file=file, word2vec_output_file="gensim_glove_vectors.txt")

###Finally, read the word2vec txt to a gensim model using KeyedVectors:

from gensim.models.keyedvectors import KeyedVectors
glove_model = KeyedVectors.load_word2vec_format("gensim_glove_vectors.txt", binary=False)
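
Once loaded this way, the usual gensim query methods are available on glove_model, for example (the words here are arbitrary examples):

# Example queries on the GloVe vectors loaded through gensim
print(glove_model.most_similar('table', topn=5))
print(glove_model.similarity('table', 'chair'))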

Difference between word2vec and GloVe

Both models learn geometrical encodings (vectors) of words from their co-occurrence information. They differ in the way how they learn this information. word2vec is using a “predictive” model (feed-forward neural network), whereas GloVe is using a “count-based” model (dimensionality reduction on the co-occurrence counts matrix). [3]

I hope you enjoyed reading this post about how to convert word to vector with GloVe and python. If you have any tips or anything else to add, please leave a comment below.

References
1. GloVe: Global Vectors for Word Representation
2. Load pretrained glove vectors in python
3. How is GloVe different from word2vec
4. Don’t count, predict! A systematic comparison of context-counting vs. context-predicting semantic vectors
5. Words Embeddings

K Means Clustering Example with Word2Vec in Data Mining or Machine Learning

In this post you will find a K means clustering example with word2vec in Python code. Word2Vec is one of the popular methods in language modeling and feature learning in natural language processing (NLP). This method is used to create word embeddings in machine learning whenever we need a vector representation of data.

For example, in data clustering algorithms we can use Word2Vec instead of the bag of words (BOW) model. The advantage of using Word2Vec is that it can capture the distance between individual words.

The example in this post demonstrates how to use the results of Word2Vec word embeddings in clustering algorithms. For this, the Word2Vec model will be fed into several K means clustering algorithms from the NLTK and Scikit-learn libraries.

Here we will do clustering at the word level, so our clusters will be groups of words. In case we need to cluster at the sentence or paragraph level, here is a link showing how to move from word level to sentence/paragraph level:

Text Clustering with Word Embedding in Machine Learning

There is also the doc2vec word embedding model, which is based on word2vec and created for embedding sentences/paragraphs/documents. Here is a link on how to use the doc2vec word embedding in machine learning:
Text Clustering with doc2vec Word Embedding Machine Learning Model

Getting Word2vec

Using word2vec from the Python library gensim is simple and well described in tutorials and on the web [3], [4], [5]. Here we just look at a basic example. For the input we use a sequence of sentences hard-coded in the script.

from gensim.models import Word2Vec

sentences = [['this', 'is', 'the', 'good', 'machine', 'learning', 'book'],
             ['this', 'is', 'another', 'book'],
             ['one', 'more', 'book'],
             ['this', 'is', 'the', 'new', 'post'],
             ['this', 'is', 'about', 'machine', 'learning', 'post'],
             ['and', 'this', 'is', 'the', 'last', 'post']]

model = Word2Vec(sentences, min_count=1)

Now we have a model with embedded words. We can query the model for similar words as below, or ask it to represent a word as a vector:

print (model.similarity('this', 'is'))
print (model.similarity('post', 'book'))
#output -0.0198180344218
#output -0.079446731287
print (model.most_similar(positive=['machine'], negative=[], topn=2))
#output: [('new', 0.24608060717582703), ('is', 0.06899910420179367)]
print (model['the'])
#output [-0.00217354 -0.00237131  0.00296396 ...,  0.00138597  0.00291924  0.00409528]

To get the vocabulary or the number of words in the vocabulary:

print (list(model.vocab))
print (len(list(model.vocab)))

This will produce: [‘good’, ‘this’, ‘post’, ‘another’, ‘learning’, ‘last’, ‘the’, ‘and’, ‘more’, ‘new’, ‘is’, ‘one’, ‘about’, ‘machine’, ‘book’]

Now we will feed the word embeddings into a clustering algorithm such as k means, which is one of the most popular unsupervised learning algorithms for finding interesting segments in data. It can be used for separating customers into groups, combining documents into topics and for many other applications.

Below you will find two k means clustering examples.

K Means Clustering with NLTK Library
Our first example uses the k means algorithm from the NLTK library.
To use the word2vec word embeddings in a machine learning clustering algorithm we initialize X as below:

X = model[model.vocab]

Now we can plug our X data into clustering algorithms.

from nltk.cluster import KMeansClusterer
import nltk
NUM_CLUSTERS=3
kclusterer = KMeansClusterer(NUM_CLUSTERS, distance=nltk.cluster.util.cosine_distance, repeats=25)
assigned_clusters = kclusterer.cluster(X, assign_clusters=True)
print (assigned_clusters)
# output: [0, 2, 1, 2, 2, 1, 2, 2, 0, 1, 0, 1, 2, 1, 2]

In the Python code above there are several options for the distance function, as described below:

nltk.cluster.util.cosine_distance(u, v)
Returns 1 minus the cosine of the angle between vectors v and u. This is equal to 1 – (u.v / |u||v|).

nltk.cluster.util.euclidean_distance(u, v)
Returns the euclidean distance between vectors u and v. This is equivalent to the length of the vector (u – v).
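
As a quick illustration of these two distance functions, here is a small check on toy vectors (the vectors are made up for demonstration):

import numpy as np
from nltk.cluster.util import cosine_distance, euclidean_distance

u = np.array([1.0, 0.0])
v = np.array([0.0, 1.0])
print(cosine_distance(u, v))     # 1.0 for orthogonal vectors
print(euclidean_distance(u, v))  # sqrt(2), about 1.4142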

Here we use the cosine distance to cluster our data.
After we get the cluster results we can associate each word with the cluster it was assigned to:

words = list(model.vocab)
for i, word in enumerate(words):  
    print (word + ":" + str(assigned_clusters[i]))

Here is the output for the above:
good:0
this:2
post:1
another:2
learning:2
last:1
the:2
and:2
more:0
new:1
is:0
one:1
about:2
machine:1
book:2

K Means Clustering with Scikit-learn Library

This example is based on k means from the scikit-learn library.

from sklearn import cluster
from sklearn import metrics
kmeans = cluster.KMeans(n_clusters=NUM_CLUSTERS)
kmeans.fit(X)

labels = kmeans.labels_
centroids = kmeans.cluster_centers_

print ("Cluster id labels for inputted data")
print (labels)
print ("Centroids data")
print (centroids)

print ("Score (Opposite of the value of X on the K-means objective which is Sum of distances of samples to their closest cluster center):")
print (kmeans.score(X))

silhouette_score = metrics.silhouette_score(X, labels, metric='euclidean')

print ("Silhouette_score: ")
print (silhouette_score)

In this example we also got some useful metrics to estimate clustering performance.
Output:

Cluster id labels for inputted data
[0 1 1 ..., 1 2 2]
Centroids data
[[ -3.82586889e-04   1.39791325e-03  -2.13839358e-03 ...,  -8.68172920e-04
   -1.23599875e-03   1.80053393e-03]
 [ -3.11774168e-04  -1.63297475e-03   1.76715955e-03 ...,  -1.43826099e-03
    1.22940990e-03   1.06353679e-03]
 [  1.91571176e-04   6.40696089e-04   1.38173658e-03 ...,  -3.26442620e-03
   -1.08828480e-03  -9.43636987e-05]]

Score (Opposite of the value of X on the K-means objective which is Sum of distances of samples to their closest cluster center):
-0.00894730946094
Silhouette_score: 
0.0427737

Here is the full python code of the script.

# -*- coding: utf-8 -*-



from gensim.models import Word2Vec

from nltk.cluster import KMeansClusterer
import nltk


from sklearn import cluster
from sklearn import metrics

# training data

sentences = [['this', 'is', 'the', 'good', 'machine', 'learning', 'book'],
			['this', 'is',  'another', 'book'],
			['one', 'more', 'book'],
			['this', 'is', 'the', 'new', 'post'],
          ['this', 'is', 'about', 'machine', 'learning', 'post'],  
			['and', 'this', 'is', 'the', 'last', 'post']]


# training model
model = Word2Vec(sentences, min_count=1)

# get vector data
X = model[model.vocab]
print (X)

print (model.similarity('this', 'is'))

print (model.similarity('post', 'book'))

print (model.most_similar(positive=['machine'], negative=[], topn=2))

print (model['the'])

print (list(model.vocab))

print (len(list(model.vocab)))




NUM_CLUSTERS=3
kclusterer = KMeansClusterer(NUM_CLUSTERS, distance=nltk.cluster.util.cosine_distance, repeats=25)
assigned_clusters = kclusterer.cluster(X, assign_clusters=True)
print (assigned_clusters)

words = list(model.vocab)
for i, word in enumerate(words):  
    print (word + ":" + str(assigned_clusters[i]))



kmeans = cluster.KMeans(n_clusters=NUM_CLUSTERS)
kmeans.fit(X)

labels = kmeans.labels_
centroids = kmeans.cluster_centers_

print ("Cluster id labels for inputted data")
print (labels)
print ("Centroids data")
print (centroids)

print ("Score (Opposite of the value of X on the K-means objective which is Sum of distances of samples to their closest cluster center):")
print (kmeans.score(X))

silhouette_score = metrics.silhouette_score(X, labels, metric='euclidean')

print ("Silhouette_score: ")
print (silhouette_score)

References
1. Word embedding
2. Comparative study of word embedding methods in topic segmentation
3. models.word2vec – Deep learning with word2vec
4. Word2vec Tutorial
5. How to Develop Word Embeddings in Python with Gensim
6. nltk.cluster package

Using Pretrained Word Embeddings in Machine Learning

In this post you will learn how to use pre-trained word embeddings in machine learning. Google provides a News corpus (3 billion running words) word vector model (3 million 300-dimension English word vectors).

Download the file from this link, word2vec-GoogleNews-vectors, and save it in some local folder. Open it with a zip program and extract the .bin file, so that instead of GoogleNews-vectors-negative300.bin.gz you have the file GoogleNews-vectors-negative300.bin.

Now you can use the snippet below to load this file with gensim. Change the file path to the actual folder where you saved the file in the previous step.

Gensim
Gensim is a Python library for topic modelling, document indexing and similarity retrieval with large corpora. It is a Python framework for fast vector space modelling.

The Python code snippet below demonstrates how to load the pretrained Google file into a model and then query the model, for example for the similarity between words.
# -*- coding: utf-8 -*-

import gensim

model = gensim.models.KeyedVectors.load_word2vec_format('C:\\Users\\GoogleNews-vectors-negative300.bin', binary=True)  # KeyedVectors replaces the old Word2Vec.load_word2vec_format

vocab = model.vocab.keys()
wordsInVocab = len(vocab)
print (wordsInVocab)
print (model.similarity('this', 'is'))
print (model.similarity('post', 'book'))

Output from the above code:
3000000
0.407970363878
0.0572043891977

You can do all other things the same way as if you were using your own trained word embeddings. The Google file, however, is big: 1.5 GB in its original compressed form and about 3.3 GB unzipped. On my 6 GB RAM laptop it took a while to run the code above, but it did run; some other commands, however, I was not able to run.
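
A couple of further example queries that work the same way on this pretrained model (the words are arbitrary examples, and the model is assumed to be loaded as above):

# Further example queries against the pretrained Google News vectors
print(model.most_similar(positive=['king'], topn=3))
print(model['computer'][:10])   # first 10 of the 300 dimensions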

See the post K Means Clustering Example with Word2Vec, which shows word embeddings in a machine learning algorithm: there the Word2Vec model is fed into several k-means clustering algorithms from the NLTK and Scikit-learn libraries.

GloVe and fastText Word Embedding in Machine Learning

Word2vec is not the only word embedding available for use. Below are a few links for other word embeddings.
In How to Convert Word to Vector with GloVe and Python you will find how to convert a word to a vector with GloVe – Global Vectors for Word Representation. A detailed example shows how to use the pretrained GloVe data file that can be downloaded.

One more link is FastText Word Embeddings for Text Classification with MLP and Python. In that post you will discover fastText word embeddings – how to load pretrained fastText vectors, get text embeddings and use them in a document classification example.

References
1. Google’s trained Word2Vec model in Python
2. word2vec-GoogleNews-vectors
3. gensim 3.1.0