How to Extract Text from a Website

Extracting data from the web with scripts (web scraping) is widely used today for many purposes. One part of this process is downloading the actual text from URLs, which is the topic of this post.

We will consider how this can be done using the following two cases:

  • Extracting information from visited links in the Chrome browser history.
  • Extracting information from a list of links. For example, in the previous post we looked at how to extract links from Twitter search results into a CSV file. This file will now be the source of links.

Below is the Python implementation of the main parts. It uses a few code snippets and posts from the web. References and the full source code are provided at the end.

Switching Between Cases
The script uses the variable USE_LINKS_FROM_CHROME_HISTORY to select the program flow. If USE_LINKS_FROM_CHROME_HISTORY is true, it extracts links from Chrome; otherwise it reads them from a file.

results = []
if USE_LINKS_FROM_CHROME_HISTORY:
    results = get_links_from_chrome_history()
    fname = "data_from_chrome_history_links.csv"
else:
    results = get_links_from_csv_file()
    fname = "data_from_file_links.csv"

Extracting Content From HTML Links
We use the Python library BeautifulSoup for processing HTML and the requests library for downloading it:

from bs4 import BeautifulSoup
from bs4.element import Comment
import requests

def tag_visible(element):
    if element.parent.name in ['style', 'script', 'head',  'meta', '[document]']:
        return False
    if isinstance(element, Comment):
        return False
    return True

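def get_text(url):
    print(url)

    try:
        req = requests.get(url, timeout=5)
    except requests.RequestException:
        # Covers timeouts, connection errors and malformed URLs.
        return "TIMEOUT ERROR"

    data = req.text
    soup = BeautifulSoup(data, "html.parser")
    # Collect every text node, then keep only the visible ones.
    texts = soup.find_all(string=True)
    visible_texts = filter(tag_visible, texts)
    return " ".join(t.strip() for t in visible_texts)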
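To see what the visibility filter does without making a network request, here is a minimal sketch of the same idea using only the standard library. It swaps BeautifulSoup for html.parser, so it is not the method used in the script; the names VisibleTextParser, visible_text and SAMPLE are made up for this illustration.

```python
from html.parser import HTMLParser

# Tags whose text content is never shown to the reader.
HIDDEN_TAGS = {"style", "script", "head"}


class VisibleTextParser(HTMLParser):
    """Collects text that is not inside a hidden tag (comments are skipped too)."""

    def __init__(self):
        super().__init__()
        self.hidden_depth = 0  # greater than 0 while inside a hidden tag
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in HIDDEN_TAGS:
            self.hidden_depth += 1

    def handle_endtag(self, tag):
        if tag in HIDDEN_TAGS and self.hidden_depth > 0:
            self.hidden_depth -= 1

    def handle_data(self, data):
        # HTML comments never reach this method; they go to handle_comment.
        text = data.strip()
        if text and self.hidden_depth == 0:
            self.chunks.append(text)


def visible_text(html):
    parser = VisibleTextParser()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


SAMPLE = (
    "<html><head><title>T</title><style>p {color: red}</style></head>"
    "<body><p>Hello</p><!-- hidden note -->"
    "<script>var x = 1;</script><p>world</p></body></html>"
)

print(visible_text(SAMPLE))
```

Only the two paragraph texts survive: the title, the style rules, the script body and the comment are all dropped, which is the same effect tag_visible has on the BeautifulSoup text nodes.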

Extracting Content from PDF Files with pdftotext and Python

Not every link returns an HTML page. Some lead to a PDF document, which needs a separate process for extracting the text. Several solutions are possible; here we use the pdftotext executable [2]. We create the function below and call it when the URL ends with ".pdf".

To do the actual conversion from PDF to text, we use subprocess.call and pass it the location of the pdftotext.exe file, the filename of the PDF and the filename of the new text file. Note that we first download the PDF to a file on the local drive.

import subprocess

def get_txt_from_pdf(url):
    myfile = requests.get(url, timeout=8)
    myfile_name = url.split("/")[-1]
    # Use full paths so pdftotext reads and writes in the download folder.
    pdf_path = 'C:\\Users\\username\\Downloads\\' + myfile_name
    txt_path = pdf_path[0:-4] + ".txt"
    with open(pdf_path, 'wb') as pdf_file:
        pdf_file.write(myfile.content)
    subprocess.call(['C:\\Users\\username\\pythonrun\\pdftotext\\pdftotext', pdf_path, txt_path])
    with open(txt_path, 'r') as content_file:
        content = content_file.read()
    return content

if full_url.endswith(".pdf"):
    txt = get_txt_from_pdf(full_url)
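Checking the end of the whole URL string misses links such as "paper.PDF" or "paper.pdf?download=1". A slightly more tolerant check, sketched below, looks only at the path part of the URL. The helper name looks_like_pdf is an assumption for this example and is not part of the script; it is still only a heuristic, since a server can return a PDF from any URL.

```python
from urllib.parse import urlparse


def looks_like_pdf(url):
    """Return True when the path component of the URL ends with .pdf."""
    path = urlparse(url).path  # drops the query string and fragment
    return path.lower().endswith(".pdf")


for link in [
    "https://example.com/docs/paper.pdf",
    "https://example.com/docs/paper.PDF?download=1",
    "https://example.com/page.html",
]:
    print(link, looks_like_pdf(link))
```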

Cleaning Extracted Text
Once the text is extracted from PDF or HTML, we need to remove text that is not useful.
The script implements the following processing steps:

  • remove non-content text such as scripts and HTML tags (HTML pages only)
  • remove non-text characters
  • remove repeated spaces
  • remove documents shorter than a minimum number of characters (MIN_LENGTH_of_document)
  • remove results of bad requests, for example when the request for a specific link was not successful but still resulted in some text.
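The character, whitespace and length steps above can be sketched in a few lines. This is a simplified illustration rather than the exact function from the full script: the function name clean and its min_length parameter are assumptions made for the example.

```python
import re

MIN_LENGTH = 40  # assumed threshold, mirroring MIN_LENGTH_of_document


def clean(text, min_length=MIN_LENGTH):
    # Keep letters, periods and spaces; everything else becomes a space.
    text = re.sub(r"[^A-Za-z. ]", " ", text)
    # Collapse runs of whitespace into single spaces.
    text = " ".join(text.split())
    text = text.lower()
    # Drop documents that are too short to be useful.
    return text if len(text) >= min_length else ""


print(repr(clean("Hello,   World!! 123", min_length=5)))
print(repr(clean("TIMEOUT ERROR")))
```

The second call shows how a failed request is discarded: the short placeholder string falls under the length threshold, so it is replaced with an empty string.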

Getting Links from Chrome History
To get the visited links, we query the Chrome browser's history database with a simple SQL statement. This is well described on other blogs; see the link in the references below [1].

Additionally, when extracting from Chrome history we need to remove links that are out of scope. For example, if you are collecting links that you used for reading about data mining, then links to your banking site or to friends on Facebook are not relevant.

To filter out unrelated links, we can add criteria to the SQL statement with NOT LIKE or <>, as below:
select_statement = "SELECT urls.url FROM urls WHERE urls.url NOT Like '%localhost%' AND urls.url NOT Like '%google%' AND urls.visit_count > 0 AND urls.url <> 'https://www.reddit.com/' ;"
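As the list of excluded sites grows, the statement can be assembled from Python lists instead of being edited by hand. The sketch below tries this on a throwaway in-memory table that has only the two columns the query uses; the real Chrome urls table has more columns, and the sample rows and the names excluded_patterns and excluded_exact are invented for the illustration.

```python
import sqlite3

# Throwaway in-memory table shaped like the two columns the query uses.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE urls (url TEXT, visit_count INTEGER)")
conn.executemany(
    "INSERT INTO urls VALUES (?, ?)",
    [
        ("http://localhost:8000/", 3),
        ("https://www.google.com/search?q=python", 5),
        ("https://www.reddit.com/", 2),
        ("https://example.org/data-mining", 1),
    ],
)

excluded_patterns = ["%localhost%", "%google%"]  # matched with NOT LIKE
excluded_exact = ["https://www.reddit.com/"]     # matched with <>

# One placeholder per excluded value; the values are passed separately.
where = ["urls.visit_count > 0"]
where += ["urls.url NOT LIKE ?"] * len(excluded_patterns)
where += ["urls.url <> ?"] * len(excluded_exact)
query = "SELECT urls.url FROM urls WHERE " + " AND ".join(where)

rows = conn.execute(query, excluded_patterns + excluded_exact).fetchall()
links = [row[0] for row in rows]
conn.close()

print(links)
```

Passing the values as parameters keeps quoting out of the SQL string, so adding another site to skip is a one-line change to one of the lists.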

Conclusion
We learned how to extract text from a website (PDF or HTML). We built the script for two practical cases: using links from the Chrome browser history, and using a list of links extracted from somewhere else, for example from Twitter search results. The next step would be to extract insights from the obtained text using machine learning or text mining. For example, from Chrome history we could identify the questions that a developer searches for most often and create a faster way to access that information.

Full Source Code
# -*- coding: utf-8 -*-

import os
import sqlite3
import operator
from collections import OrderedDict

import time
import csv

from bs4 import BeautifulSoup
from bs4.element import Comment
import requests
import re
import subprocess


MIN_LENGTH_of_document = 40
MIN_LENGTH_of_word = 2
USE_LINKS_FROM_CHROME_HISTORY = False #if false will use from csv file

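def remove_min_words(txt):
    # Remove words shorter than MIN_LENGTH_of_word, along with any
    # non-word characters in front of them.
    shortword = re.compile(r'\W*\b\w{1,%d}\b' % (MIN_LENGTH_of_word - 1))
    return shortword.sub('', txt)


def clean_txt(text):
    # Keep only letters, periods and spaces.
    text = re.sub('[^A-Za-z. ]', ' ', text)
    # Collapse repeated whitespace.
    text = ' '.join(text.split())
    text = remove_min_words(text)
    text = text.lower()
    # Discard documents that are too short to be useful.
    text = text if len(text) >= MIN_LENGTH_of_document else ""
    return text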

def tag_visible(element):
    if element.parent.name in ['style', 'script', 'head',  'meta', '[document]']:
        return False
    if isinstance(element, Comment):
        return False
    return True


  
    
def get_txt_from_pdf(url):
    myfile = requests.get(url, timeout=8)
    myfile_name = url.split("/")[-1]
    # Use full paths so pdftotext reads and writes in the download folder.
    pdf_path = 'C:\\Users\\username\\Downloads\\' + myfile_name
    txt_path = pdf_path[0:-4] + ".txt"
    with open(pdf_path, 'wb') as pdf_file:
        pdf_file.write(myfile.content)
    subprocess.call(['C:\\Users\\username\\pythonrun\\pdftotext\\pdftotext', pdf_path, txt_path])
    with open(txt_path, 'r') as content_file:
        content = content_file.read()
    return content


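def get_text(url):
    print(url)

    try:
        req = requests.get(url, timeout=5)
    except requests.RequestException:
        # Covers timeouts, connection errors and malformed URLs.
        return "TIMEOUT ERROR"

    data = req.text
    soup = BeautifulSoup(data, "html.parser")
    # Collect every text node, then keep only the visible ones.
    texts = soup.find_all(string=True)
    visible_texts = filter(tag_visible, texts)
    return " ".join(t.strip() for t in visible_texts)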


def parse(url):
    # Return the base domain of a URL, e.g. "example.com".
    try:
        parsed_url_components = url.split('//')
        sublevel_split = parsed_url_components[1].split('/', 1)
        domain = sublevel_split[0].replace("www.", "")
        return domain
    except IndexError:
        print("URL format error!")
        return ""


def get_links_from_chrome_history():
    # Path to the user's Chrome history database (Windows).
    data_path = os.path.expanduser('~') + "\\AppData\\Local\\Google\\Chrome\\User Data\\Default"
    history_db = os.path.join(data_path, 'history')

    # Query the database. Close Chrome first; the file may be locked while it runs.
    c = sqlite3.connect(history_db)
    cursor = c.cursor()
    select_statement = "SELECT urls.url FROM urls WHERE urls.url NOT Like '%localhost%' AND urls.url NOT Like '%google%' AND urls.visit_count > 0 AND urls.url <> 'https://www.reddit.com/' ;"
    cursor.execute(select_statement)
    results_tuples = cursor.fetchall()
    c.close()

    return [x[0] for x in results_tuples]
   
   
def get_links_from_csv_file():
    links_from_csv = []

    filename = 'C:\\Users\\username\\pythonrun\\links.csv'
    col_id = 0  # column that holds the link
    with open(filename, newline='', encoding='utf-8-sig') as f:
        reader = csv.reader(f)
        try:
            for row in reader:
                if row:  # skip blank lines
                    links_from_csv.append(row[col_id])
        except csv.Error as e:
            print('file {}, line {}: {}'.format(filename, reader.line_num, e))
    return links_from_csv
   
 
results = []
if USE_LINKS_FROM_CHROME_HISTORY:
    results = get_links_from_chrome_history()
    fname = "data_from_chrome_history_links.csv"
else:
    results = get_links_from_csv_file()
    fname = "data_from_file_links.csv"
        
        

sites_count = {} 
full_sites_count = {}



with open(fname, 'w', encoding="utf8", newline='') as csvfile:
    fieldnames = ['URL', 'URL Base', 'TXT']
    writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
    writer.writeheader()

    count_url = 0
    for full_url in results:
        print(full_url)
        url = parse(full_url)  # base domain of the link

        if full_url in full_sites_count:
            full_sites_count[full_url] += 1
        else:
            # First time we see this link: download, clean and save its text.
            full_sites_count[full_url] = 1

            # Check the full link here; the base domain never ends with ".pdf".
            if full_url.endswith(".pdf"):
                txt = get_txt_from_pdf(full_url)
            else:
                txt = get_text(full_url)
            txt = clean_txt(txt)
            writer.writerow({'URL': full_url, 'URL Base': url, 'TXT': txt})
            time.sleep(4)  # pause between requests

        if url in sites_count:
            sites_count[url] += 1
        else:
            sites_count[url] = 1

        count_url += 1

References
1. Analyze Chrome’s Browsing History with Python
2. XpdfReader
3. Python: Remove words from a string of length between 1 and a given number
4. BeautifulSoup Grab Visible Webpage Text
5. Web Scraping 101 with Python & Beautiful Soup
6. Downloading Files Using Python (Simple Examples)
7. Introduction to web scraping in Python
8. Ultimate guide to deal with Text Data (using Python) – for Data Scientists and Engineers