Extracting data from the web with scripts (web scraping) is widely used today for numerous purposes. One part of this process is downloading the actual text from URLs, and that is the topic of this post.
We will look at how this can be done using two example cases:
- Extracting information from the links visited in the Chrome browser history.
- Extracting information from a list of links in a file. For example, in the previous post we looked at how to extract links from Twitter search results into a csv file; that file will now be the source of links.
Below is the Python implementation of the main parts of the script. It draws on a few code snippets and posts from the web; references and the full source code are provided at the end.
Switching Between Cases
The script uses the variable USE_LINKS_FROM_CHROME_HISTORY to select the program flow. If USE_LINKS_FROM_CHROME_HISTORY is True, it extracts links from the Chrome history; otherwise it reads them from a file of links.
results = []
if USE_LINKS_FROM_CHROME_HISTORY:
    results = get_links_from_chrome_history()
    fname = "data_from_chrome_history_links.csv"
else:
    results = get_links_from_csv_file()
    fname = "data_from_file_links.csv"
Extracting Content From HTML Links
We use the Python libraries BeautifulSoup for processing HTML and requests for downloading it:
from bs4 import BeautifulSoup
from bs4.element import Comment
import requests

def tag_visible(element):
    # skip text nodes inside non-visible elements and HTML comments
    if element.parent.name in ['style', 'script', 'head', 'meta', '[document]']:
        return False
    if isinstance(element, Comment):
        return False
    return True

def get_text(url):
    print(url)
    try:
        req = requests.get(url, timeout=5)
    except requests.exceptions.RequestException:
        return "TIMEOUT ERROR"
    data = req.text
    soup = BeautifulSoup(data, "html.parser")
    # collect only the text nodes that are visible on the page
    texts = soup.findAll(text=True)
    visible_texts = filter(tag_visible, texts)
    return u" ".join(t.strip() for t in visible_texts)
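The function can then be called on any page. A minimal usage sketch, with a hypothetical example URL:

# hypothetical URL for illustration; any regular HTML page works the same way
page_text = get_text("https://en.wikipedia.org/wiki/Web_scraping")
print(page_text[:200])  # first 200 characters of the visible text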
Extracting Content from PDF Format with PDF to Text Python
Not all links return an HTML page; some lead to a PDF document. For these we need a different process to get the text. Several solutions are possible; here we will use the pdftotext executable [2]. With this method we create the function below and call it when the URL ends with ".pdf".
To do the actual conversion from PDF to text we use subprocess.call, providing the location of the pdftotext.exe file, the filename of the PDF file, and the filename of the new text file. Note that we first download the PDF page to a PDF file on the local drive.
import subprocess

def get_txt_from_pdf(url):
    # download the PDF to the local drive first
    myfile = requests.get(url, timeout=8)
    myfile_name = url.split("/")[-1]
    myfile_name_wout_ext = myfile_name[0:-4]
    download_dir = 'C:\\Users\\username\\Downloads\\'
    pdf_path = download_dir + myfile_name
    txt_path = download_dir + myfile_name_wout_ext + ".txt"
    open(pdf_path, 'wb').write(myfile.content)
    # convert the downloaded PDF to text with the pdftotext executable
    subprocess.call(['C:\\Users\\username\\pythonrun\\pdftotext\\pdftotext', pdf_path, txt_path])
    with open(txt_path, 'r') as content_file:
        content = content_file.read()
    return content

if full_url.endswith(".pdf"):
    txt = get_txt_from_pdf(full_url)
Cleaning Extracted Text
Once the text is extracted from the PDF or HTML, we need to remove the text that is not useful. The script implements the following processing steps:
- remove non-content text such as scripts, HTML tags, and comments (HTML pages only)
- remove non-text characters
- remove repeated spaces
- drop documents shorter than a minimum number of characters (MIN_LENGTH_of_document)
- drop bad request results, for example when the request for a specific link failed but still returned some text
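These steps are implemented in the remove_min_words and clean_txt helpers of the full script at the end of the post, shown here together with the constant they rely on:

import re

MIN_LENGTH_of_document = 40

def remove_min_words(txt):
    # remove words consisting of a single character
    shortword = re.compile(r'\W*\b\w{1,1}\b')
    return shortword.sub('', txt)

def clean_txt(text):
    # keep only letters, dots and spaces, then collapse repeated spaces
    text = re.sub('[^A-Za-z. ]', ' ', text)
    text = ' '.join(text.split())
    text = remove_min_words(text)
    text = text.lower()
    # drop documents shorter than MIN_LENGTH_of_document characters
    text = text if len(text) >= MIN_LENGTH_of_document else ""
    return text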
Getting Links from Chrome History
To get the visited links we query the Chrome browser history database with a simple SQL statement. This is well described on other blogs; see the references below [1].
Additionally, when extracting from Chrome history we need to remove links that are out of scope. For example, if you are extracting links you used for reading about data mining, the links where you accessed your banking site or your friends on Facebook are not relevant.
To filter out unrelated links we can add filtering criteria to the SQL statement using NOT Like with wildcard patterns or <>, as below:
select_statement = "SELECT urls.url FROM urls WHERE urls.url NOT Like '%localhost%' AND urls.url NOT Like '%google%' AND urls.visit_count > 0 AND urls.url <> 'https://www.reddit.com/' ;"
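This statement is executed in the get_links_from_chrome_history function of the full script, which connects to Chrome's History SQLite file under the default Windows profile:

import os
import sqlite3

def get_links_from_chrome_history():
    # path to user's history database (Chrome, Windows default profile)
    data_path = os.path.expanduser('~') + "\\AppData\\Local\\Google\\Chrome\\User Data\\Default"
    history_db = os.path.join(data_path, 'history')
    # querying the db
    c = sqlite3.connect(history_db)
    cursor = c.cursor()
    select_statement = "SELECT urls.url FROM urls WHERE urls.url NOT Like '%localhost%' AND urls.url NOT Like '%google%' AND urls.visit_count > 0 AND urls.url <> 'https://www.reddit.com/' ;"
    cursor.execute(select_statement)
    results_tuples = cursor.fetchall()
    return [x[0] for x in results_tuples]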
Conclusion
We learned how to extract text from websites (PDF or HTML). We built the script for two practical examples: using links from the Chrome browser history, and using a list of links extracted from elsewhere, for example from Twitter search results. The next step would be to extract insights from the obtained text data using machine learning or text mining. For example, from the Chrome history we could identify the questions a developer searches most frequently in the browser and create a faster way to access that information.
Full Source Code

# -*- coding: utf-8 -*-
import os
import sqlite3
import time
import csv
import re
import subprocess
from bs4 import BeautifulSoup
from bs4.element import Comment
import requests

MIN_LENGTH_of_document = 40
MIN_LENGTH_of_word = 2
USE_LINKS_FROM_CHROME_HISTORY = False  # if False, will use links from the csv file

def remove_min_words(txt):
    shortword = re.compile(r'\W*\b\w{1,1}\b')
    return shortword.sub('', txt)

def clean_txt(text):
    text = re.sub('[^A-Za-z. ]', ' ', text)
    text = ' '.join(text.split())
    text = remove_min_words(text)
    text = text.lower()
    text = text if len(text) >= MIN_LENGTH_of_document else ""
    return text

def tag_visible(element):
    if element.parent.name in ['style', 'script', 'head', 'meta', '[document]']:
        return False
    if isinstance(element, Comment):
        return False
    return True

def get_txt_from_pdf(url):
    myfile = requests.get(url, timeout=8)
    myfile_name = url.split("/")[-1]
    myfile_name_wout_ext = myfile_name[0:-4]
    download_dir = 'C:\\Users\\username\\Downloads\\'
    pdf_path = download_dir + myfile_name
    txt_path = download_dir + myfile_name_wout_ext + ".txt"
    open(pdf_path, 'wb').write(myfile.content)
    subprocess.call(['C:\\Users\\username\\pythonrun\\pdftotext\\pdftotext', pdf_path, txt_path])
    with open(txt_path, 'r') as content_file:
        content = content_file.read()
    return content

def get_text(url):
    print(url)
    try:
        req = requests.get(url, timeout=5)
    except requests.exceptions.RequestException:
        return "TIMEOUT ERROR"
    data = req.text
    soup = BeautifulSoup(data, "html.parser")
    texts = soup.findAll(text=True)
    visible_texts = filter(tag_visible, texts)
    return u" ".join(t.strip() for t in visible_texts)

def parse(url):
    try:
        parsed_url_components = url.split('//')
        sublevel_split = parsed_url_components[1].split('/', 1)
        domain = sublevel_split[0].replace("www.", "")
        return domain
    except IndexError:
        print("URL format error!")

def get_links_from_chrome_history():
    # path to user's history database (Chrome)
    data_path = os.path.expanduser('~') + "\\AppData\\Local\\Google\\Chrome\\User Data\\Default"
    history_db = os.path.join(data_path, 'history')
    # querying the db
    c = sqlite3.connect(history_db)
    cursor = c.cursor()
    select_statement = "SELECT urls.url FROM urls WHERE urls.url NOT Like '%localhost%' AND urls.url NOT Like '%google%' AND urls.visit_count > 0 AND urls.url <> 'https://www.reddit.com/' ;"
    cursor.execute(select_statement)
    results_tuples = cursor.fetchall()
    return [x[0] for x in results_tuples]

def get_links_from_csv_file():
    links_from_csv = []
    filename = 'C:\\Users\\username\\pythonrun\\links.csv'
    col_id = 0
    with open(filename, newline='', encoding='utf-8-sig') as f:
        reader = csv.reader(f)
        try:
            for row in reader:
                links_from_csv.append(row[col_id])
        except csv.Error as e:
            print('file {}, line {}: {}'.format(filename, reader.line_num, e))
    return links_from_csv

results = []
if USE_LINKS_FROM_CHROME_HISTORY:
    results = get_links_from_chrome_history()
    fname = "data_from_chrome_history_links.csv"
else:
    results = get_links_from_csv_file()
    fname = "data_from_file_links.csv"

sites_count = {}
full_sites_count = {}
with open(fname, 'w', encoding="utf8", newline='') as csvfile:
    fieldnames = ['URL', 'URL Base', 'TXT']
    writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
    writer.writeheader()
    count_url = 0
    for url in results:
        print(url)
        full_url = url
        url = parse(url)
        if full_url in full_sites_count:
            full_sites_count[full_url] += 1
        else:
            full_sites_count[full_url] = 1
        if full_url.endswith(".pdf"):
            txt = get_txt_from_pdf(full_url)
        else:
            txt = get_text(full_url)
        txt = clean_txt(txt)
        writer.writerow({'URL': full_url, 'URL Base': url, 'TXT': txt})
        time.sleep(4)
        if url in sites_count:
            sites_count[url] += 1
        else:
            sites_count[url] = 1
        count_url += 1
References
1. Analyze Chrome’s Browsing History with Python
2. XpdfReader
3. Python: Remove words from a string of length between 1 and a given number
4. BeautifulSoup Grab Visible Webpage Text
5. Web Scraping 101 with Python & Beautiful Soup
6. Downloading Files Using Python (Simple Examples)
7. Introduction to web scraping in Python
8. Ultimate guide to deal with Text Data (using Python) – for Data Scientists and Engineers