Sentiment Analysis with VADER

Sentiment analysis (also known as opinion mining) refers to the use of natural language processing, text analysis, and computational linguistics to systematically identify, extract, quantify, and study affective states and subjective information. [1] In short, sentiment analysis gives an objective idea of whether a text uses mostly positive, negative, or neutral language. [2]

Sentiment analysis software can help estimate people's opinions on events in the financial world, generate reports of relevant information, and analyze correlations between events and stock prices.



The problem

In this post we investigate how to extract information about a company and detect its sentiment. For each sentence or paragraph we will detect positivity or negativity by calculating a sentiment score, also called polarity. Polarity in sentiment analysis refers to identifying the sentiment orientation (positive, neutral, or negative) of the text.

Given a list of companies, we want to find the polarity of sentiment in text that mentions company names from the list. Below is a description of how this can be implemented.

Getting Data

We will use Google to collect data: we search Google via a script for documents matching some predefined keywords.
The search returns links that we save to an array.

try: 
    # search() is provided by the googlesearch module of the "google" package
    from googlesearch import search 
except ImportError:  
    print("No module named 'google' found") 
  
# the search query 
query = "financial_news Warren Buffett 2019"

links=[]  
# collect the top 10 result links, pausing between requests to avoid being blocked
for j in search(query, tld="co.in", num=10, stop=10, pause=6): 
    print(j) 
    links.append(j)

Preprocessing

After we have the links, we need to fetch the text documents and remove unneeded text and characters. In this step we strip HTML tags and invalid characters, but we keep the paragraph tags at first: using the p tags we divide each document into smaller text units, and only then remove them.

# requests, BeautifulSoup and the helper functions tag_visible, remove_tags and clean_txt
# are defined in the full source code at the end of the post
para_data=[]


def get_text(url):
   print (url) 
   
   try:
      req  = requests.get(url, timeout=5)
   except: 
      return "TIMEOUT ERROR"  
  
   data = req.text
   soup = BeautifulSoup(data, "html.parser")
   
   # keep only visible paragraph tags
   paras=[]
   paras_ = soup.find_all('p')
   filtered_paras= filter(tag_visible, paras_)
   for s in filtered_paras:
       paras.append(s)
   if len(paras) > 0:
      for i, para in enumerate(paras):
           para=remove_tags(para)
           # remove non text characters
           para_data.append(clean_txt(para))

Calculating Sentiment

Now we calculate the sentiment score using VADER (Valence Aware Dictionary and sEntiment Reasoner). VADER is a lexicon and rule-based sentiment analysis tool that is specifically attuned to sentiments expressed in social media. [3] Based on the calculated sentiment we build a plot. In this example we only build the plot for the first company name, which is Coca Cola.

from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer 
  

def sentiment_scores(sentence): 
    
    # Create a SentimentIntensityAnalyzer object. 
    sid_obj = SentimentIntensityAnalyzer() 
  
    # polarity_scores method of SentimentIntensityAnalyzer 
    # object gives a sentiment dictionary 
    # which contains pos, neg, neu, and compound scores. 
    sentiment_dict = sid_obj.polarity_scores(sentence) 
      
    print("Overall sentiment dictionary is : ", sentiment_dict) 
    print("sentence was rated as ", sentiment_dict['neg']*100, "% Negative") 
    print("sentence was rated as ", sentiment_dict['neu']*100, "% Neutral") 
    print("sentence was rated as ", sentiment_dict['pos']*100, "% Positive")

    
    # decide sentiment as positive, negative and neutral 
    if sentiment_dict['compound'] >= 0.05 : 
        print("Positive") 
        
    elif sentiment_dict['compound'] <= - 0.05 : 
        print("Negative") 
  
    else : 
        print("Neutral") 
    return sentiment_dict['compound'] 

Below you can find the full source code.

# -*- coding: utf-8 -*-
from bs4 import BeautifulSoup, NavigableString
from bs4.element import Comment
import requests
import re

TAG_RE = re.compile(r'<[^>]+>')

def remove_tags(text_string):
    print (text_string)
    
    return TAG_RE.sub('', str(text_string))

MIN_LENGTH_of_document = 40
MIN_LENGTH_of_word = 2

from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer 
  
# function to print sentiments 
# of the sentence. 
# below function based on function from https://www.geeksforgeeks.org/python-sentiment-analysis-using-vader/
def sentiment_scores(sentence): 
    
    # Create a SentimentIntensityAnalyzer object. 
    sid_obj = SentimentIntensityAnalyzer() 
  
    # polarity_scores method of SentimentIntensityAnalyzer 
    # object gives a sentiment dictionary 
    # which contains pos, neg, neu, and compound scores. 
    sentiment_dict = sid_obj.polarity_scores(sentence) 
      
    print("Overall sentiment dictionary is : ", sentiment_dict) 
    print("sentence was rated as ", sentiment_dict['neg']*100, "% Negative") 
    print("sentence was rated as ", sentiment_dict['neu']*100, "% Neutral") 
    print("sentence was rated as ", sentiment_dict['pos']*100, "% Positive")
    print("Sentence Overall Rated As", end = " ") 
  
    return sentiment_dict['compound']    

def remove_min_words(txt):
   # https://www.w3resource.com/python-exercises/re/python-re-exercise-49.php
   shortword = re.compile(r'\W*\b\w{1,1}\b')
   return(shortword.sub('', txt))        
    
        
def clean_txt(text):
  
   text = re.sub('[^A-Za-z.  ]', ' ', str(text))
   text=' '.join(text.split())
   text = remove_min_words(text)
   text=text.lower()
   text = text if  len(text) >= MIN_LENGTH_of_document else ""
   return text
        

def between(cur, end):
    while cur and cur != end:
        if isinstance(cur, NavigableString):
            text = cur.strip()
            if len(text):
                yield text
        cur = cur.next_element
        
def next_element(elem):
    while elem is not None:
        # Find next element, skip NavigableString objects
        elem = elem.next_sibling
        if hasattr(elem, 'name'):
            return elem        

def tag_visible(element):
    if element.parent.name in ['style', 'script', 'head',  'meta', '[document]']:
        return False
    if isinstance(element, Comment):
        return False
    return True
    
para_data=[]


def get_text(url):
   print (url) 
   
   try:
      req  = requests.get(url, timeout=5)
   except: 
      return "TIMEOUT ERROR"  
  
   data = req.text
   soup = BeautifulSoup(data, "html.parser")
   
     
   paras=[]
   paras_ = soup.find_all('p')
   filtered_paras= filter(tag_visible, paras_)
   for s in filtered_paras:
       paras.append(s)
   if len(paras) > 0:
      for i, para in enumerate(paras):
           para=remove_tags(para)
           # remove non text characters
           para_data.append(clean_txt(para))
           
     

try: 
    from googlesearch import search 
except ImportError:  
    print("No module named 'google' found") 
  
# to search 
query = "coca cola 2019"

links=[]  
for j in search(query, tld="co.in", num=25, stop=25, pause=6): 
    print(j) 
    links.append(j)
  
# Here our list consists of one company name, but it can include more than one.  
orgs=["coca cola" ]    
 
results=[] 
count=0  


def update_dict_value( dict, key, value):
    if key in dict:
           dict[key]= dict[key]+value
    else:
           dict[key] =value
    return dict

# fetch paragraphs from all links; get_text() appends them to para_data
for link in links:
    get_text(link)

# score every paragraph that mentions one of the companies
for pr in para_data:
    for org in orgs:
        if pr.find(org) >= 0:
            results.append([org, sentiment_scores(pr), pr])


positive={}
negative={}
positive_sentiment={}
negative_sentiment={}

  
for i in range(len(results)):
    org = results[i][0]
   
    if (results[i][1] >=0):
        positive = update_dict_value( positive, org, 1)
        positive_sentiment =  update_dict_value( positive_sentiment, org,results[i][1])

    else:
        negative = update_dict_value( negative, org, 1)
        negative_sentiment =  update_dict_value( negative_sentiment, org,results[i][1])

for org in orgs:
    # average the accumulated scores; guard against companies with no positive or negative paragraphs
    if org in positive:
        positive_sentiment[org] = positive_sentiment[org] / positive[org]
    else:
        positive_sentiment[org] = 0
    if org in negative:
        negative_sentiment[org] = negative_sentiment[org] / negative[org]
    else:
        negative_sentiment[org] = 0

import matplotlib.pyplot as plt 


# x-coordinates of left sides of bars  
labels = ['negative', 'positive'] 
  
# heights of bars 
sentiment = [(-1)*negative_sentiment[orgs[0]], positive_sentiment[orgs[0]]] 


# labels for bars 
tick_label = ['negative', 'positive'] 
  
# plotting a bar chart 
plt.bar(labels, sentiment, tick_label = tick_label, 
        width = 0.8, color = ['red', 'green']) 
  
# naming the x-axis 
plt.xlabel('sentiment polarity') 
# naming the y-axis 
plt.ylabel('average sentiment score') 
# plot title 
plt.title('Sentiment Analysis') 
  
# function to show the plot 
plt.show() 

References
1. Sentiment analysis Wikipedia
2. What is a “Sentiment Score” and how is it measured?
3. VADER-Sentiment-Analysis

How to Search Text Documents with Whoosh

Whoosh is a Python library of classes and functions for indexing text and then searching the index. If an application requires text document search functionality, the Whoosh module can be used for this task. This post summarizes the main steps needed to implement search with Whoosh.

Text Search

Using Whoosh consists of indexing documents and then querying (searching) the index.
First we need to import the required modules:

from whoosh.fields import Schema, TEXT, ID
from whoosh import index

To index documents we need to define a folder where the index files will be saved.

import os.path
if not os.path.exists("indexdir"):
    os.mkdir("indexdir")

We also need to define a Schema – the set of all possible fields in a document.

The schema specifies the fields of documents in an index. Each document can have multiple fields, such as title, content, url, date, etc.

schema = Schema(title=TEXT(stored=True), path=ID(stored=True), content=TEXT(stored = True))


ix = index.create_in("indexdir", schema)

writer = ix.writer()
writer.add_document(title=u"My document", content=u"This is my python document! hello big world",
                    path=u"/a")
writer.add_document(title=u"Second try", content=u"This is the second example hello world.",
                    path=u"/b")
writer.add_document(title=u"Third time's the charm", content=u"More examples. Examples are many.",
                    path=u"/c")

writer.commit()

Once the index is created, we can search it with a query:

from whoosh.qparser import QueryParser

with ix.searcher() as searcher:
     query = QueryParser("content", ix.schema).parse("hello world")
     results = searcher.search(query, terms=True)
     print(results[0])
    
     for r in results:
         print (r, r.score)
         # Was this results object created with terms=True?
         if results.has_matched_terms():
            # What terms matched in the results?
            print(results.matched_terms())
        
     # What terms matched in each hit?
     print ("matched terms")
     for hit in results:
        print(hit.matched_terms())

The output that we get:

<Hit {'path': '/b', 'title': 'Second try', 'content': 'This is the second example hello world.'}>
<Hit {'path': '/b', 'title': 'Second try', 'content': 'This is the second example hello world.'}> 2.124137931034483
{('content', b'hello'), ('content', b'world')}
<Hit {'path': '/a', 'title': 'My document', 'content': 'This is my python document! hello big world'}> 1.7906976744186047
{('content', b'hello'), ('content', b'world')}
matched terms
[('content', b'hello'), ('content', b'world')]
[('content', b'hello'), ('content', b'world')]

Whoosh has many features that can enhance searching. For example, we can get more documents like a certain search hit. This requires that the field you want to match on is vectored or stored, or that you have access to the original text (such as from a database). In the example below, more_like_this() is used for this.

print ("more_results")
     first_hit = results[0]
     more_results = first_hit.more_like_this("content")
     print (more_results)   

Output:

more_results
<Top 1 Results for Or([Term('content', 'example', boost=0.6588835188105945), Term('content', 'second', boost=0.6588835188105945), Term('content', 'hello', boost=0.5617184491361429), Term('content', 'world', boost=0.5617184491361429)]) runtime=0.0038603000000136944>  

If we want to know the number of matched documents we can call len(results), but on very large indexes this can cause a delay. There is a way to avoid it by getting just a low and high estimate.

found = results.scored_length()
if results.has_exact_length():
    print("Scored", found, "of exactly", len(results), "documents")
else:
    low = results.estimated_min_length()
    high = results.estimated_length()

    print("Scored", found, "of between", low, "and", high, "documents")    

Below you can find the full Python source code for the above, plus references to the Whoosh documentation and other articles about Whoosh, including how to use Whoosh with pandas and how to use Whoosh with web2py for a web crawling project.

# -*- coding: utf-8 -*-

from whoosh.fields import Schema, TEXT, ID
from whoosh import index

#To create an index in a directory, use index.create_in:

import os.path

if not os.path.exists("indexdir"):
    os.mkdir("indexdir")
    
schema = Schema(title=TEXT(stored=True), path=ID(stored=True), content=TEXT(stored = True))


ix = index.create_in("indexdir", schema)

writer = ix.writer()
writer.add_document(title=u"My document", content=u"This is my python document! hello big world",
                    path=u"/a")
writer.add_document(title=u"Second try", content=u"This is the second example hello world.",
                    path=u"/b")
writer.add_document(title=u"Third time's the charm", content=u"More examples. Examples are many.",
                    path=u"/c")

writer.commit()


from whoosh.qparser import QueryParser

with ix.searcher() as searcher:
     query = QueryParser("content", ix.schema).parse("hello world")
     results = searcher.search(query, terms=True)
     print(results[0])

     for r in results:
         print (r, r.score)
         # Was this results object created with terms=True?
         if results.has_matched_terms():
            # What terms matched in the results?
            print(results.matched_terms())
        
     # What terms matched in each hit?
     print ("matched terms")
     for hit in results:
        print(hit.matched_terms())

     

     print ("more_results")
     first_hit = results[0]
     more_results = first_hit.more_like_this("content")
     print (more_results)     
        
    
     # keep working with the results while the searcher is still open
     found = results.scored_length()
     if results.has_exact_length():
         print("Scored", found, "of exactly", len(results), "documents")
     else:
         low = results.estimated_min_length()
         high = results.estimated_length()

         print("Scored", found, "of between", low, "and", high, "documents")

References

1. Quickstart
2. Developing a fast Indexing and Full text Search Engine with Whoosh: A Pure-Python Library
3. Whoosh , Pandas, and Redshift: Implementing Full Text Search in a Relational Database
4. USING WHOOSH WITH WEB2PY

Running R Package POMDP from Python


Chatbots are now used in many applications for different purposes. The popularity of this type of widget can be estimated from the following fact:
As of August 2019, Google search results for these keywords were:

  • chatbot – volume: 246,000 searches per month and 32,700,000 results
  • neural net – volume: 3,600 searches per month and 127,000,000 results

    One of the key components of chatbot architecture is Dialog Management. In order to incorporate some degree of uncertainty when classifying intents and entities, the partially observable Markov decision process (POMDP) algorithm was proposed for building chatbots with Dialog Management. [1],[2]

    How can we run POMDP? There is an R package for this.

    But what if we want to run it from Python? One way is to call R from Python. Here is how it can be done.

    First we need to create the R program. To run POMDP in R we first need to download and install the package (this is needed only once). We also need to set the working directory (if it is different from the default) and specify a pdf file name – this is where the R script will output charts and diagrams. See the lines below:

    r = getOption("repos")
    r["CRAN"] = "http://cran.us.r-project.org"
    options(repos = r)
    
    setwd ("C://Users//username//Documents")
    
    ## the below line needs to be run only the first time and it must be run with admin privileges
    ##install.packages("pomdp")
    library("pomdp")
    pdf('rplot123.pdf')
    

    Now we can continue with our R program. The documentation for POMDP [3] has R code for the Tiger problem, which we can insert here.

    Running POMDP R Package from Python

    Now we need to create the Python script that will call the R script. In addition to calling the R program, Python will also display its output. This is possible because when we call the R program we redirect its output to a txt file, so the Python script can then read this file and print it on the screen.

    # -*- coding: utf-8 -*-
    import os
    
    
    os.system('"C:\\Program Files\\R\\R-3.4.3\\bin\\Rscript" C:\\Users\\username\\POMDP_R\\r_example.r > C:\\Users\\username\\output_file_ex.txt')
    
    with open('output_file_ex.txt', 'r') as reader:
                 print(reader.read())
    

    Now we can run the R package POMDP, or any other R program, from Python and start building an advanced chatbot dialog system.
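    To capture the R output directly in Python, without going through an intermediate txt file, the subprocess module can also be used. Below is a minimal sketch of this variation; the Rscript path and the R script path are placeholders that depend on the local installation.

    # -*- coding: utf-8 -*-
    # Alternative to os.system: run Rscript and capture its output directly (Python 3.7+)
    import subprocess

    result = subprocess.run(
        ["C:\\Program Files\\R\\R-3.4.3\\bin\\Rscript.exe",
         "C:\\Users\\username\\POMDP_R\\r_example.r"],
        capture_output=True, text=True)

    print(result.stdout)   # everything the R script printed
    print(result.stderr)   # warnings or errors from R, if any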

    References
    1. Dialog management
    2. An Improved Approach of Intention Discovery with Machine Learning for POMDP-based Dialogue Management
    3. POMDP: Introduction to Partially Observable Markov Decision Processes

    How to Extract Text from Website

    Extracting data from the Web using scripts (web scraping) is widely used today for numerous purposes. One part of this process is downloading the actual text from urls, and this is the topic of this post.

    We will consider how it can be done using the following case examples:
    Extracting information from links visited in the Chrome browser history.

    Extracting information from a list of links. For example, in the previous post we looked at how to extract links from Twitter search results into a csv file. This file will now be the source of links.

    Below follows the Python implementation of the main parts. It uses a few code snippets and posts from the web. References and the full source code are provided at the end.

    Switching Between Cases
    The script uses the variable USE_LINKS_FROM_CHROME_HISTORY to select the program flow. If USE_LINKS_FROM_CHROME_HISTORY is true, it extracts links from Chrome history; otherwise it uses the file with links.

    results=[]
    if  USE_LINKS_FROM_CHROME_HISTORY:
            results =  get_links_from_chrome_history() 
            fname="data_from_chrome_history_links.csv"
    else:
            results=get_links_from_csv_file()
            fname="data_from_file_links.csv"
    

    Extracting Content From HTML Links
    We use the Python library BeautifulSoup for processing HTML and the requests library for downloading it:

    from bs4 import BeautifulSoup
    from bs4.element import Comment
    import requests
    
    def tag_visible(element):
        if element.parent.name in ['style', 'script', 'head',  'meta', '[document]']:
            return False
        if isinstance(element, Comment):
            return False
        return True
    
    def get_text(url):
       print (url) 
       
       try:
          req  = requests.get(url, timeout=5)
       except: 
          return "TIMEOUT ERROR"  
      
       data = req.text
       soup = BeautifulSoup(data, "html.parser")
       texts = soup.findAll(text=True)
       visible_texts = filter(tag_visible, texts)  
       return u" ".join(t.strip() for t in visible_texts)
    

    Extracting Content from PDF Format with PDF to Text Python

    Not all links lead to an html page. Some might point to a pdf document. For these we need a specific process for getting text from pdf; several solutions are possible. Here we use the pdftotext exe file. [2] With this method we create the function below and call it when the url ends with ".pdf".

    To make the actual conversion from pdf to txt we use subprocess.call and provide the location of pdftotext.exe, the filename of the pdf file and the filename of the new txt file. Note that we first download the pdf page to a pdf file on the local drive.

    import subprocess
    def get_txt_from_pdf(url):
        myfile = requests.get(url, timeout=8)
        myfile_name=url.split("/")[-1] 
        myfile_name_wout_ext=myfile_name[0:-4]
        pdf_path = 'C:\\Users\\username\\Downloads\\' + myfile_name
        txt_path = 'C:\\Users\\username\\Downloads\\' + myfile_name_wout_ext + ".txt"
        # save the pdf locally, convert it with pdftotext.exe, then read the text back
        open(pdf_path, 'wb').write(myfile.content)
        subprocess.call(['C:\\Users\\username\\pythonrun\\pdftotext' + '\\pdftotext', pdf_path, txt_path])
        with open(txt_path, 'r') as content_file:
            content = content_file.read()
        return content  

    if full_url.endswith(".pdf"):
        txt = get_txt_from_pdf(full_url)
    

    Cleaning Extracted Text
    Once text is extracted from pdf or html we need to remove text that is not useful.
    Below are the processing actions implemented in the script:

    • remove non-content text like scripts and html tags (only for html pages)
    • remove non-text characters
    • remove repeating spaces
    • remove documents whose size is less than some minimum number of characters (MIN_LENGTH_of_document)
    • remove bad request results – for example when the request for a specific link was not successful but still returned some text

    Getting Links from Chrome History
    To get visited links we query the Chrome web browser database with a simple SQL statement. This is well described on some other blogs; you can find a link in the references below [1].

    Additionally, when extracting from Chrome history we need to remove links that are out of scope. For example, if you are extracting links that you used for reading about data mining, then links where you access your banking site or friends on Facebook are not related.

    To filter out unrelated links we can add filtering criteria to the SQL statement with NOT LIKE or <> as below (a minimal sketch of running this query follows):
    select_statement = "SELECT urls.url FROM urls WHERE urls.url NOT Like '%localhost%' AND urls.url NOT Like '%google%' AND urls.visit_count > 0 AND urls.url <> 'https://www.reddit.com/' ;"
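    Here is a minimal sketch of running this query with sqlite3; it mirrors get_links_from_chrome_history in the full source below and assumes a default Chrome profile location on Windows (Chrome should be closed, since it locks the database).

    import os
    import sqlite3

    # path to the user's Chrome history database (default profile on Windows)
    data_path = os.path.expanduser('~') + "\\AppData\\Local\\Google\\Chrome\\User Data\\Default"
    history_db = os.path.join(data_path, 'history')

    select_statement = ("SELECT urls.url FROM urls WHERE urls.url NOT Like '%localhost%' "
                        "AND urls.url NOT Like '%google%' AND urls.visit_count > 0 "
                        "AND urls.url <> 'https://www.reddit.com/';")

    c = sqlite3.connect(history_db)
    cursor = c.cursor()
    cursor.execute(select_statement)
    links = [row[0] for row in cursor.fetchall()]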

    Conclusion
    We learned how to extract text from a website (pdf or html). We built the script for two practical examples: when we use links from the Chrome web browser history, and when we have a list of links extracted from somewhere else, for example from Twitter search results. The next step would be to extract insights from the obtained text data using machine learning or text mining. For example, from Chrome history we could identify the questions a developer searches most often in the web browser and create a faster way to access that information.

    # -*- coding: utf-8 -*-
    
    import os
    import sqlite3
    import operator
    from collections import OrderedDict
    
    import time
    import csv
    
    from bs4 import BeautifulSoup
    from bs4.element import Comment
    import requests
    import re
    import subprocess
    
    
    MIN_LENGTH_of_document = 40
    MIN_LENGTH_of_word = 2
    USE_LINKS_FROM_CHROME_HISTORY = False #if false will use from csv file
    
    def remove_min_words(txt):
       
       shortword = re.compile(r'\W*\b\w{1,1}\b')
       return(shortword.sub('', txt))
    
    
    def clean_txt(text):
       text = re.sub('[^A-Za-z.  ]', ' ', text)
       text=' '.join(text.split())
       text = remove_min_words(text)
       text=text.lower()
       text = text if  len(text) >= MIN_LENGTH_of_document else ""
       return text
    
    def tag_visible(element):
        if element.parent.name in ['style', 'script', 'head',  'meta', '[document]']:
            return False
        if isinstance(element, Comment):
            return False
        return True
    
    
      
        
    def get_txt_from_pdf(url):
        myfile = requests.get(url, timeout=8)
        myfile_name=url.split("/")[-1] 
        myfile_name_wout_ext=myfile_name[0:-4]
        pdf_path = 'C:\\Users\\username\\Downloads\\' + myfile_name
        txt_path = 'C:\\Users\\username\\Downloads\\' + myfile_name_wout_ext + ".txt"
        # save the pdf locally, convert it with pdftotext.exe, then read the text back
        open(pdf_path, 'wb').write(myfile.content)
        subprocess.call(['C:\\Users\\username\\pythonrun\\pdftotext' + '\\pdftotext', pdf_path, txt_path])
        with open(txt_path, 'r') as content_file:
            content = content_file.read()
        return content    
    
    
    def get_text(url):
       print (url) 
       
       try:
          req  = requests.get(url, timeout=5)
       except: 
          return "TIMEOUT ERROR"  
      
       data = req.text
       soup = BeautifulSoup(data, "html.parser")
       texts = soup.findAll(text=True)
       visible_texts = filter(tag_visible, texts)  
       return u" ".join(t.strip() for t in visible_texts)
    
    
    def parse(url):
        try:
            parsed_url_components = url.split('//')
            sublevel_split = parsed_url_components[1].split('/', 1)
            domain = sublevel_split[0].replace("www.", "")
            return domain
        except IndexError:
            print ("URL format error!")
    
    
    def get_links_from_chrome_history():
       #path to user's history database (Chrome)
       data_path = os.path.expanduser('~')+"\\AppData\\Local\\Google\\Chrome\\User Data\\Default"
     
       history_db = os.path.join(data_path, 'history')
    
       #querying the db
       c = sqlite3.connect(history_db)
       cursor = c.cursor()
       select_statement = "SELECT urls.url FROM urls WHERE urls.url NOT Like '%localhost%' AND urls.url NOT Like '%google%' AND urls.visit_count > 0 AND urls.url <> 'https://www.reddit.com/' ;"
       cursor.execute(select_statement)
    
       results_tuples = cursor.fetchall() 
      
       return ([x[0] for x in results_tuples])
       
       
    def get_links_from_csv_file():
       links_from_csv = []
       
       filename = 'C:\\Users\\username\\pythonrun\\links.csv'
       col_id=0
       with open(filename, newline='', encoding='utf-8-sig') as f:
          reader = csv.reader(f)
         
          try:
            for row in reader:
                
                links_from_csv.append(row[col_id])
          except csv.Error as e:
            print('file {}, line {}: {}'.format(filename, reader.line_num, e))
       return links_from_csv   
       
     
    results=[]
    if  USE_LINKS_FROM_CHROME_HISTORY:
            results =  get_links_from_chrome_history() 
            fname="data_from_chrome_history_links.csv"
    else:
            results=get_links_from_csv_file()
            fname="data_from_file_links.csv"
            
            
    
    sites_count = {} 
    full_sites_count = {}
    
    
    
    with open(fname, 'w', encoding="utf8", newline='' ) as csvfile: 
      fieldnames = ['URL', 'URL Base', 'TXT']
      writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
      writer.writeheader()
    
      
      count_url=0
      for url in results:    
          print (url)
          full_url=url
          url = parse(url)
          
          if full_url in full_sites_count:
                full_sites_count[full_url] += 1
          else:
                full_sites_count[full_url] = 1
              
                # check the original url (not the parsed domain) for the pdf extension
                if full_url.endswith(".pdf"):
                      txt = get_txt_from_pdf(full_url)
                else:
                      txt = get_text(full_url)
                txt=clean_txt(txt)
                writer.writerow({'URL': full_url, 'URL Base': url, 'TXT': txt})
                time.sleep(4)
          
          
          
         
          if url in sites_count:
                sites_count[url] += 1
          else:
                sites_count[url] = 1
       
          count_url +=1
    

    References
    1. Analyze Chrome’s Browsing History with Python
    2. XpdfReader
    3. Python: Remove words from a string of length between 1 and a given number
    4. BeautifulSoup Grab Visible Webpage Text
    5. Web Scraping 101 with Python & Beautiful Soup
    6. Downloading Files Using Python (Simple Examples)
    7. Introduction to web scraping in Python
    8. Ultimate guide to deal with Text Data (using Python) – for Data Scientists and Engineers

    Twitter Text Mining with Python

    In this post (and a few following posts) we will look at how to get interesting information by extracting links from the results of a Twitter keyword search and using machine learning text mining. While there are many other posts on the same topic, we will also cover additional small steps needed to process the data, such as unshortening urls, setting a date interval, and saving or reading the information.

    Below we will focus on extracting links from the results of the Twitter search API in Python.

    Getting Login Information for Twitter API

    The first step is to set up an application on Twitter and get the login credentials. This is already described in some posts on the web [1].
    Below is the code snippet for this:

    import tweepy as tw
        
    CONSUMER_KEY ="xxxxx"
    CONSUMER_SECRET ="xxxxxxx"
    OAUTH_TOKEN = "xxxxx"
    OAUTH_TOKEN_SECRET = "xxxxxx"
    
    auth = tw.OAuthHandler(CONSUMER_KEY, CONSUMER_SECRET)
    auth.set_access_token(OAUTH_TOKEN, OAUTH_TOKEN_SECRET)
    api = tw.API(auth, wait_on_rate_limit=True)
    

    Defining the Search Values

    Now you can search by keywords or hashtags and get tweets.
    When we do a search we might want to specify a start day, so it will return only results dated on or after that day.

    For this we can use the following code:

    from datetime import datetime
    from datetime import timedelta
    
    NUMBER_of_TWEETS = 20
    SEARCH_BEHIND_DAYS=60
    today_date=datetime.today().strftime('%Y-%m-%d')
    
    
    today_date_datef = datetime.strptime(today_date, '%Y-%m-%d')
    start_date = today_date_datef - timedelta(days=SEARCH_BEHIND_DAYS)
    
    
    # search_terms is defined in the full source code below
    for search_term in search_terms:
      tweets = tw.Cursor(api.search,
                       q=search_term,
                       lang="en",
                       since=start_date.strftime('%Y-%m-%d')).items(NUMBER_of_TWEETS)
    

    The above search will return 20 tweets per search term and will only look within the last 60 days before the day of the search. If we want to use a fixed date we can replace the since value with '2019-12-01'.

    Processing Extracted Links

    Once we have the tweet text we can extract links. However, we will get different types of links: some are internal Twitter links, some are shortened, and some are regular urls.

    So here is the function to sort out the links. We do not need internal links – the links that belong to Twitter navigation or other functionality.

    try:
        import urllib.request as urllib2
    except ImportError:
        import urllib2
    
    
    import http.client
    import urllib.parse as urlparse   
    
    def unshortenurl(url):
        parsed = urlparse.urlparse(url) 
        h = http.client.HTTPConnection(parsed.netloc) 
        h.request('HEAD', parsed.path) 
        response = h.getresponse() 
        if response.status >= 300 and response.status < 400 and response.getheader('Location'):
            return response.getheader('Location') 
        else: return url 
    

    Once we have the links we can save the url information to a csv file. Together with the link we save the tweet text and the date.
    Additionally, we count the number of hashtags and links and also save this information into csv files. So the output of the program is 3 csv files.

    Conclusion

    Looking at the output file we can quickly identify the links of interest. For example, just while testing this script I found two interesting links that I was not aware of. In a following post we will look at how to automate finding such links even further using Twitter text mining.

    Below you can find the full source code and the references to web resources that were used for this post or are related to this topic.

    # -*- coding: utf-8 -*-
    
    import tweepy as tw
    import re
    import csv
    
    from datetime import datetime
    from datetime import timedelta
    
    NUMBER_of_TWEETS = 20
    SEARCH_BEHIND_DAYS=60
    today_date=datetime.today().strftime('%Y-%m-%d')
    
    
    today_date_datef = datetime.strptime(today_date, '%Y-%m-%d')
    start_date = today_date_datef - timedelta(days=SEARCH_BEHIND_DAYS)
    try:
        import urllib.request as urllib2
    except ImportError:
        import urllib2
    
    
    import http.client
    import urllib.parse as urlparse   
    
    def unshortenurl(url):
        parsed = urlparse.urlparse(url) 
        h = http.client.HTTPConnection(parsed.netloc) 
        h.request('HEAD', parsed.path) 
        response = h.getresponse() 
        if response.status >= 300 and response.status < 400 and response.getheader('Location'):
            return response.getheader('Location') 
        else: return url    
        
        
    CONSUMER_KEY ="xxxxx"
    CONSUMER_SECRET ="xxxxxxx"
    OAUTH_TOKEN = "xxxxxxxx"
    OAUTH_TOKEN_SECRET = "xxxxxxx"
    
    
    auth = tw.OAuthHandler(CONSUMER_KEY, CONSUMER_SECRET)
    auth.set_access_token(OAUTH_TOKEN, OAUTH_TOKEN_SECRET)
    api = tw.API(auth, wait_on_rate_limit=True)
    # Create a custom search term 
    
    search_terms=["#chatbot -filter:retweets", 
                  "#chatbot+machine_learning -filter:retweets", 
                  "#chatbot+python -filter:retweets",
                  "text classification -filter:retweets",
                  "text classification python -filter:retweets",
                  "machine learning applications -filter:retweets",
                  "sentiment analysis python  -filter:retweets",
                  "sentiment analysis  -filter:retweets"]
                  
            
                  
    def count_urls():
           url_counted = dict() 
           url_count = dict()
           with open('data.csv', 'r', encoding="utf8" ) as csvfile: 
               line = csvfile.readline()
               while line != '':  # The EOF char is an empty string
                
                   line = csvfile.readline()
                   items=line.split(",")
                   if len(items) < 3 :
                              continue
                               
                   url=items[1]
                   twt=items[2]
                   # key =  Tweet and Url
                   key=twt[:30] + "___" + url
                   
                   if key not in url_counted:
                          url_counted[key]=1
                          if url in url_count:
                               url_count[url] += 1
                          else:
                               url_count[url] = 1
           print_count_urls(url_count)             
    
           
    def print_count_urls(url_count_data):
       
             for key, value in url_count_data.items():
                  print (key, "=>", value)
                  
             with open('data_url_count.csv', 'w', encoding="utf8", newline='' ) as csvfile_link_count: 
                fieldnames = ['URL', 'Count']
                writer = csv.DictWriter(csvfile_link_count, fieldnames=fieldnames)
                writer.writeheader() 
                
                for key, value in url_count_data.items():
                     writer.writerow({'URL': key, 'Count': value })   
                
               
    def extract_hash_tags(s):
        return set(part[1:] for part in s.split() if part.startswith('#'))
        
    
       
    def save_tweet_info(tw, twt_dict, htags_dict ):
       
        if tw not in twt_dict:
            htags=extract_hash_tags(tw)
            twt_dict[tw]=1
            for ht in htags:
                if ht in htags_dict:
                    htags_dict[ht]=htags_dict[ht]+1
                else:   
                    htags_dict[ht]=1
    
    
    def print_count_hashtags(htags_count_data):
            
             for key, value in htags_count_data.items():
                  print (key, "=>", value)
                  
             with open('data_htags_count.csv', 'w', encoding="utf8", newline='' ) as csvfile_link_count: 
                fieldnames = ['Hashtag', 'Count']
                writer = csv.DictWriter(csvfile_link_count, fieldnames=fieldnames)
                writer.writeheader() 
                
                for key, value in htags_count_data.items():
                     writer.writerow({'Hashtag': key, 'Count': value })          
            
    
    
    tweet_dict = dict() 
    hashtags_dict = dict()
    
                     
    for search_term in search_terms:
      tweets = tw.Cursor(api.search,
                       q=search_term,
                       lang="en",
                       #since='2019-12-01').items(40)
                       since=start_date.strftime('%Y-%m-%d')).items(NUMBER_of_TWEETS)
    
      with open('data.csv', 'a', encoding="utf8", newline='' ) as csvfile: 
         fieldnames = ['Search', 'URL', 'Tweet', 'Entered on']
         writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
         writer.writeheader()
         
    
         for tweet in tweets:
             urls = re.findall('http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+', tweet.text)
       
             save_tweet_info(tweet.text, tweet_dict, hashtags_dict ) 
             for url in urls:
              try:
                res = urllib2.urlopen(url)
                actual_url = res.geturl()
             
                if ( ("https://twitter.com" in actual_url) == False):
                    
                    if len(actual_url) < 32:
                        actual_url =unshortenurl(actual_url) 
                    print (actual_url)
                  
                    writer.writerow({'Search': search_term, 'URL': actual_url, 'Tweet': tweet.text, 'Entered on': today_date })
                  
              except:
                  print (url)    
    
                
    print_count_hashtags(hashtags_dict)
    count_urls()      
    

    References

    1. Text mining: Twitter extraction and stepwise guide to generate a word cloud
    2. Analyze Word Frequency Counts Using Twitter Data and Tweepy in Python
    3. unshorten-url-in-python-3
    4. how-can-i-un-shorten-a-url-using-python
    5. extracting-external-links-from-tweets-in-python

    Document Similarity in Machine Learning Text Analysis with ELMo

    In this post we will look at using ELMo for computing similarity between text documents. ELMo is one of the word embedding techniques that are widely used now. In the previous post we used TF-IDF for calculating text document similarity; TF-IDF is based on word frequency counting. Both techniques can be used for converting text to numbers in information retrieval and machine learning algorithms.

    ELMo

    A good tutorial that explains how ELMo works and how it is built is Deep Contextualized Word Representations with ELMo.
    Another resource is at ELMo.

    We will, however, focus on the practical side of computing similarity between text documents with ELMo. Below is the code to accomplish this task. To compute ELMo embeddings I used the function from the Analytics Vidhya machine learning post at learn-to-use-elmo-to-extract-features-from-text/

    We will use the cosine_similarity function from sklearn to calculate similarity between numeric vectors. It computes the cosine similarity between samples in X and Y as the normalized dot product of X and Y.
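    For intuition, here is a tiny sketch that computes the same normalized dot product by hand with numpy for two made-up vectors; cosine_similarity returns the same value.

    import numpy as np

    x = np.array([1.0, 2.0, 0.0])
    y = np.array([2.0, 1.0, 1.0])

    # cosine similarity = dot(x, y) / (||x|| * ||y||)
    cos_sim = np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))
    print(cos_sim)  # ~0.73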

    # -*- coding: utf-8 -*-
    
    from sklearn.metrics.pairwise import cosine_similarity
    
    import tensorflow_hub as hub
    import tensorflow as tf
    
    elmo = hub.Module("https://tfhub.dev/google/elmo/2", trainable=True)
    
    
    def elmo_vectors(x):
      
      embeddings=elmo(x, signature="default", as_dict=True)["elmo"]
     
      with tf.Session() as sess:
        sess.run(tf.global_variables_initializer())
        sess.run(tf.tables_initializer())
        # return average of ELMo features
        return sess.run(tf.reduce_mean(embeddings,1))
    

    Our input data will be the same as in the previous post for TF-IDF: a collection of sentences as an array. So each document here is represented by just one sentence.

    corpus=["I'd like an apple juice",
                                "An apple a day keeps the doctor away",
                                 "Eat apple every day",
                                 "We buy apples every week",
                                 "We use machine learning for text classification",
                                 "Text classification is subfield of machine learning"]
    
    

    Below we compute the ELMo embedding for each document and create a matrix for the whole collection. If we print elmo_embeddings for i=0 we get the embedding vector [ 0.02739557 -0.1004054 0.12195794 … -0.06023929 0.19663551 0.3809018 ], which is the numeric representation of the first document.

    elmo_embeddings=[]
    print (len(corpus))
    for i in range(len(corpus)):
        print (corpus[i])
        elmo_embeddings.append(elmo_vectors([corpus[i]])[0])
       
    

    Finally we can print the embeddings and the similarity matrix:

    print ( elmo_embeddings)
    print(cosine_similarity(elmo_embeddings, elmo_embeddings))
    
    
    
    [array([ 0.02739557, -0.1004054 ,  0.12195794, ..., -0.06023929,
            0.19663551,  0.3809018 ], dtype=float32), array([ 0.08833811, -0.21392687, -0.0938901 , ..., -0.04924499,
            0.08270906,  0.25595033], dtype=float32), array([ 0.45237526, -0.00928468,  0.5245862 , ...,  0.00988374,
           -0.03330074,  0.25460464], dtype=float32), array([-0.14745474, -0.25623208,  0.20231596, ..., -0.11443609,
           -0.03759   ,  0.18829307], dtype=float32), array([-0.44559947, -0.1429281 , -0.32497618, ...,  0.01917108,
           -0.29726124, -0.02022664], dtype=float32), array([-0.2502797 ,  0.09800234, -0.1026585 , ..., -0.22239089,
            0.2981896 ,  0.00978719], dtype=float32)]
    
    
    
    The similarity matrix computed as :
    [[0.9999998  0.609864   0.574287   0.53863835 0.39638174 0.35737067]
     [0.609864   0.99999976 0.6036072  0.5824003  0.39648792 0.39825168]
     [0.574287   0.6036072  0.9999998  0.7760986  0.3858403  0.33461633]
     [0.53863835 0.5824003  0.7760986  0.9999995  0.4922789  0.35490626]
     [0.39638174 0.39648792 0.3858403  0.4922789  0.99999976 0.73076516]
     [0.35737067 0.39825168 0.33461633 0.35490626 0.73076516 1.0000002 ]]
    

    Now we can compare this similarity matrix with the matrix obtained with TF-IDF in the previous post. Obviously they are different.

    Thus, we calculated similarity between text documents using ELMo. This post and the previous post about using TF-IDF for the same task are good machine learning exercises, because converting text to numbers and computing document similarity are needed in many algorithms of information retrieval, data science and machine learning.

    Document Similarity in Machine Learning Text Analysis with TF-IDF

    Despite the appearance of new word embedding techniques for converting textual data into numbers, TF-IDF can still often be found in many articles and blog posts about information retrieval, user modeling, text classification algorithms, text analytics (for example, extracting top terms) and other text mining techniques.

    In this text we will look at what TF-IDF is, how we can calculate it, how to retrieve the calculated values in different formats, and how we compute similarity between two text documents using the TF-IDF technique.

    tf–idf, or term frequency–inverse document frequency, is a numerical statistic that is intended to reflect how important a word is to a document in a collection or corpus. The tf–idf value increases proportionally to the number of times a word appears in the document and is offset by the number of documents in the corpus that contain the word, which helps to adjust for the fact that some words appear more frequently in general. [1]
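    To make the definition concrete, below is a small sketch of the classic (unsmoothed) formula tf-idf(t, d) = tf(t, d) * log(N / df(t)) on a toy corpus. Note that scikit-learn's TfidfVectorizer, used later in this post, applies a smoothed idf and L2 normalization by default, so its numbers will differ.

    import math

    docs = [["apple", "juice"], ["apple", "day"], ["machine", "learning"]]
    term, doc = "apple", docs[0]

    tf = doc.count(term) / len(doc)          # term frequency within the document
    df = sum(1 for d in docs if term in d)   # number of documents containing the term
    idf = math.log(len(docs) / df)           # inverse document frequency
    print(tf * idf)                          # 0.5 * log(3/2) ≈ 0.20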

    Here we will look at how we can convert a corpus of text documents to numbers and how we can use the above technique for computing document similarity.

    We will use sklearn.feature_extraction.text.TfidfVectorizer from the Python scikit-learn library for calculating tf-idf. TfidfVectorizer converts a collection of raw documents to a matrix of TF-IDF features.

    We only need to provide the text documents as input; all other input parameters are optional and have default values or are set to None. [2]

    Here is the list of parameters from the documentation:

    TfidfVectorizer(input='content', encoding='utf-8', decode_error='strict', strip_accents=None, lowercase=True,
    preprocessor=None, tokenizer=None, analyzer='word', stop_words=None, token_pattern='(?u)\b\w\w+\b',
    ngram_range=(1, 1), max_df=1.0, min_df=1, max_features=None, vocabulary=None, binary=False, dtype=np.float64,
    norm='l2', use_idf=True, smooth_idf=True, sublinear_tf=False)

    Each of our text documents will be represented by just one sentence, and all documents will be passed in via the array corpus.
    The code below demonstrates how to get the document similarity matrix.

    # -*- coding: utf-8 -*-
    
    from sklearn.feature_extraction.text import TfidfVectorizer
    
    from sklearn.metrics.pairwise import cosine_similarity
    import pandas as pd
    
    corpus=["I'd like an apple juice",
                                "An apple a day keeps the doctor away",
                                 "Eat apple every day",
                                 "We buy apples every week",
                                 "We use machine learning for text classification",
                                 "Text classification is subfield of machine learning"]
    
    vect = TfidfVectorizer(min_df=1)
    tfidf = vect.fit_transform(corpus)
    print ((tfidf * tfidf.T).A)
    
    
    """
    [[1.         0.2688172  0.16065234 0.         0.         0.        ]
     [0.2688172  1.         0.28397982 0.         0.         0.        ]
     [0.16065234 0.28397982 1.         0.19196066 0.         0.        ]
     [0.         0.         0.19196066 1.         0.13931166 0.        ]
     [0.         0.         0.         0.13931166 1.         0.48695659]
     [0.         0.         0.         0.         0.48695659 1.        ]]
    """ 
    

    We can print all our features, or the values of the features for a specific document. In our example a feature is a single word, but it can also be 2 or more words:

    print(vect.get_feature_names())
    #['an', 'apple', 'apples', 'away', 'buy', 'classification', 'day', 'doctor', 'eat', 'every', 'for', 'is', 'juice', 'keeps', 'learning', 'like', 'machine', 'of', 'subfield', 'text', 'the', 'use', 'we', 'week']
    print(tfidf.shape)
    #(6, 24)
    
    
    print (tfidf[0])
    """
      (0, 15)	0.563282410145744
      (0, 0)	0.46189963418608976
      (0, 1)	0.38996740989416023
      (0, 12)	0.563282410145744
    """  
    

    We can load the features into a dataframe and print them from the dataframe in several ways:

    df=pd.DataFrame(tfidf.toarray(), columns=vect.get_feature_names())
    
    print (df)
    
    """
             an     apple    apples    ...          use        we      week
    0  0.461900  0.389967  0.000000    ...     0.000000  0.000000  0.000000
    1  0.339786  0.286871  0.000000    ...     0.000000  0.000000  0.000000
    2  0.000000  0.411964  0.000000    ...     0.000000  0.000000  0.000000
    3  0.000000  0.000000  0.479748    ...     0.000000  0.393400  0.479748
    4  0.000000  0.000000  0.000000    ...     0.431849  0.354122  0.000000
    5  0.000000  0.000000  0.000000    ...     0.000000  0.000000  0.000000
    """
    
    with pd.option_context('display.max_rows', None, 'display.max_columns', None):   
        print(df)
    
    """
         doctor       eat     every       for        is     juice     keeps  \
    0  0.000000  0.000000  0.000000  0.000000  0.000000  0.563282  0.000000   
    1  0.414366  0.000000  0.000000  0.000000  0.000000  0.000000  0.414366   
    2  0.000000  0.595054  0.487953  0.000000  0.000000  0.000000  0.000000   
    3  0.000000  0.000000  0.393400  0.000000  0.000000  0.000000  0.000000   
    4  0.000000  0.000000  0.000000  0.431849  0.000000  0.000000  0.000000   
    5  0.000000  0.000000  0.000000  0.000000  0.419233  0.000000  0.000000   
    
       learning      like   machine        of  subfield      text       the  \
    0  0.000000  0.563282  0.000000  0.000000  0.000000  0.000000  0.000000   
    1  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.414366   
    2  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000   
    3  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000   
    4  0.354122  0.000000  0.354122  0.000000  0.000000  0.354122  0.000000   
    5  0.343777  0.000000  0.343777  0.419233  0.419233  0.343777  0.000000   
    
            use        we      week  
    0  0.000000  0.000000  0.000000  
    1  0.000000  0.000000  0.000000  
    2  0.000000  0.000000  0.000000  
    3  0.000000  0.393400  0.479748  
    4  0.431849  0.354122  0.000000  
    5  0.000000  0.000000  0.000000  
    
    """    
    # this also prints the dataframe, but not as nicely as above    
    print(df.to_string())    
    
    
    
    print ("Second document (row)")
    print (df.iloc[1])
    """
    an                0.339786
    apple             0.286871
    apples            0.000000
    away              0.414366
    buy               0.000000
    classification    0.000000
    day               0.339786
    doctor            0.414366
    eat               0.000000
    every             0.000000
    for               0.000000
    is                0.000000
    juice             0.000000
    keeps             0.414366
    learning          0.000000
    like              0.000000
    machine           0.000000
    of                0.000000
    subfield          0.000000
    text              0.000000
    the               0.414366
    use               0.000000
    we                0.000000
    week              0.000000
    """
    print ("Second Column only values (without keys");
    print (df.iloc[1].values)
    
    """
    [0.33978594 0.28687063 0.         0.41436586 0.         0.
     0.33978594 0.41436586 0.         0.         0.         0.
     0.         0.41436586 0.         0.         0.         0.
     0.         0.         0.41436586 0.         0.         0.        ]
    """ 
    

    Finally we can compute the document similarity matrix using cosine_similarity. We get the same matrix that we got in the beginning using just ((tfidf * tfidf.T).A), because TfidfVectorizer L2-normalizes each row by default, so the plain dot product of the rows already equals their cosine similarity.

    print(cosine_similarity(df.values, df.values))
    
    """
    [[1.         0.2688172  0.16065234 0.         0.         0.        ]
     [0.2688172  1.         0.28397982 0.         0.         0.        ]
     [0.16065234 0.28397982 1.         0.19196066 0.         0.        ]
     [0.         0.         0.19196066 1.         0.13931166 0.        ]
     [0.         0.         0.         0.13931166 1.         0.48695659]
     [0.         0.         0.         0.         0.48695659 1.        ]]
    """ 
    
    print ("Number of docs in corpus")
    print (len(corpus))
    

    So in this post we learned how to use tf-idf with sklearn, get the values in different formats, load them into a dataframe and calculate the document similarity matrix using either the raw tfidf values or the cosine_similarity function from sklearn.metrics.pairwise. These techniques can be used in machine learning text analysis, information retrieval, text mining and many other areas where we need to convert textual data into numeric data (features).

    References
    1. Tf-idf – Wikipedia
    2. TfidfVectorizer

    7+ Best Online Resources for Text Preprocessing for Machine Learning Algorithms

    With the advance of machine learning and natural language processing, and with the increasing amount of information available on the web, the use of text data in machine learning algorithms is growing. An important step in using text data is preprocessing the original raw text. The data preparation steps may include the following (a minimal example covering a few of these steps follows the list):

    • Tokenization
    • Removing punctuation
    • Removing stop words
    • Stemming
    • Word Embedding
    • Named-entity recognition (NER)
    • Coreference resolution – finding all expressions that refer to the same entity in a text
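    Here is a minimal example of a few of these steps (tokenization, punctuation and stop word removal, stemming) with NLTK; it assumes the punkt and stopwords resources are downloaded, and the sample sentence is made up.

    import nltk
    from nltk.corpus import stopwords
    from nltk.stem import PorterStemmer

    nltk.download('punkt')       # tokenizer model
    nltk.download('stopwords')   # stop word list

    text = "The houses are very small, but the city is nice!"
    tokens = nltk.word_tokenize(text.lower())                          # tokenization
    words = [t for t in tokens if t.isalpha()]                         # remove punctuation
    words = [w for w in words if w not in stopwords.words('english')]  # remove stop words
    stems = [PorterStemmer().stem(w) for w in words]                   # stemming
    print(stems)   # ['hous', 'small', 'citi', 'nice']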

    Recently created articles on this topic have greatly expanded the examples of text preprocessing operations. In this post we collect and review online articles that describe text preprocessing techniques with Python code examples.

    1. textcleaner

    Text-Cleaner is a utility library for text-data pre-processing. It can be used before passing the text data to a model. textcleaner builds on open source projects such as NLTK (for advanced cleaning) and regular expressions (for pattern-based cleaning).

    Features:

    • main_cleaner does all the below in one call
    • remove unnecessary blank lines
    • transfer all characters to lowercase if needed
    • remove numbers, particular characters (if needed), symbols and stop-words from the whole text
    • tokenize the text-data in one call
    • stemming & lemmatization powered by NLTK

    textcleaner saves time by providing basic cleaning functionality and letting the developer focus on building the machine learning model. The nice thing is that it can do many text processing steps in one call.

    Here is an example of how to use it:

    import textcleaner as tc
    
    f="C:\\textinputdata.txt"
    out=tc.main_cleaner(f)
    print (out)
    
    """
    input text:
    The house235 is very small!!
    the city is nice.
    I was in that city 10 days ago.
    The city2 is big.
    
    
    output text:
    [['hous', 'small'], ['citi', 'nice'], ['citi', 'day', 'ago'], ['citi', 'big']]
    """
    

    2. Guide for Text Preprocessing from Analytics Vidhya

    Analytics Vidhya regularly provides great practical resources about AI, ML and analytics. In this 'Ultimate guide to deal with Text Data' you can find descriptions of text preprocessing steps with Python code. Different Python libraries are utilized for solving text preprocessing tasks:

    NLTK – for stop lists, stemming

    TextBlob – for spelling correction, tokenization, lemmatization (see the short sketch after this list). TextBlob is a Python library for processing textual data. It provides a simple API for diving into common natural language processing (NLP) tasks such as part-of-speech tagging, noun phrase extraction, sentiment analysis, classification, translation, and more.

    gensim – for word embeddings

    sklearn – for feature_extraction with TF-IDF
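    As an illustration of the TextBlob API mentioned above, here is a tiny sketch (assuming TextBlob and its corpora are installed); the sample strings are made up.

    from textblob import TextBlob, Word

    blob = TextBlob("I havv goood speling")
    print(blob.correct())              # spelling correction -> "I have good spelling"
    print(blob.words)                  # tokenization -> ['I', 'havv', 'goood', 'speling']
    print(Word("cities").lemmatize())  # lemmatization -> "city"
    print(blob.sentiment)              # (polarity, subjectivity)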

    The guide covers text processing steps from basic to advanced.
    Basic steps:

    • Lower casing
    • Punctuation, stopwords, frequent and rare words removal
    • Spelling correction
    • Tokenization
    • Stemming
    • Lemmatization

    Advanced Text Processing

    • N-grams
    • Term, Inverse Document Frequency
    • Term Frequency-Inverse Document Frequency (TF-IDF)
    • Bag of Words
    • Sentiment Analysis
    • Word Embedding

    3. Guide to Natural Language Processing  

    Often we extract text data from the web and need to strip out HTML before feeding it to ML algorithms.
    Dipanjan (DJ) Sarkar in his post 'A Practitioner's Guide to Natural Language Processing (Part I) — Processing & Understanding Text' shows how to do this.

    Here we can find a project for downloading html text with the BeautifulSoup Python library, extracting useful text from html, doing parts-of-speech analysis, sentiment analysis and NER.
    In this post we can find the following text processing Python libraries for machine learning:
    spacy – spaCy now features new neural models for tagging, parsing and entity recognition (in v2.0)
    nltk – a leading platform for building Python programs for natural language processing.

    Basic text preprocessing steps covered:

    • Removing HTML tags
    • Removing accented characters, Special Characters, Stopwords
    • Expanding Contractions
    • Stemming
    • Lemmatization

    In addition to the above basic steps, the guide also covers parsing techniques for understanding the structure and syntax of language, which include

    • Parts of Speech (POS) Tagging
    • Shallow Parsing or Chunking
    • Constituency Parsing
    • Dependency Parsing
    • Named Entity Recognition

    4. Natural Language Processing

    In this article 'Natural Language Processing is Fun' you will find descriptions of the text pre-processing steps:

    • Sentence Segmentation
    • Word Tokenization
    • Predicting Parts of Speech for Each Token
    • Text Lemmatization
    • Identifying Stop Words
    • Dependency Parsing
    • Named Entity Recognition (NER)
    • Coreference Resolution

    The article explains thoroughly how computers understand textual data by dividing text processing into the above steps. Diagrams make the concepts very easy to understand. The steps above constitute a natural language processing text pipeline, and it turns out that with spaCy you can do most of them in only a few lines.

    Here is an example of using spaCy:

    import spacy
    
    # Load the large English NLP model
    nlp = spacy.load('en_core_web_lg')
    
    
    f="C:\\Users\\pythonrunfiles\\textinputdata.txt"
    
    with open(f) as ftxt:
         text = ftxt.read()
         
    print (text)     
    
    
    # Parse the text with spaCy.
    doc = nlp(text)
    
    
    # print the text of each token
    for token in doc:
        print(token.text)
    
    
    # print token attributes: lemma, part of speech, tag, dependency, shape and flags
    for token in doc:
        print(token.text, token.lemma_, token.pos_, token.tag_, token.dep_,
              token.shape_, token.is_alpha, token.is_stop)
    
    
    # print named entities with their character offsets and labels
    for ent in doc.ents:
        print(ent.text, ent.start_char, ent.end_char, ent.label_)
    
    
    
    Partial output of the above program:
    ....
    I
    was
    in
    that
    city
    10
    days
    ago
    .
    ....
    I -PRON- PRON PRP nsubj X True False
    was be VERB VBD ROOT xxx True False
    in in ADP IN prep xx True False
    that that DET DT det xxxx True False
    city city NOUN NN pobj xxxx True False
    10 10 NUM CD nummod dd False False
    days day NOUN NNS npadvmod xxxx True False
    ago ago ADV RB advmod xxx True False
    . . PUNCT . punct . False False
    ....
    10 days ago 66 77 DATE
    

    5. Learning from Text Summarization Project

    This is the project ‘Text Summarization with Amazon Reviews’, where the reviews are about food; its first part covers text preprocessing steps. The preprocessing steps include converting to lowercase, replacing contractions with their longer forms, and removing unwanted characters.

    For removing contractions the author uses a list of contractions from Stack Overflow:
    http://stackoverflow.com/questions/19790188/expanding-english-language-contractions-in-python
    Using the list and the code from this link, we can replace, for example:
    you’ve with you have
    she’s with she is
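
    Here is a minimal sketch of that replacement idea (only a tiny, made-up subset of the full contractions list from the link is shown):

    import re

    contractions = {
        "you've": "you have",
        "she's": "she is",
        "can't": "cannot",
        "won't": "will not",
    }

    def expand_contractions(text, mapping=contractions):
        pattern = re.compile('({})'.format('|'.join(map(re.escape, mapping))), flags=re.IGNORECASE)
        def replace(match):
            return mapping.get(match.group(0).lower(), match.group(0))
        return pattern.sub(replace, text)

    print(expand_contractions("You've seen the reviews and she's impressed"))
    # -> "you have seen the reviews and she is impressed"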

    6. Text Preprocessing Methods for Deep Learning

    This is a primer on word2vec embeddings, but it includes basic preprocessing techniques for text data such as:

    • Cleaning Special Characters and Removing Punctuations
    • Cleaning Numbers
    • Removing Misspells
    • Removing Contractions
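
    For illustration, here is a small regex-based sketch (my own, not the author's code) of the first two of these steps, cleaning special characters and cleaning numbers:

    import re

    def clean_special_chars(text):
        # replace everything that is not a letter, digit or whitespace with a space
        return re.sub(r'[^A-Za-z0-9\s]', ' ', text)

    def clean_numbers(text):
        # drop standalone numbers (one common choice before training embeddings)
        return re.sub(r'\b\d+\b', ' ', text)

    sample = "The city2 is big! I was there 10 days ago..."
    print(' '.join(clean_numbers(clean_special_chars(sample)).split()))
    # -> "The city2 is big I was there days ago"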

    7. Text Preprocessing in Python

    This is another great resource about text preprocessing steps with Python. In addition to basic steps, we can find here how to do collocation extraction, relationship extraction and NER. The paper has many links to other articles on text preprocessing techniques.

    This paper also has a comparison of many different natural language processing toolkits, such as NLTK and spaCy, by features, programming language and license. The table includes links to each toolkit's project page. So it is very handy information where you can find descriptions of text processing steps, the tools used, usage examples and links to many other resources.

    Conclusion

    The above resources show how to perform textual data preprocessing from basic steps to advanced ones, with different Python libraries. Below you can find the above links and a few more links to resources on the same topic.
    Feel free to provide feedback, comments, or links to resources that are not mentioned here.

    References

    1. textcleaner
    2. Ultimate guide to deal with Text Data (using Python) – for Data Scientists & Engineers
    3. A Practitioner’s Guide to Natural Language Processing (Part I) — Processing & Understanding Text
    4. Natural Language Processing is Fun
    5. Text Summarization with Amazon Reviews
    6. NLP Learning Series: Text Preprocessing Methods for Deep Learning
    7. Text Preprocessing in Python: Steps, Tools, and Examples
    8. Text Data Preprocessing: A Walkthrough in Python
    9. Text Preprocessing, Keras Documentation
    10. What is the best way to remove accents in a Python unicode string?
    11. PREPROCESSING DATA FOR NLP
    12. Processing Raw Text
    13. TextBlob: Simplified Text Processing

    Chatbots Examples with ChatterBot – How to Add Logic

    In the previous post, How to Create a Chatbot with ChatBot Open Source and Deploy It on the Web, I wrote how to deploy ChatterBot on the pythonanywhere hosting site with the Django web framework. In this post we will look at a few useful chatbot examples for implementing logic in our chatbot. This chatbot was developed in the previous post and is based on the ChatterBot Python library.

    Making Chatbot Start Conversation with Specific Question

    Suppose we want to start the conversation with a specific sentence that the chatbot needs to show. For example, when the user opens the website, the chatbot can start with the specific question: “Did you find what you were looking for?”
    Or, if you are building a chatbot for conversations about the previous day's or week's work, you probably want to start with “How was your previous day/week in terms of progress toward your goal?” How can we do this with the ChatterBot framework?

    Conversation diagram

    It turns out that ChatterBot has several logic adapters that allow building the conversation according to different requirements.

    Here is how I used the SpecificResponseAdapter logic adapter to make the chatbot start with an initial predefined question:

     chatbot = ChatBot("mybot",
           logic_adapters=[
           {
                'import_path': 'chatterbot.logic.SpecificResponseAdapter',
                'input_text': 'prev_day_disk',
                'output_text': 'How much did you do toward your goal on previous day?'
            }
            .....
    

    In views.py, which was created in the previous post [1], I pass the input “prev_day_disk” to runbot instead of a blank string. At the beginning of the chat there is no user input, so I use this value to match input_text and get the desired output as specified in output_text.

    # in addition to the imports from the previous post, uuid is needed here
    import uuid

    def press_my_buttons(request):
        resp=""
        conv=""
        if request.POST:
            conv=request.POST.get('conv', '')
            user_input=request.POST.get('user_input', '')
    
            userid=request.POST.get('userid', '')
            if (userid == ""):
                userid=uuid.uuid4()
    
            resp=runbot(user_input, request, userid)
    
         
            conv=conv + "" + str(user_input) + "\n" + "BOT:"+ str(resp) + "\n"
        else:
            resp=runbot("prev_day_disk", request, "")
            conv =  "BOT:"+ str(resp) + "\n";
       
        return render(request, 'my_template.html', {'conv': conv })
    

    SpecificResponseAdapter can also be used in other places of the conversation (not just at the beginning). For example, we could use a criterion such as: if there is no input from the user for 15 seconds and the user is not typing anything (I am not sure yet how easy it is to check whether the user is typing), then switch the conversation to a new topic by making the chatbot app send a new question.

    How to Add Intelligence to Chatbot App

    After the user replies to the question about how his/her week was, I want the chatbot to be able to recognize the response as belonging to one of 3 groups: bad, so-so, good. This is a machine learning classification problem. However, here we will not create a text classification algorithm; instead we will use built-in functionality.

    We will use another logic adapter, called BestMatch.
    With this adapter we need to specify statement_comparison_function and response_selection_method:

    chatbot = ChatBot("mybot",
           logic_adapters=[
           {
                'import_path': 'chatterbot.logic.SpecificResponseAdapter',
                'input_text': 'How much did you do toward your goal on previous day?',
                'output_text': 'How much did you do toward your goal on previous day?'
            },
            {
                "import_path": "chatterbot.logic.BestMatch",
                "statement_comparison_function": "chatterbot.comparisons.levenshtein_distance",
                "response_selection_method": "chatterbot.response_selection.get_first_response"
            }
        ])
    

    The Best Match Adapter is a logic adapter that returns a response based on known responses to the closest matches to the input statement. [2]

    The best match adapter uses a function to compare the input statement to known statements. Once it finds the closest match to the input statement, it uses another function to select one of the known responses to that statement.
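
    To make the idea concrete, here is a purely conceptual sketch (not ChatterBot's actual implementation) of that two-step logic, using difflib's SequenceMatcher as a stand-in for the Levenshtein comparison function:

    from difflib import SequenceMatcher

    # known statements mapped to their known responses
    known = {
        "I did not do much this week":
            "Did you run into the problems with programs or just did not have time?",
        "I did a lot of progress":
            "Fantastic! Keep going on",
    }

    def similarity(a, b):
        # stand-in for a comparison function such as Levenshtein distance
        return SequenceMatcher(None, a.lower(), b.lower()).ratio()

    def best_match_response(user_input):
        # step 1: find the closest known statement; step 2: return a known response to it
        closest = max(known, key=lambda statement: similarity(user_input, statement))
        return known[closest]

    print(best_match_response("did not do much"))
    # -> "Did you run into the problems with programs or just did not have time?"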

    To use this adapter for the above example I need to create at minimum one sample per group. In the example below I used the following for testing the chatbot on 2 groups (so-so was skipped).

     if (train_bot == True):
      chatbot.train([
        "I did not do much this week",
        "Did you run into the problems with programs or just did not have time?"
      ])
    
    
      chatbot.train([
        "I did a lot of progress",
        "Fantastic! Keep going on"
      ])
    

    After the training, if the user enters something close to “I did not do much this week” the chatbot will respond with “Did you run into the problems with programs or just did not have time?”, and if the user enters something like “Did a lot of progress” the bot response will be “Fantastic! Keep going on”, even if the input is slightly different from the training data.
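
    For example, assuming the chatbot object and the training calls shown above, the matching can be checked like this (the expected outputs are approximate, since they depend on the trained data):

    response = chatbot.get_response("Did a lot of progress")
    print(response)
    # expected (approximately): Fantastic! Keep going on

    response = chatbot.get_response("I didn't do much this week")
    print(response)
    # expected (approximately): Did you run into the problems with programs or just did not have time?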

    So we looked at how to build a chatbot with logic that makes the chatbot able to ask questions as needed, or classify user input into several buckets and respond to it accordingly. The code for this post is provided at the link listed below [3].

    References
    1. How to Create a Chatbot with ChatBot Open Source and Deploy It on the Web
    2. ChatterBot – Logic
    3. Python Chatterbot Example with Added Logic – source code

    How to Create a Chatbot with ChatBot Open Source and Deploy It on the Web

    Chatbots have become very popular due to progress in AI, ML and NLP. They are now used on many websites. With the increased popularity of chatbots there are many different frameworks for creating a chatbot. We will explore one such framework in this post. We will review how to create a chatbot based on the open source ChatterBot library and deploy it online. Our platform will be Django on pythonanywhere.

    This chatbot tutorial introduces the basic concepts and terms needed to understand and deploy ChatterBot, and provides a simple usage example.

    What is ChatterBot?

    ChatterBot is a Python library that makes it easy to generate automated responses to a user’s input. To produce different types of responses, ChatterBot uses a selection of machine learning algorithms such as search and/or classification algorithms. This makes it easy for developers to create chat bots and automate conversations with users. [1]

    To install this open source chatbot library you just need to run:
    pip install chatterbot

    ChatterBot Features

    The functionality of ChatterBot is provided through adapters – pluggable classes that allow a ChatBot instance to execute some kind of functionality.

    ChatterBot has the following groups of adapters:

  • input adapters
  • output adapters
  • storage adapters
  • logic adapters

    Inside each group there are several adapters that support different functionality. For example, within the logic adapters we can use the Best Match Adapter, the Time Logic Adapter and a few others.

    Here is an example of how to run the chatbot. A minimal sketch is shown first, and then the output of running the full example code:
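
    Here is a minimal run sketch that follows the same ChatterBot conventions used in the deployment code later in this post (ListTrainer-style training); the training pairs are just a few of the exchanges seen in the output:

    from chatterbot import ChatBot

    chatbot = ChatBot("mychatbot",
                      trainer='chatterbot.trainers.ListTrainer')

    # each list is a short statement/response pair
    chatbot.train(["How are you doing?", "I am fine"])
    chatbot.train(["Hello", "Hi there!"])
    chatbot.train(["Do you like machine learning?", "Yes, I like machine learning"])

    for user_input in ["How are you doing?", "Hello", "Do you like machine learning?"]:
        print("USER: " + user_input)
        print("BOT:" + str(chatbot.get_response(user_input)))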

    Result of program run:
    USER: How are you doing?
    BOT:I am fine
    USER: Hello
    BOT:Hi there!
    USER: Good morning!
    BOT:How are today!
    USER: Do you like machine learning?
    BOT:Yes, I like machine learning
    USER: How do I make a neural network?
    BOT:I am sorry, but I do not understand.
    USER: Let us talk about current activities
    BOT:What are working now?
    USER: I am just browsing Internet for news
    BOT:What a waste of time! Dont you have any other things to do?
    USER: I am working on python script to make new chatbot
    BOT:This is great. Keep working on this
    USER: Bye
    BOT:Bye
    

    During the testing I confirmed that the chatbot correctly responds to new (but similar) inputs, even if we did not train on those exact phrases.

    Once you have trained the bot, the result of the training persists even after turning the PC off and on. So if you run the program multiple times, you only need to run the training the first time. You can still run the training later, for example if you want to retrain or update the Python chatbot code.

    Deployment

    This section describes how to deploy ChatterBot on pythonanywhere web hosting site with Django programming web framework.

    PythonAnywhere is an online integrated development environment (IDE) and web hosting service based on the Python programming language. [3] It has a free account that allows us to deploy our chatbot. Django is needed to connect the web front-end with our Python code for ChatterBot on the backend. Other web frameworks like Flask can be used on pythonanywhere instead of Django.

    Below is a diagram showing the setup that will be described.

    Chatbot online diagram

    Here is how ChatterBot can be deployed on pythonanywhere with Django:

  • Go to pythonanywhere and sign up for a plan
  • Create a new web app
  • Select Django
  • Select the Python version (3.6 was used in this example)
  • Select the project name/directory
  • It will create the following:
    /home/user/project_name
        manage.py
        my_template.html
        views.py
        project_name
            __init__.py
            settings.py
            urls.py
            wsgi.py

  • Create the folder cbot under the user folder:
    /home/user/cbot

  • Inside this folder create:
    __init__.py (this is just an empty file)
    chatbotpy.py

    Inside chatbotpy.py wrap everything into a function runbot like below. In this function we initiate the chatbot object, take the user input, train the chatbot if needed, and then ask for the chatbot response. The response provided by the chatbot is the output of this function. The input of this function is the user input that we get through the web.

    def runbot(inp, train_bot=False):
        from chatterbot import ChatBot

        chatbot = ChatBot("mychatbot",
            logic_adapters=[
            {
                "import_path": "chatterbot.logic.BestMatch",
                "statement_comparison_function": "chatterbot.comparisons.levenshtein_distance",
                "response_selection_method": "chatterbot.response_selection.get_first_response"
            },
            {
                'import_path': 'chatterbot.logic.LowConfidenceAdapter',
                'threshold': 0.65,
                'default_response': 'I am sorry, but I do not understand.'
            }
            ],
            trainer='chatterbot.trainers.ListTrainer')

        if (train_bot == True):
            print("Training")
            # insert here the training code from the python code for ChatterBot example

        response = chatbot.get_response(inp)
        return (response)
    

    Now update views.py as in the code box below. Here we take the user input from the web, feed this input to the runbot function and send the output of the runbot function (which is the chatbot response) to the web template.

    
    from django.shortcuts import render
    from cbot.chatbotpy import runbot
    
    def press_my_buttons(request):
        resp=""
        conv=""
        if request.POST:
            conv=request.POST.get('conv', '')
            user_input=request.POST.get('user_input', '')
    
            resp=runbot(user_input)
    
            conv=conv + "" + str(user_input) + "\n" + "BOT:"+ str(resp) + "\n"
        else:
            resp=runbot("")
            conv =  "BOT:"+ str(resp) + "\n";
       
        return render(request, 'my_template.html', {'conv': conv})
    

    Now update my_template.html as shown below. Here we just show the new response together with the previous conversation.

    <html>
    <form method="post">
        {% csrf_token %}
    
        <textarea rows=20 cols=60>{{conv}}</textarea>
    
        <br><br>
        <input type="textbox" name="user_input" value=""/>
    
        <button type="submit">Submit</button>
    
        <input type="hidden" name =conv  value="{{conv}}" />
        {{resp}}
    </form>
    </html>
    

    Now update some configuration.
    Update or confirm that manage.py includes the line with the settings module:

    if __name__ == '__main__':
        os.environ.setdefault('DJANGO_SETTINGS_MODULE', 'cbotdjango.settings')
    

    Update or confirm that urls.py has the path setup like below:

    import sys
    path = "/home/user/cbotdjango"
    if path not in sys.path:
        sys.path.append(path)
    

    Now you are ready to test the chatbot online, and you should see a screen similar to the one on the right side of the setup diagram.

    Conclusion

    We saw how to train ChatterBot and explored some of its functionality. We investigated how to install ChatterBot on pythonanywhere with the Django web framework. I hope this will make it easy to deploy a chatbot in case you are going this route. If you have any tips or anything else to add, please leave a comment below.

    References
    1. ChatterBot
    2. Python chatbot code example
    3. Python Anywhere