Sentiment Analysis with VADER

Sentiment analysis (also known as opinion mining) refers to the use of natural language processing, text analysis, and computational linguistics to systematically identify, extract, quantify, and study affective states and subjective information. [1] In short, sentiment analysis gives an objective idea of whether a text uses mostly positive, negative, or neutral language. [2]

Sentiment analysis software can help estimate people's opinions on events in the financial world, generate reports of relevant information, and analyze correlations between events and stock prices.



The problem

In this post we investigate how to extract information about a company and detect its sentiment. For each sentence or paragraph of text we will detect how positive or negative it is by calculating a sentiment score. This is also called polarity: in sentiment analysis, polarity refers to identifying the sentiment orientation (positive, neutral, or negative) of the text.

Given a list of companies, we want to find the polarity of sentiment in text that mentions company names from the list. Below is a description of how this can be implemented.

Getting Data

We will use Google to collect data. A script queries Google for documents matching some predefined keywords and returns links, which we save to an array.

try: 
    from googlesearch import search 
except ImportError:  
    print("No module named 'google' found") 
  
# to search 
query = "financial_news Warren Buffett 2019"

links=[]  
for j in search(query, tld="co.in", num=10, stop=10, pause=6): 
    print(j) 
    links.append(j)
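Note: the search helper used here typically comes from the google package on PyPI (pip install google); beautifulsoup4, requests, vaderSentiment, and matplotlib, used later in this post, are installed the same way. The package names are inferred from the import paths, so verify them against your environment.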

Preprocessing

After we have the links, we need to fetch each document and remove unneeded text and characters. In this step we strip HTML tags and invalid characters, but we keep the paragraph tags at first: they let us divide each document into smaller text units. After splitting, the p tags are removed as well.

para_data=[]


def get_text(url):
    print (url)

    try:
        req = requests.get(url, timeout=5)
    except requests.exceptions.RequestException:
        return "TIMEOUT ERROR"

    data = req.text
    soup = BeautifulSoup(data, "html.parser")

    # keep only visible paragraph tags
    paras = []
    paras_ = soup.find_all('p')
    filtered_paras = filter(tag_visible, paras_)
    for s in filtered_paras:
        paras.append(s)
    for para in paras:
        para = remove_tags(para)
        # remove non-text characters
        para_data.append(clean_txt(para))

Calculating Sentiment

Now we calculate the sentiment score using VADER (Valence Aware Dictionary and sEntiment Reasoner). VADER is a lexicon and rule-based sentiment analysis tool that is specifically attuned to sentiments expressed in social media. [3] Based on the calculated sentiment we build a plot. In this example we only build the plot for the first company name, which is Coca Cola.

from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer 
  

def sentiment_scores(sentence): 
    
    # Create a SentimentIntensityAnalyzer object. 
    sid_obj = SentimentIntensityAnalyzer() 
  
    # The polarity_scores method of SentimentIntensityAnalyzer
    # returns a sentiment dictionary
    # which contains pos, neg, neu, and compound scores.
    sentiment_dict = sid_obj.polarity_scores(sentence) 
      
    print("Overall sentiment dictionary is : ", sentiment_dict) 
    print("sentence was rated as ", sentiment_dict['neg']*100, "% Negative") 
    print("sentence was rated as ", sentiment_dict['neu']*100, "% Neutral") 
    print("sentence was rated as ", sentiment_dict['pos']*100, "% Positive")

    
    # decide sentiment as positive, negative or neutral
    if sentiment_dict['compound'] >= 0.05:
        print("Positive")
    elif sentiment_dict['compound'] <= -0.05:
        print("Negative")
    else:
        print("Neutral")
    return sentiment_dict['compound']
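As a quick sanity check, the sketch below calls the function on a made-up sentence; the exact numbers depend on the installed vaderSentiment lexicon version.

# illustrative call on a hypothetical sentence; scores vary slightly
# between vaderSentiment versions
score = sentiment_scores("The company reported great earnings and the stock soared.")
print(score)  # expect a compound score above 0.05, i.e. rated Positive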

Below you can find full source code.

# -*- coding: utf-8 -*-
from bs4 import BeautifulSoup, NavigableString
from bs4.element import Comment
import requests
import re

TAG_RE = re.compile(r'<[^>]+>')

def remove_tags(text_string):
    # strip any remaining HTML tags
    return TAG_RE.sub('', str(text_string))

MIN_LENGTH_of_document = 40
MIN_LENGTH_of_word = 2

from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer 
  
# function to print the sentiments
# of the sentence
# (based on a function from https://www.geeksforgeeks.org/python-sentiment-analysis-using-vader/)
def sentiment_scores(sentence): 

    # Create a SentimentIntensityAnalyzer object.
    sid_obj = SentimentIntensityAnalyzer()

    # The polarity_scores method of SentimentIntensityAnalyzer
    # returns a sentiment dictionary
    # which contains pos, neg, neu, and compound scores.
    sentiment_dict = sid_obj.polarity_scores(sentence)

    print("Overall sentiment dictionary is : ", sentiment_dict)
    print("sentence was rated as ", sentiment_dict['neg']*100, "% Negative")
    print("sentence was rated as ", sentiment_dict['neu']*100, "% Neutral")
    print("sentence was rated as ", sentiment_dict['pos']*100, "% Positive")
    print("Sentence Overall Rated As", end = " ")

    # decide sentiment as positive, negative or neutral
    if sentiment_dict['compound'] >= 0.05:
        print("Positive")
    elif sentiment_dict['compound'] <= -0.05:
        print("Negative")
    else:
        print("Neutral")

    return sentiment_dict['compound']

def remove_min_words(txt):
   # drop single-character words, keeping words of at least MIN_LENGTH_of_word = 2 characters
   # https://www.w3resource.com/python-exercises/re/python-re-exercise-49.php
   shortword = re.compile(r'\W*\b\w{1,1}\b')
   return(shortword.sub('', txt))
    
        
def clean_txt(text):
   # keep only letters, periods and spaces
   text = re.sub('[^A-Za-z. ]', ' ', str(text))
   # collapse repeated whitespace
   text = ' '.join(text.split())
   text = remove_min_words(text)
   text = text.lower()
   # discard text units that are too short
   text = text if len(text) >= MIN_LENGTH_of_document else ""
   return text
        

# tree-walking helpers (defined here but not used in the pipeline below)
def between(cur, end):
    while cur and cur != end:
        if isinstance(cur, NavigableString):
            text = cur.strip()
            if len(text):
                yield text
        cur = cur.next_element
        
def next_element(elem):
    while elem is not None:
        # Find next element, skip NavigableString objects
        elem = elem.next_sibling
        if hasattr(elem, 'name'):
            return elem        

def tag_visible(element):
    if element.parent.name in ['style', 'script', 'head',  'meta', '[document]']:
        return False
    if isinstance(element, Comment):
        return False
    return True
    
para_data=[]


def get_text(url):
    print (url)

    try:
        req = requests.get(url, timeout=5)
    except requests.exceptions.RequestException:
        return "TIMEOUT ERROR"

    data = req.text
    soup = BeautifulSoup(data, "html.parser")

    # keep only visible paragraph tags
    paras = []
    paras_ = soup.find_all('p')
    filtered_paras = filter(tag_visible, paras_)
    for s in filtered_paras:
        paras.append(s)
    for para in paras:
        para = remove_tags(para)
        # remove non-text characters
        para_data.append(clean_txt(para))
           
     

try: 
    from googlesearch import search 
except ImportError:  
    print("No module named 'google' found") 
  
# to search 
query = "coca cola 2019"

links=[]  
for j in search(query, tld="co.in", num=25, stop=25, pause=6): 
    print(j) 
    links.append(j)
  
# Here our list consists of one company name, but it can include more than one.
orgs=["coca cola"]
 
results=[]


def update_dict_value(d, key, value):
    # accumulate value under key (avoid shadowing the built-in name 'dict')
    if key in d:
        d[key] = d[key] + value
    else:
        d[key] = value
    return d

for link in links:
    # get_text appends this page's paragraphs to para_data
    get_text(link)

# score each collected paragraph once, after all links have been fetched
for pr in para_data:

    for org in orgs:
        if pr.find(org) >= 0:
            # extract sentiment
            results.append([org, sentiment_scores(pr), pr])


positive={}
negative={}
positive_sentiment={}
negative_sentiment={}

  
for i in range(len(results)):
    org = results[i][0]

    if results[i][1] >= 0:
        positive = update_dict_value(positive, org, 1)
        positive_sentiment = update_dict_value(positive_sentiment, org, results[i][1])
    else:
        negative = update_dict_value(negative, org, 1)
        negative_sentiment = update_dict_value(negative_sentiment, org, results[i][1])

# average the scores; guard against organizations that have
# no positive or no negative paragraphs
for org in orgs:
    if org in positive:
        positive_sentiment[org] = positive_sentiment[org] / positive[org]
    if org in negative:
        negative_sentiment[org] = negative_sentiment[org] / negative[org]

import matplotlib.pyplot as plt 


# bar labels
labels = ['negative', 'positive']

# heights of bars: average negative score (sign flipped so the bar points up)
# and average positive score
sentiment = [(-1)*negative_sentiment[orgs[0]], positive_sentiment[orgs[0]]]

# plotting a bar chart
plt.bar(labels, sentiment, width = 0.8, color = ['red', 'green'])

# naming the axes
plt.xlabel('sentiment polarity')
plt.ylabel('average compound score')
# plot title
plt.title('Sentiment Analysis')

# show the plot
plt.show()

References
1. Sentiment analysis, Wikipedia
2. What is a “Sentiment Score” and how is it measured?
3. VADER-Sentiment-Analysis

Sentiment Analysis of Twitter Data

Sentiment analysis of text (or opinion mining) allows us to extract opinions from user comments on the web. Applications of sentiment analysis include understanding what customers think about a product or its features and discovering how users react to certain events.

A basic task in sentiment analysis is classifying the polarity of a given text, which can be positive, negative, or neutral.

Advanced, “beyond polarity” sentiment classification looks at emotional states such as “angry”, “sad”, and “happy”. [1]

In this post you will find an example of how to calculate polarity in sentiment analysis for Twitter data using Python. Polarity in this example will have two labels: positive or negative.
At the end of this post you will also find links to several of the most comprehensive tutorials on Twitter sentiment analysis from other websites.

Dataset for Sentiment Analysis of Twitter Data

We will use a Twitter dataset that can be downloaded from this link [3] from CrowdFlower [4]. The dataset contains labels for the emotional content (such as happiness, sadness, and anger) of texts: about 40,000 rows of examples across 13 labels. A subset of this data was used in an experiment for Microsoft's Cortana Intelligence Gallery.
The dataset has 4 columns (a small loading sketch follows the list):
tweet_id
sentiment (for example happy, sad)
author
content
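A minimal sketch of peeking at the data, assuming the CSV is saved locally as text_emotion.csv (the path is an assumption; adjust it for your machine):

import csv

# print the header and the first three data rows
with open("text_emotion.csv", encoding="utf-8") as f:
    reader = csv.reader(f)
    print(next(reader))   # header: tweet_id, sentiment, author, content
    for _ in range(3):
        row = next(reader)
        print(row[1], "->", row[3][:60])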

Preprocessing of Twitter Data

We will remove some special characters and links using the function below, found on the Internet.

import re
# below function is based on example from 
# http://www.geeksforgeeks.org/twitter-sentiment-analysis-using-python/
def clean_tweet( tweet):
        '''
        Utility function to clean tweet text by removing links, special characters
        using simple regex statements.
        '''
        tweet = tweet.lower() 
        return ' '.join(re.sub(r"(@[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+://\S+)", " ", tweet).split())
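For example (the tweet text is made up):

# mentions, links, punctuation, and hashtag symbols become spaces
print(clean_tweet("Loving the new update! http://t.co/abc123 @someuser #happy"))
# -> 'loving the new update happy'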

We also remove stop words, as shown below.

from many_stop_words import get_stop_words
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from itertools import chain


from nltk.classify import NaiveBayesClassifier, accuracy

stop_words = list(get_stop_words('en'))         # about 900 stop words
nltk_words = list(stopwords.words('english'))   # about 150 stop words
stop_words.extend(nltk_words)
stop_words = set(stop_words)                    # a set makes membership tests fast

def remove_stopwords(word_list):

        filtered_tweet = ""
        for word in word_list:
            word = word.lower()
            if word not in stop_words:   # use the combined list built above
                filtered_tweet = filtered_tweet + " " + word

        return filtered_tweet.lstrip()
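A quick usage sketch (the exact words kept depend on the combined stop word lists):

print(remove_stopwords("this is a very important announcement".split()))
# stop words such as 'this', 'is', 'a', 'very' are dropped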

Approach for Tweet Sentiment Analysis

We will divide the tweet data into training and testing datasets. To train a classifier that detects polarity in the content column, we will use the training dataset with the content (X) and sentiment (Y) fields.

As we already have an emotion column for the tweets, we do not need to do feature selection for classification.

However, we will map the 13 emotion categories to positive, negative, or neutral, and skip the neutral ones.

Here is how we do mapping in the script:

polarity = {'empty' : 'N',
                'sadness' : 'N',
                'enthusiasm' : 'P',
                'neutral' : 'neutral',
                'worry' : 'N',
                'surprise' : 'P',
                'love' : 'P',
                'fun' : 'P',
                'hate' : 'N',
                'happiness' : 'P',
                'boredom' : 'N',
                'relief' : 'P',
                'anger' : 'N'
         }  
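For example, looking up two of the dataset labels:

print(polarity['happiness'])  # 'P'
print(polarity['worry'])      # 'N'
# rows labeled 'neutral' map to 'neutral' and are skipped when building the training data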

Text Classification – Using NLTK for Sentiment Analysis

There are different classification techniques that can be utilized in sentiment analysis; a detailed survey of methods was published in the paper [2]. The paper also includes an accuracy comparison and a description of the sentiment analysis process.

Our task is to train a classifier to detect polarity (negative, positive) for unseen tweets.
We will use the NLTK NaiveBayesClassifier algorithm.

For NLTK we do not need to convert the text to numeric vectors as we do for scikit-learn. We just need to tokenize the text and feed it to the machine learning classification algorithm.

Our vocabulary consists of the tweet words plus a polarity label (P or N) for each tweet. Here is how it looks:

vocabulary for sentiment analysis of Twitter data with NLTK

From the vocabulary we need to create a feature set for the Naive Bayes classifier we are going to use. In our model each word in the tweet is treated as a feature: each tweet is "projected" onto the vocabulary, and each vocabulary word gets the value True if it appears in the given tweet and False otherwise. At the end of each feature dictionary we store the polarity label of the tweet.
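To make the structure concrete, here is a toy sketch with a hypothetical three-word vocabulary (it assumes NLTK's punkt tokenizer data is downloaded):

from nltk.tokenize import word_tokenize

vocabulary = {"love", "stock", "crash"}   # hypothetical vocabulary
tweet, tag = "i love this stock", "P"
features = ({w: (w in word_tokenize(tweet)) for w in vocabulary}, tag)
print(features)
# ({'love': True, 'stock': True, 'crash': False}, 'P')  -- key order may vary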

Below is a screenshot of the feature set; the polarity label (N or P) is highlighted, and the data was reduced to just 10 tweets for this picture.

sentiment analysis twitter data – feature set
vocabulary = set(chain(*[word_tokenize(i[0].lower()) for i in training_data]))
feature_set = [({i:(i in word_tokenize(sentence.lower())) for i in vocabulary},tag) for sentence, tag in training_data]
size = int(len(feature_set) * 0.2)   # hold out the first 20% for testing
train_set, test_set = feature_set[size:], feature_set[:size]

classifier = NaiveBayesClassifier.train(train_set)
print(accuracy(classifier, test_set))

Results of Tweet Sentiment Analysis

Here are the results of running the Python source code described above:
Accuracy: 73%
The run time was as long as 50 minutes, even though the data sample was limited to 1000 rows; this may be because the laptop had only 6 GB of memory.
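Part of that run time comes from the feature-set line shown earlier: word_tokenize runs once per vocabulary word for every tweet. A sketch of an equivalent but faster variant tokenizes each tweet only once:

# same feature sets as before, but each tweet is tokenized a single time
feature_set = []
for sentence, tag in training_data:
    tokens = set(word_tokenize(sentence.lower()))
    feature_set.append(({w: (w in tokens) for w in vocabulary}, tag))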

So we learned how to detect negative or positive polarity for sentiment analysis of Twitter data. The results show that some improvements are still needed; for example, we could preprocess the Twitter data better by transforming slang and short-form words into regular words.

Below you can find full python source code.

# sentiment analysis of text twitter data
import re


# below function is based on http://www.geeksforgeeks.org/twitter-sentiment-analysis-using-python/
def clean_tweet( tweet):
        '''
        Utility function to clean tweet text by removing links, special characters
        using simple regex statements.
        '''
        tweet = tweet.lower() 
        return ' '.join(re.sub(r"(@[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+://\S+)", " ", tweet).split())
    

# below few lines are from https://stackoverflow.com/questions/5486337/how-to-remove-stop-words-using-nltk-or-python   
from many_stop_words import get_stop_words
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from itertools import chain

from nltk.classify import NaiveBayesClassifier, accuracy
stop_words = list(get_stop_words('en'))         # about 900 stop words
nltk_words = list(stopwords.words('english'))   # about 150 stop words
stop_words.extend(nltk_words)
stop_words = set(stop_words)                    # a set makes membership tests fast

def remove_stopwords(word_list):

        filtered_tweet = ""
        for word in word_list:
            word = word.lower() # in case they aren't all lower cased
            if word not in stop_words:   # use the combined list built above
                filtered_tweet = filtered_tweet + " " + word

        return filtered_tweet.lstrip()
    

filefolder="C:\\Users\\Downloads"
filename=filefolder + "\\text_emotion.csv"
   
polarity = {'empty' : 'N',
                'sadness' : 'N',
                'enthusiasm' : 'P',
                'neutral' : 'neutral',
                'worry' : 'N',
                'surprise' : 'P',
                'love' : 'P',
                'fun' : 'P',
                'hate' : 'N',
                'happiness' : 'P',
                'boredom' : 'N',
                'relief' : 'P',
                'anger' : 'N'
         }  
   
tweets = []
training_data = []
import csv
with open(filename) as csvDataFile:
    csvReader = csv.reader(csvDataFile)
    count=0
    for row in csvReader:
      
        if (row[1] == 'neutral' or row[1] == 'sentiment') :  # skip neutral tweets and the header row
            continue
        tweet= clean_tweet(row[3])
        tweet = remove_stopwords(tweet.split())
        tweets.append(tweet)
        training_data.append([tweet,  polarity[row[1]] ])
        count=count+1
        if (count >1000):
            break
        
# print (training_data)  # uncomment to inspect the cleaned training data
vocabulary = set(chain(*[word_tokenize(i[0].lower()) for i in training_data]))

feature_set = [({i:(i in word_tokenize(sentence.lower())) for i in vocabulary},tag) for sentence, tag in training_data]

size = int(len(feature_set) * 0.2)   # hold out the first 20% for testing
train_set, test_set = feature_set[size:], feature_set[:size]

classifier = NaiveBayesClassifier.train(train_set)
print(accuracy(classifier, test_set))

External Resources for Twitter Sentiment Analysis Tutorial

Comprehensive Hands on Guide to Twitter Sentiment Analysis with dataset and code
The author of this article shows how to solve the Twitter Sentiment Analysis Practice Problem.

Another Twitter sentiment analysis with Python — Part 1 This is post 1 of a series of 11 posts all about Twitter sentiment analysis with Python and related concepts. The posts cover topics such as word embeddings and neural networks. Below are just 2 posts from this series.

Another Twitter sentiment analysis with Python — Part 10 (Neural Network with Doc2Vec/Word2Vec/GloVe)

Another Twitter sentiment analysis with Python — Part 11 (CNN + Word2Vec)

Yet Another Twitter Sentiment Analysis Part 1 — tackling class imbalance

Basic data analysis on Twitter with Python – Here you will find a simple data analysis program that takes a given number of tweets, analyzes them, and displays the data in a scatter plot. The data represents how Twitter users perceived the bot created by the author, and their sentiment.

References
1. Sentiment Analysis
2. Analysis of Various Sentiment Classification Techniques
3. Emotion Dataset
4. Data for Everyone