Twitter Text Mining with Python

In this post (and a few following posts) we will look at how to get interesting information by extracting links from the results of a Twitter keyword search and applying machine learning text mining. While there are many other posts on the same topic, we will also cover the additional small steps needed to process the data, such as unshortening URLs, setting the date interval, and saving or reading the information.

Below we will focus on extracting links from the results of the Twitter search API in Python.

Getting Login Information for Twitter API

The first step is to set up an application on Twitter and get the login credentials. This is already described in several posts on the web [1].
Below is the code snippet for this:

import tweepy as tw
    
CONSUMER_KEY ="xxxxx"
CONSUMER_SECRET ="xxxxxxx"
OAUTH_TOKEN = "xxxxx"
OAUTH_TOKEN_SECRET = "xxxxxx"

auth = tw.OAuthHandler(CONSUMER_KEY, CONSUMER_SECRET)
auth.set_access_token(OAUTH_TOKEN, OAUTH_TOKEN_SECRET)
api = tw.API(auth, wait_on_rate_limit=True)
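
As an optional quick check (not part of the original program flow), we can verify that the credentials are accepted before running any searches:

# Optional: confirm that authentication works before searching
user = api.verify_credentials()
print("Authenticated as:", user.screen_name)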

Defining the Search Values

Now you can search by keywords or hashtags and get tweets.
When we search we may want to specify a start day so that only tweets dated on or after that day are returned.

For this we can use code like the following:

from datetime import datetime
from datetime import timedelta

NUMBER_of_TWEETS = 20
SEARCH_BEHIND_DAYS=60
today_date=datetime.today().strftime('%Y-%m-%d')


today_date_datef = datetime.strptime(today_date, '%Y-%m-%d')
start_date = today_date_datef - timedelta(days=SEARCH_BEHIND_DAYS)


for search_term in search_terms:
  tweets = tw.Cursor(api.search,
                   q=search_term,
                   lang="en",
                   since=start_date.strftime('%Y-%m-%d')).items(NUMBER_of_TWEETS)

The above search will return 20 tweets per search term and will only look within the last 60 days from the day of the search. If we want to use a fixed date instead, we can replace the since value with since='2019-12-01'.
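
As a quick check, we can iterate over the returned cursor and print each tweet's date and text before adding any CSV logic (a minimal sketch; search_terms is the list of query strings defined in the full program below):

# Minimal sketch: print the date and text of each returned tweet
for search_term in search_terms:
    tweets = tw.Cursor(api.search,
                       q=search_term,
                       lang="en",
                       since=start_date.strftime('%Y-%m-%d')).items(NUMBER_of_TWEETS)
    for tweet in tweets:
        print(tweet.created_at, tweet.text)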

Processing Extracted Links

Once we have the tweet text we can extract the links. However, we will get different types of links: some are internal Twitter links, some are shortened, and some are regular URLs.

So here is a function to sort out the links. We do not need internal links – the links that belong to Twitter navigation or other functionality.

try:
    import urllib.request as urllib2
except ImportError:
    import urllib2


import http.client
import urllib.parse as urlparse   

def unshortenurl(url):
    # Follow a single HTTP redirect (3xx) and return the target URL;
    # otherwise return the original URL unchanged.
    parsed = urlparse.urlparse(url)
    if parsed.scheme == 'https':
        h = http.client.HTTPSConnection(parsed.netloc)
    else:
        h = http.client.HTTPConnection(parsed.netloc)
    h.request('HEAD', parsed.path or '/')
    response = h.getresponse()
    if response.status >= 300 and response.status < 400 and response.getheader('Location'):
        return response.getheader('Location')
    return url
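
For example, calling the function on a shortened link should return the expanded target (the short URL below is only a hypothetical placeholder):

# Hypothetical shortened URL, used only for illustration
short_url = "http://bit.ly/example"
print(unshortenurl(short_url))  # prints the redirect target, or the input URL if there is no redirect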

Once we have the links we can save the URL information to a CSV file. Together with each link we save the tweet text and the date.
Additionally, we count the number of hashtags and links and save this information into CSV files as well, so the program produces three CSV files in total.
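
For reference, here is a minimal sketch of writing one row to the main output file data.csv (the field names match the full program below; the row values here are just placeholders):

import csv

# Minimal sketch: append one placeholder row to data.csv
with open('data.csv', 'a', encoding="utf8", newline='') as csvfile:
    fieldnames = ['Search', 'URL', 'Tweet', 'Entered on']
    writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
    writer.writerow({'Search': '#chatbot', 'URL': 'https://example.com/article',
                     'Tweet': 'sample tweet text', 'Entered on': '2019-12-01'})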

Conclusion

Looking at the output file we can quickly identify links of interest. For example, just while testing this script I found two interesting links that I was not aware of. In a following post we will look at how to automate finding useful links even further using Twitter text mining.

Below you can find the full source code and references to web resources that were used for this post or are related to this topic.

# -*- coding: utf-8 -*-

import tweepy as tw
import re
import csv

from datetime import datetime
from datetime import timedelta

NUMBER_of_TWEETS = 20
SEARCH_BEHIND_DAYS=60
today_date=datetime.today().strftime('%Y-%m-%d')


today_date_datef = datetime.strptime(today_date, '%Y-%m-%d')
start_date = today_date_datef - timedelta(days=SEARCH_BEHIND_DAYS)
try:
    import urllib.request as urllib2
except ImportError:
    import urllib2


import http.client
import urllib.parse as urlparse   

def unshortenurl(url):
    # Follow a single HTTP redirect (3xx) and return the target URL;
    # otherwise return the original URL unchanged.
    parsed = urlparse.urlparse(url)
    if parsed.scheme == 'https':
        h = http.client.HTTPSConnection(parsed.netloc)
    else:
        h = http.client.HTTPConnection(parsed.netloc)
    h.request('HEAD', parsed.path or '/')
    response = h.getresponse()
    if response.status >= 300 and response.status < 400 and response.getheader('Location'):
        return response.getheader('Location')
    return url
    
    
CONSUMER_KEY ="xxxxx"
CONSUMER_SECRET ="xxxxxxx"
OAUTH_TOKEN = "xxxxxxxx"
OAUTH_TOKEN_SECRET = "xxxxxxx"


auth = tw.OAuthHandler(CONSUMER_KEY, CONSUMER_SECRET)
auth.set_access_token(OAUTH_TOKEN, OAUTH_TOKEN_SECRET)
api = tw.API(auth, wait_on_rate_limit=True)
# Create a custom search term 

search_terms=["#chatbot -filter:retweets", 
              "#chatbot+machine_learning -filter:retweets", 
              "#chatbot+python -filter:retweets",
              "text classification -filter:retweets",
              "text classification python -filter:retweets",
              "machine learning applications -filter:retweets",
              "sentiment analysis python  -filter:retweets",
              "sentiment analysis  -filter:retweets"]
              
        
              
def count_urls():
    # Count how many distinct tweets mention each URL, reading back data.csv
    url_counted = dict()
    url_count = dict()
    with open('data.csv', 'r', encoding="utf8") as csvfile:
        reader = csv.reader(csvfile)
        for items in reader:
            # skip malformed rows and repeated header rows
            if len(items) < 3 or items[1] == 'URL':
                continue

            url = items[1]
            twt = items[2]
            # key = Tweet and Url
            key = twt[:30] + "___" + url

            if key not in url_counted:
                url_counted[key] = 1
                if url in url_count:
                    url_count[url] += 1
                else:
                    url_count[url] = 1
    print_count_urls(url_count)

       
def print_count_urls(url_count_data):
   
         for key, value in url_count_data.items():
              print (key, "=>", value)
              
         with open('data_url_count.csv', 'w', encoding="utf8", newline='' ) as csvfile_link_count: 
            fieldnames = ['URL', 'Count']
            writer = csv.DictWriter(csvfile_link_count, fieldnames=fieldnames)
            writer.writeheader() 
            
            for key, value in url_count_data.items():
                 writer.writerow({'URL': key, 'Count': value })   
            
           
def extract_hash_tags(s):
    return set(part[1:] for part in s.split() if part.startswith('#'))
    

   
def save_tweet_info(tw, twt_dict, htags_dict ):
   
    if tw not in twt_dict:
        htags=extract_hash_tags(tw)
        twt_dict[tw]=1
        for ht in htags:
            if ht in htags_dict:
                htags_dict[ht]=htags_dict[ht]+1
            else:   
                htags_dict[ht]=1


def print_count_hashtags(htags_count_data):
        
         for key, value in htags_count_data.items():
              print (key, "=>", value)
              
         with open('data_htags_count.csv', 'w', encoding="utf8", newline='' ) as csvfile_link_count: 
            fieldnames = ['Hashtag', 'Count']
            writer = csv.DictWriter(csvfile_link_count, fieldnames=fieldnames)
            writer.writeheader() 
            
            for key, value in htags_count_data.items():
                 writer.writerow({'Hashtag': key, 'Count': value })          
        


tweet_dict = dict() 
hashtags_dict = dict()

                 
for search_term in search_terms:
  tweets = tw.Cursor(api.search,
                   q=search_term,
                   lang="en",
                   #since='2019-12-01').items(40)
                   since=start_date.strftime('%Y-%m-%d')).items(NUMBER_of_TWEETS)

  with open('data.csv', 'a', encoding="utf8", newline='' ) as csvfile: 
     fieldnames = ['Search', 'URL', 'Tweet', 'Entered on']
     writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
     writer.writeheader()
     

     for tweet in tweets:
         urls = re.findall('http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+', tweet.text)

         save_tweet_info(tweet.text, tweet_dict, hashtags_dict)
         for url in urls:
             try:
                 res = urllib2.urlopen(url)
                 actual_url = res.geturl()

                 # Skip internal Twitter links; expand remaining short links
                 if "https://twitter.com" not in actual_url:

                     if len(actual_url) < 32:
                         actual_url = unshortenurl(actual_url)
                     print(actual_url)

                     writer.writerow({'Search': search_term, 'URL': actual_url, 'Tweet': tweet.text, 'Entered on': today_date})

             except:
                 print(url)

            
print_count_hashtags(hashtags_dict)
count_urls()      

References

1. Text mining: Twitter extraction and stepwise guide to generate a word cloud
2. Analyze Word Frequency Counts Using Twitter Data and Tweepy in Python
3. Unshorten URL in Python 3
4. How can I un-shorten a URL using Python
5. Extracting external links from tweets in Python
