In this post (and a few following posts) we will look at how to get interesting information by extracting links from the results of a Twitter keyword search and applying machine learning text mining. While there are many other posts on the same topic, we will also cover the additional small steps needed to process the data, such as unshortening URLs, setting a date interval, and saving or reading the information.
Below we will focus on extracting links from the results of the Twitter search API in Python.
Getting Login Information for Twitter API
The first step is to set up an application on Twitter and get the login credentials. This is already described in other posts on the web [1].
Below is the code snippet for this:
import tweepy as tw

CONSUMER_KEY = "xxxxx"
CONSUMER_SECRET = "xxxxxxx"
OAUTH_TOKEN = "xxxxx"
OAUTH_TOKEN_SECRET = "xxxxxx"

auth = tw.OAuthHandler(CONSUMER_KEY, CONSUMER_SECRET)
auth.set_access_token(OAUTH_TOKEN, OAUTH_TOKEN_SECRET)
api = tw.API(auth, wait_on_rate_limit=True)
Defining the Search Values
Now you can search by keywords or hashtags and get tweets.
When searching, we might want to specify a start day so that only results dated on or after that day are returned.
For this we can write the following:
from datetime import datetime
from datetime import timedelta

NUMBER_of_TWEETS = 20
SEARCH_BEHIND_DAYS = 60

today_date = datetime.today().strftime('%Y-%m-%d')
today_date_datef = datetime.strptime(today_date, '%Y-%m-%d')
start_date = today_date_datef - timedelta(days=SEARCH_BEHIND_DAYS)

# search_terms is defined in the full script below
for search_term in search_terms:
    tweets = tw.Cursor(api.search,
                       q=search_term,
                       lang="en",
                       since=start_date.strftime('%Y-%m-%d')).items(NUMBER_of_TWEETS)
The above search will return 20 tweets per search term and will only look within 60 days from the day of the search. If we want to use a fixed date we can replace this with since='2019-12-01'.
Processing Extracted Links
Once we have the tweet text we can extract the links. However, we will get different types of links: some are internal Twitter links, some are shortened, and some are regular URLs.
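A minimal sketch of this extraction step (the simplified URL pattern and the sample tweet text here are illustrative assumptions; the full script below uses a more detailed pattern):

import re

# Simplified URL pattern for illustration; the full script uses a more detailed one
URL_PATTERN = r'http[s]?://\S+'

def extract_links(tweet_text):
    # Return all URL-like substrings found in the tweet text
    return re.findall(URL_PATTERN, tweet_text)

print(extract_links("New post on text mining https://t.co/abc123 #nlp"))
# ['https://t.co/abc123']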
So here is a helper function for expanding shortened links. We do not need internal links – the links that belong to Twitter navigation or other functionality – so those are filtered out later when processing the extracted URLs.
try:
    import urllib.request as urllib2
except ImportError:
    import urllib2
import http.client
import urllib.parse as urlparse

def unshortenurl(url):
    # Send a HEAD request and, if the response is a redirect, return its Location header
    parsed = urlparse.urlparse(url)
    h = http.client.HTTPConnection(parsed.netloc)
    h.request('HEAD', parsed.path)
    response = h.getresponse()
    if response.status >= 300 and response.status < 400 and response.getheader('Location'):
        return response.getheader('Location')
    else:
        return url
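As a quick usage sketch (the example URLs are made up; the twitter.com check mirrors the filtering done in the full script below):

# Hypothetical example URLs extracted from a tweet
urls = ["https://t.co/abc123", "https://twitter.com/i/web/status/1"]

for url in urls:
    # Skip internal Twitter links; try to expand the rest
    if "https://twitter.com" in url:
        continue
    print(unshortenurl(url))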
Once we have the links we can save the URL information in a CSV file. Together with each link we save the tweet text and the date.
Additionally, we count the number of hashtags and links and save this information into CSV files as well, so the output of the program is three CSV files.
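A minimal sketch of the saving step, assuming the same field names as in the full script below (the sample row is made up for illustration):

import csv
from datetime import datetime

today_date = datetime.today().strftime('%Y-%m-%d')

with open('data.csv', 'a', encoding="utf8", newline='') as csvfile:
    fieldnames = ['Search', 'URL', 'Tweet', 'Entered on']
    writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
    writer.writeheader()
    # Each extracted link is saved together with the search term, tweet text and date
    writer.writerow({'Search': '#chatbot -filter:retweets',
                     'URL': 'https://example.com/some-article',
                     'Tweet': 'Example tweet text',
                     'Entered on': today_date})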
Conclusion
Looking at the output file we can quickly identify the links of interest. For example, just while testing this script I found two interesting links that I was not aware of. In the following post we will look at how to automate finding interesting links even further using Twitter text mining.
Below you can find the full source code and references to web resources that were used for this post or are related to this topic.
# -*- coding: utf-8 -*-
import tweepy as tw
import re
import csv
from datetime import datetime
from datetime import timedelta

NUMBER_of_TWEETS = 20
SEARCH_BEHIND_DAYS = 60

today_date = datetime.today().strftime('%Y-%m-%d')
today_date_datef = datetime.strptime(today_date, '%Y-%m-%d')
start_date = today_date_datef - timedelta(days=SEARCH_BEHIND_DAYS)

try:
    import urllib.request as urllib2
except ImportError:
    import urllib2
import http.client
import urllib.parse as urlparse


def unshortenurl(url):
    # Follow a single redirect to reveal the target of a shortened link
    parsed = urlparse.urlparse(url)
    h = http.client.HTTPConnection(parsed.netloc)
    h.request('HEAD', parsed.path)
    response = h.getresponse()
    if response.status >= 300 and response.status < 400 and response.getheader('Location'):
        return response.getheader('Location')
    else:
        return url


CONSUMER_KEY = "xxxxx"
CONSUMER_SECRET = "xxxxxxx"
OAUTH_TOKEN = "xxxxxxxx"
OAUTH_TOKEN_SECRET = "xxxxxxx"

auth = tw.OAuthHandler(CONSUMER_KEY, CONSUMER_SECRET)
auth.set_access_token(OAUTH_TOKEN, OAUTH_TOKEN_SECRET)
api = tw.API(auth, wait_on_rate_limit=True)

# Create custom search terms
search_terms = ["#chatbot -filter:retweets",
                "#chatbot+machine_learning -filter:retweets",
                "#chatbot+python -filter:retweets",
                "text classification -filter:retweets",
                "text classification python -filter:retweets",
                "machine learning applications -filter:retweets",
                "sentiment analysis python -filter:retweets",
                "sentiment analysis -filter:retweets"]


def count_urls():
    # Count how many distinct tweets mention each URL saved in data.csv
    url_counted = dict()
    url_count = dict()
    with open('data.csv', 'r', encoding="utf8") as csvfile:
        line = csvfile.readline()
        while line != '':  # The EOF char is an empty string
            line = csvfile.readline()
            items = line.split(",")
            if len(items) < 3:
                continue
            url = items[1]
            twt = items[2]
            # key = Tweet and URL
            key = twt[:30] + "___" + url
            if key not in url_counted:
                url_counted[key] = 1
                if url in url_count:
                    url_count[url] += 1
                else:
                    url_count[url] = 1
    print_count_urls(url_count)


def print_count_urls(url_count_data):
    for key, value in url_count_data.items():
        print(key, "=>", value)
    with open('data_url_count.csv', 'w', encoding="utf8", newline='') as csvfile_link_count:
        fieldnames = ['URL', 'Count']
        writer = csv.DictWriter(csvfile_link_count, fieldnames=fieldnames)
        writer.writeheader()
        for key, value in url_count_data.items():
            writer.writerow({'URL': key, 'Count': value})


def extract_hash_tags(s):
    return set(part[1:] for part in s.split() if part.startswith('#'))


def save_tweet_info(tw, twt_dict, htags_dict):
    # Remember each unique tweet and update the hashtag counts
    if tw not in twt_dict:
        htags = extract_hash_tags(tw)
        twt_dict[tw] = 1
        for ht in htags:
            if ht in htags_dict:
                htags_dict[ht] = htags_dict[ht] + 1
            else:
                htags_dict[ht] = 1


def print_count_hashtags(htags_count_data):
    for key, value in htags_count_data.items():
        print(key, "=>", value)
    with open('data_htags_count.csv', 'w', encoding="utf8", newline='') as csvfile_link_count:
        fieldnames = ['Hashtag', 'Count']
        writer = csv.DictWriter(csvfile_link_count, fieldnames=fieldnames)
        writer.writeheader()
        for key, value in htags_count_data.items():
            writer.writerow({'Hashtag': key, 'Count': value})


tweet_dict = dict()
hashtags_dict = dict()

for search_term in search_terms:
    tweets = tw.Cursor(api.search,
                       q=search_term,
                       lang="en",
                       # since='2019-12-01').items(40)
                       since=start_date.strftime('%Y-%m-%d')).items(NUMBER_of_TWEETS)

    with open('data.csv', 'a', encoding="utf8", newline='') as csvfile:
        fieldnames = ['Search', 'URL', 'Tweet', 'Entered on']
        writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
        writer.writeheader()
        for tweet in tweets:
            urls = re.findall('http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+',
                              tweet.text)
            save_tweet_info(tweet.text, tweet_dict, hashtags_dict)
            for url in urls:
                try:
                    res = urllib2.urlopen(url)
                    actual_url = res.geturl()
                    if ("https://twitter.com" in actual_url) == False:
                        if len(actual_url) < 32:
                            actual_url = unshortenurl(actual_url)
                        print(actual_url)
                        writer.writerow({'Search': search_term,
                                         'URL': actual_url,
                                         'Tweet': tweet.text,
                                         'Entered on': today_date})
                except Exception:
                    print(url)

print_count_hashtags(hashtags_dict)
count_urls()
References
1. Text mining: Twitter extraction and stepwise guide to generate a word cloud
2. Analyze Word Frequency Counts Using Twitter Data and Tweepy in Python
3. Unshorten URL in Python 3
4. How can I un-shorten a URL using Python?
5. Extracting external links from tweets in Python