- Mon 05 August 2019
- Projects
- Tony Hall
- #Sentiment Tweepy Twitter Social Media
Customer sentiment towards Australia's 'big 4' banks is arguably at an all-time low. The Banking Royal Commission put the spotlight on banking practices that fell well short of customer expectations and regulatory requirements, including charging financial advice fees to customers who never received that service and charging insurance premiums on behalf of deceased customers. So when I wanted to develop an application to measure customer sentiment, it made sense to use one of these banks as the source of my data.
The application below analyses a week's worth of tweets relating to one of these banks (NAB) to get an understanding of changing sentiment over time, as well as the topics driving that sentiment. It uses the Tweepy library to collect tweets from the Twitter API, the NLTK Natural Language Toolkit to process the text, and the VADER library to score sentiment.
The first step is to install and import the required libraries and initialise credentials for the Tweepy API:
!pip install tweepy nltk google-cloud-language python-telegram-bot vaderSentiment
from tweepy import OAuthHandler
import tweepy
import pandas as pd
import re
from nltk.tokenize import WordPunctTokenizer
from sklearn.feature_extraction.text import CountVectorizer
from datetime import datetime, timedelta
ACCESS_TOKEN = "1156516068988338176-DgBj98sjSOd2at1x07q7mTob9aSrSC"
ACCESS_TOKEN_SECRET = "vsaYYKi8NjSxDdsLHM3dDTjKkltuJhqUgZkfLMk6ffBNk"
CONSUMER_KEY="XfLdl1oZFEguUDD1eJApOFhW8"
CONSUMER_SECRET="9JT3gEHA7g7yusWVA3NjKKIZhbmA7IOcovBc63DVOgOdmiaoy0"
Pull tweets into a dataframe
Next we call the API with the keyword @NAB to search for all tweets related to the bank. Twitter only allows a week or two of tweets to be extracted without a premium account, so we'll settle for a start date of 7 days prior to today and a maximum of 10,000 tweets.
auth=OAuthHandler(CONSUMER_KEY,CONSUMER_SECRET)
auth.set_access_token(ACCESS_TOKEN,ACCESS_TOKEN_SECRET)
api=tweepy.API(auth)
keyword = '@NAB -filter:retweets'
total_tweets = 10000
def search_tweets(keyword, total_tweets):
    today_datetime = datetime.today().now()
    start_datetime = today_datetime - timedelta(days=7)
    today_date = today_datetime.strftime('%Y-%m-%d')
    start_date = start_datetime.strftime('%Y-%m-%d')
    search_result = tweepy.Cursor(api.search,
                                  tweet_mode='extended',
                                  q=keyword,
                                  since=start_date,
                                  result_type='recent',
                                  lang='en').items(total_tweets)
    return search_result
tweets=search_tweets(keyword,total_tweets)
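Collecting up to 10,000 tweets in one go can also run into Twitter's API rate limits. Tweepy can pause and resume automatically if you ask it to; an optional tweak to the client setup above (not part of the original run; the notify flag applies to the Tweepy 3.x line used here):
# Optional: let Tweepy sleep through rate-limit windows instead of raising an error
api = tweepy.API(auth,
                 wait_on_rate_limit=True,         # pause until the rate limit resets
                 wait_on_rate_limit_notify=True)  # print a message while waiting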
We create a list of all attributes of each tweet:
tweetlist = [[tweet.id, tweet.full_text, tweet.retweet_count, tweet.favorite_count,
              tweet.source, tweet.created_at, tweet.user.id, tweet.user.screen_name,
              tweet.user.name, tweet.user.created_at, tweet.user.description,
              tweet.user.followers_count, tweet.user.friends_count,
              tweet.user.location, tweet.user.time_zone] for tweet in tweets]
Then pull that into a dataframe for analysis:
df = pd.DataFrame(data=tweetlist,
                  columns=['id', 'text', 'retweets', 'favorite_count', 'source',
                           'created_at', 'userid', 'username', 'name', 'user_joined',
                           'user_desc', 'user_followers', 'user_friends',
                           'user_location', 'user_timezone'])
print(df.shape)
df.head()
So we have 414 tweets in our dataset.
Add sentiment rating for each tweet using Vader
We've used the Vader library for sentiment analysis. It's simple to use and performs fairly well on the shorthand text you get from Twitter.
#Initialise the analyser object:
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
analyser = SentimentIntensityAnalyzer()
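polarity_scores returns a dictionary with four values: neg, neu and pos (roughly the proportions of the text scored negative, neutral and positive) plus compound, a single normalised score between -1 (most negative) and +1 (most positive). We only use compound below. A quick check on a made-up sentence (invented for illustration, not a tweet from the dataset) shows the shape of the output:
# Illustrative only - the sentence is invented, not drawn from the collected tweets
example = "Thanks @NAB, the new app is fantastic!"
print(analyser.polarity_scores(example))  # dict with 'neg', 'neu', 'pos' and 'compound' keys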
Let's see what the sentiment is for one random tweet:
print(df.text[26])
print('Sentiment =',analyser.polarity_scores(df.text[26]).get('compound'))
A compound score of 0.9136 is very positive. Run the sentiment analyser over the whole dataframe, then look at a few results:
sentiment_score = df.apply(lambda row: analyser.polarity_scores(row['text']).get('compound'), axis=1)
sentiment_score.head()
In the first five tweets we have a couple of negative tweets and three neutral tweets. We'll now add the sentiment as a column to the dataframe:
df['sentiment_score'] = sentiment_score
df.head()
It will be useful later, when we graph the sentiment for each day, to be able to get the hashtags relating to that day. Then, when we see a drop or a spike in sentiment, we can look at the topics that drove it via the hashtags.
#Function which takes a date and returns hashtags for that date
def gethashtags(date):
    # convert the date string to datetime in order to add one day, then convert back to string
    dt = datetime.strptime(date, "%Y-%m-%d")
    dt2 = dt + timedelta(days=1)
    date2 = datetime.strftime(dt2, "%Y-%m-%d")
    text = df[(df['created_at'] > date) & (df['created_at'] < date2)].text.to_string()
    hashtags = [word for word in text.split() if word.startswith('#')]
    return hashtags
Let's see what the hashtags are for yesterday as an example:
gethashtags('2019-08-21')
Graph mean daily sentiment
Let's now graph mean sentiment changes over the past week. First we'll group the tweets by day:
daily_df = df.resample('D', on='created_at').mean()
daily_df.head()
import matplotlib.pyplot as plt
import numpy as np
Then we can plot the graph using the MatPlotLib library, adding a label showing the hashtags relevant for that day:
plt.style.use('seaborn-whitegrid')
plt.figure(figsize=(20, 10))
plt.plot(daily_df.index, daily_df.sentiment_score)
plt.title("Tweet Sentiment - Mean daily sentiment")
#plot the hashtags associated with each day on the graph
for day in daily_df.index:
    daystring = day.strftime('%Y-%m-%d')
    plt.text(daystring, daily_df.loc[day]['sentiment_score'], gethashtags(daystring))
FinTech and #Mortgagestress relate to positive tweets regarding potential fintech solutions to the problem of mortgage stress.
The major negative tweets related to the Clydesdale Bank 'Tailored Business Loan scam', which has attracted negative media attention.
Graph total, positive and negative tweet counts
In addition to mean daily tweet sentiment, the ratio of positive to negative tweets may be interesting as it gives a sense of the volume of activity.
def sentiment_category(sentiment_score):
    if sentiment_score >= 0.05:
        return "positive"
    if (sentiment_score < 0.05) and (sentiment_score > -0.05):
        return "neutral"
    if sentiment_score <= -0.05:
        return "negative"
sentiment = df.apply(lambda row: sentiment_category(row['sentiment_score']),axis=1)
df['sentiment']=sentiment
df.head()
df_sentiment_onehot = pd.get_dummies(df['sentiment'])
df=df.join(df_sentiment_onehot)
daily_count_df = df[['created_at','negative','neutral','positive']].resample('D',on='created_at').sum()
plt.style.use('seaborn-whitegrid')
plt.figure(figsize=(20, 3))
plt.plot(daily_count_df.index, daily_count_df.negative, color = 'red')
plt.plot(daily_count_df.index, daily_count_df.positive, color = 'green')
plt.plot(daily_count_df.index, daily_count_df.neutral, color = 'blue')
plt.title("Tweet Sentiment - Count of sentiment type")
#plt.text('2019-08-02', 0.05,"EY minutes released")
plt.legend()
The graph above doesn't tell us a lot over a one-week period. However, it would likely be more compelling over a longer timeframe, such as six months; the relative impact of major events on Twitter traffic could be measured in this way.
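Because the standard search API only reaches back about a week, building that longer history means collecting as you go. One simple approach (a sketch only, not part of the original notebook) is to run the search on a schedule and append each batch to a CSV archive, de-duplicating on tweet id:
import os

def append_tweets_to_archive(new_df, path='nab_tweets.csv'):
    # Append newly collected tweets to a rolling CSV archive, dropping duplicates by tweet id
    if os.path.exists(path):
        archive = pd.read_csv(path, parse_dates=['created_at'])
        combined = pd.concat([archive, new_df], ignore_index=True)
    else:
        combined = new_df
    combined = combined.drop_duplicates(subset='id')
    combined.to_csv(path, index=False)
    return combined

# Run on a schedule (e.g. weekly): append_tweets_to_archive(df)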
Summarise most frequent words used
Above we've used hashtags as a way of summarising the topics that are driving positive or negative sentiment. A more general approach is to see which words appear most frequently. The NLTK library helps to structure the text and the WordCloud package provides a nice visual representation.
Preprocessing is required to remove 'stop words' (words too common to provide any meaningful information) as well as punctuation, numbers and links.
!conda install -c conda-forge wordcloud==1.4.1 --yes
from wordcloud import WordCloud, STOPWORDS
from sklearn.feature_extraction import text
my_additional_stop_word_list = ["amp", "bank","banking", "nab"]
stop_words = text.ENGLISH_STOP_WORDS.union(my_additional_stop_word_list)
def clean_tweets(tweet):
    user_removed = re.sub(r'(?<=^|(?<=[^a-zA-Z0-9-_\.]))@([A-Za-z]+[A-Za-z0-9-_]+)', '', tweet)
    link_removed = re.sub('https?://[A-Za-z0-9./]+', '', user_removed)
    number_removed = re.sub('[^a-zA-Z]', ' ', link_removed)
    lower_case_tweet = number_removed.lower()
    tok = WordPunctTokenizer()
    words = tok.tokenize(lower_case_tweet)
    clean_tweet = (' '.join(words)).strip()
    return clean_tweet
clean = lambda x: clean_tweets(x)
tweet_text = pd.DataFrame(df.text.apply(clean))
tweet_text.head()
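To sanity-check the cleaning, it helps to run it on a single made-up tweet (the text below is invented for illustration, not taken from the collected data):
# Handles, links, punctuation and numbers should all be stripped out
print(clean_tweets("@NAB why was I charged $40 twice?? https://t.co/abc123"))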
Now that we have a dataframe of pure tweet text, we can count the instances of each word using the CountVectorizer class from the Scikit-learn library:
cv = CountVectorizer(stop_words=stop_words)
data_cv = cv.fit_transform(tweet_text.text)
data_dtm = pd.DataFrame(data_cv.toarray(), columns=cv.get_feature_names())
data_dtm.index = tweet_text.index
data_dtm.head()
tweet_words = data_dtm.transpose()
tweet_words['Total']=tweet_words.sum(axis=1)
tweet_words.sort_values('Total', ascending=False).head(10)
We now have a sorted list of word frequencies, from most-used down to least-used. Let's write a function which turns a pandas series of tweets into a wordcloud, then take a look at a couple of wordcloud outputs: one for positive tweets and one for negative tweets only.
#function takes tweets in a pandas series and outputs a word cloud
def tweetcloud(tweets):
    # clean the tweets
    tweet_text = pd.DataFrame(tweets.apply(clean))
    # put all tweets into a single string in preparation for the wordcloud
    all_text = " ".join(list(tweet_text['text']))
    # instantiate a word cloud object
    wc = WordCloud(background_color='white', max_words=2000, stopwords=stop_words)
    # generate the word cloud
    wc.generate(all_text)
    # display the word cloud
    plt.imshow(wc, interpolation='bilinear')
    plt.axis('off')
    plt.show()
print('Top words from positive tweets:')
tweetcloud(df[df['positive']==1].text)
print('Top words from negative tweets:')
tweetcloud(df[df['negative']==1].text)