- Mon 05 August 2019
- Projects
- Tony Hall
- #Sentiment Tweepy Twitter Social Media
Customer sentiment towards Australia's 'big 4' banks is arguably at an all-time low. The Banking Royal Commission put the spotlight on banking practices that fell well short of customer expectations and regulatory requirements, including charging financial advice fees to customers who never received that service and charging insurance premiums on behalf of deceased customers. So when I wanted to develop an application to measure customer sentiment, it made sense to use one of these banks as the source of my data.
The application below analyses a week's worth of tweets relating to one of these banks (NAB) to get an understanding of changing sentiment over time, as well as the topics driving that sentiment. It uses the Tweepy library to collect tweets from the Twitter API, the NLTK Natural Language Toolkit to process the text, and the VADER library to score sentiment.
The first step is to install and import the required libraries and initialise credentials for the Tweepy API:
!pip install tweepy nltk google-cloud-language python-telegram-bot vaderSentiment
from tweepy import OAuthHandler
import tweepy
import pandas as pd
import re
from nltk.tokenize import WordPunctTokenizer
from sklearn.feature_extraction.text import CountVectorizer
from datetime import datetime, timedelta
ACCESS_TOKEN = "1156516068988338176-DgBj98sjSOd2at1x07q7mTob9aSrSC"
ACCESS_TOKEN_SECRET = "vsaYYKi8NjSxDdsLHM3dDTjKkltuJhqUgZkfLMk6ffBNk"
CONSUMER_KEY="XfLdl1oZFEguUDD1eJApOFhW8"
CONSUMER_SECRET="9JT3gEHA7g7yusWVA3NjKKIZhbmA7IOcovBc63DVOgOdmiaoy0"
Pull tweets into a dataframe
Next we call the API with the keyword @NAB to search for all tweets related to the bank. Twitter only allows a week or two of tweets to be extracted without a premium account, so we'll settle for a start date of 7 days prior to today and a maximum of 10,000 tweets.
auth=OAuthHandler(CONSUMER_KEY,CONSUMER_SECRET)
auth.set_access_token(ACCESS_TOKEN,ACCESS_TOKEN_SECRET)
api=tweepy.API(auth)
keyword = '@NAB -filter:retweets'
total_tweets = 10000
def search_tweets(keyword, total_tweets):
    today_datetime = datetime.today().now()
    start_datetime = today_datetime - timedelta(days=7)
    today_date = today_datetime.strftime('%Y-%m-%d')
    start_date = start_datetime.strftime('%Y-%m-%d')
    search_result = tweepy.Cursor(api.search,
                                  tweet_mode='extended',
                                  q=keyword,
                                  since=start_date,
                                  result_type='recent',
                                  lang='en').items(total_tweets)
    return search_result
tweets=search_tweets(keyword,total_tweets)
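Collecting up to 10,000 tweets in one go can also run into Twitter's API rate limits. Tweepy can pause and resume automatically if you ask it to; an optional tweak to the client setup above (not part of the original run; the notify flag applies to the Tweepy 3.x line used here):
# Optional: let Tweepy sleep through rate-limit windows instead of raising an error
api = tweepy.API(auth,
                 wait_on_rate_limit=True,         # pause until the rate limit resets
                 wait_on_rate_limit_notify=True)  # print a message while waiting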
We create a list of all attributes of each tweet:
tweetlist = [[tweet.id, tweet.full_text, tweet.retweet_count, tweet.favorite_count,
              tweet.source, tweet.created_at, tweet.user.id, tweet.user.screen_name,
              tweet.user.name, tweet.user.created_at, tweet.user.description,
              tweet.user.followers_count, tweet.user.friends_count,
              tweet.user.location, tweet.user.time_zone] for tweet in tweets]
Then pull that into a dataframe for analysis:
df = pd.DataFrame(data=tweetlist,
                  columns=['id', 'text', 'retweets', 'favorite_count', 'source',
                           'created_at', 'userid', 'username', 'name', 'user_joined',
                           'user_desc', 'user_followers', 'user_friends',
                           'user_location', 'user_timezone'])
print(df.shape)
df.head()
So we have 414 tweets in our dataset.
Add sentiment rating for each tweet using Vader
We've used the Vader library for sentiment analysis. It's simple to use and performs fairly well on the shorthand text you get from Twitter.
#Initialise the analyser object:
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
analyser = SentimentIntensityAnalyzer()
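polarity_scores returns a dictionary with four values: neg, neu and pos (roughly the proportions of the text scored negative, neutral and positive) plus compound, a single normalised score between -1 (most negative) and +1 (most positive). We only use compound below. A quick check on a made-up sentence (invented for illustration, not a tweet from the dataset) shows the shape of the output:
# Illustrative only - the sentence is invented, not drawn from the collected tweets
example = "Thanks @NAB, the new app is fantastic!"
print(analyser.polarity_scores(example))  # dict with 'neg', 'neu', 'pos' and 'compound' keys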
Let's see what the sentiment is for one random tweet:
print(df.text[26])
print('Sentiment =',analyser.polarity_scores(df.text[26]).get('compound'))
A compound score of 0.9136 is very positive. Run the sentiment analyser over the whole dataframe, then look at a few results:
sentiment_score = df.apply(lambda row: analyser.polarity_scores(row['text']).get('compound'), axis=1)
sentiment_score.head()
In the first five tweets we have a couple of negative tweets and three neutral tweets. We'll now add the sentiment as a column to the dataframe:
df['sentiment_score'] = sentiment_score
df.head()
It will be useful later, when we graph the sentiment for each day, to be able to get the hashtags relating to that day. Then, when we see a drop or a spike in sentiment, we can look at the topics that drove it via the hashtags.
#Function which takes a date and returns hashtags for that date
def gethashtags(date):
    # convert the date string to datetime in order to add one day, then convert back to string
    dt = datetime.strptime(date, "%Y-%m-%d")
    dt2 = dt + timedelta(days=1)
    date2 = datetime.strftime(dt2, "%Y-%m-%d")
    text = df[(df['created_at'] > date) & (df['created_at'] < date2)].text.to_string()
    hashtags = [word for word in text.split() if word.startswith('#')]
    return hashtags
Let's see what the hashtags are for yesterday as an example:
gethashtags('2019-08-21')
Graph mean daily sentiment
Let's now graph mean sentiment changes over the past week. First we'll group the tweets by day:
daily_df = df.resample('D', on='created_at').mean()
daily_df.head()
import matplotlib.pyplot as plt
import numpy as np
Then we can plot the graph using the MatPlotLib library, adding a label showing the hashtags relevant for that day:
plt.style.use('seaborn-whitegrid')
plt.figure(figsize=(20, 10))
plt.plot(daily_df.index, daily_df.sentiment_score)
plt.title("Tweet Sentiment - Mean daily sentiment")
#plot the hashtags associated with each day on the graph
for day in daily_df.index:
    daystring = day.strftime('%Y-%m-%d')
    plt.text(daystring, daily_df.loc[day]['sentiment_score'], gethashtags(daystring))
FinTech and #Mortgagestress relate to positive tweets regarding potential fintech solutions to the problem of mortgage stress.
The major negative tweets related to the Clydesdale Bank 'Tailored Business Loan scam', which has attracted negative media attention.
Graph total, positive and negative tweet counts
In addition to mean daily tweet sentiment, the ratio of positive to negative tweets may be interesting as it gives a sense of the volume of activity.
def sentiment_category(sentiment_score):
    if sentiment_score >= 0.05:
        return "positive"
    if (sentiment_score < 0.05) and (sentiment_score > -0.05):
        return "neutral"
    if sentiment_score <= -0.05:
        return "negative"
sentiment = df.apply(lambda row: sentiment_category(row['sentiment_score']),axis=1)
df['sentiment']=sentiment
df.head()
df_sentiment_onehot = pd.get_dummies(df['sentiment'])
df=df.join(df_sentiment_onehot)
daily_count_df = df[['created_at','negative','neutral','positive']].resample('D',on='created_at').sum()
plt.style.use('seaborn-whitegrid')
plt.figure(figsize=(20, 3))
plt.plot(daily_count_df.index, daily_count_df.negative, color = 'red')
plt.plot(daily_count_df.index, daily_count_df.positive, color = 'green')
plt.plot(daily_count_df.index, daily_count_df.neutral, color = 'blue')
plt.title("Tweet Sentiment - Count of sentiment type")
#plt.text('2019-08-02', 0.05,"EY minutes released")
plt.legend()
The graph above doesn't tell us a lot over a one-week period. However, it would likely be more compelling over a longer timeframe, such as six months; the relative impact of major events on Twitter traffic could be measured in this way.
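Because the standard search API only reaches back about a week, building that longer history means collecting as you go. One simple approach (a sketch only, not part of the original notebook) is to run the search on a schedule and append each batch to a CSV archive, de-duplicating on tweet id:
import os

def append_tweets_to_archive(new_df, path='nab_tweets.csv'):
    # Append newly collected tweets to a rolling CSV archive, dropping duplicates by tweet id
    if os.path.exists(path):
        archive = pd.read_csv(path, parse_dates=['created_at'])
        combined = pd.concat([archive, new_df], ignore_index=True)
    else:
        combined = new_df
    combined = combined.drop_duplicates(subset='id')
    combined.to_csv(path, index=False)
    return combined

# Run on a schedule (e.g. weekly): append_tweets_to_archive(df)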
Summarise most frequent words used
Above we've used hashtags as a way of summarising the topics that are driving positive or negative sentiment. A more general approach is to see which words appear most frequently. The NLTK library helps to structure the text and the WordCloud package provides a nice visual representation.
Preprocessing is required to remove 'stop words' (words too common to provide any meaningful information) as well as punctuation, numbers and links.
!conda install -c conda-forge wordcloud==1.4.1 --yes
from wordcloud import WordCloud, STOPWORDS
from sklearn.feature_extraction import text
my_additional_stop_word_list = ["amp", "bank","banking", "nab"]
stop_words = text.ENGLISH_STOP_WORDS.union(my_additional_stop_word_list)
def clean_tweets(tweet):
    user_removed = re.sub(r'(?<=^|(?<=[^a-zA-Z0-9-_\.]))@([A-Za-z]+[A-Za-z0-9-_]+)', '', tweet)
    link_removed = re.sub('https?://[A-Za-z0-9./]+', '', user_removed)
    number_removed = re.sub('[^a-zA-Z]', ' ', link_removed)
    lower_case_tweet = number_removed.lower()
    tok = WordPunctTokenizer()
    words = tok.tokenize(lower_case_tweet)
    clean_tweet = (' '.join(words)).strip()
    return clean_tweet
clean = lambda x: clean_tweets(x)
tweet_text = pd.DataFrame(df.text.apply(clean))
tweet_text.head()
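To sanity-check the cleaning, it helps to run it on a single made-up tweet (the text below is invented for illustration, not taken from the collected data):
# Handles, links, punctuation and numbers should all be stripped out
print(clean_tweets("@NAB why was I charged $40 twice?? https://t.co/abc123"))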
Now that we have a dataframe of pure tweet text, we can count the instances of each word using the CountVectorizer class from the Scikit-learn library:
cv = CountVectorizer(stop_words=stop_words)
data_cv = cv.fit_transform(tweet_text.text)
data_dtm = pd.DataFrame(data_cv.toarray(), columns=cv.get_feature_names())
data_dtm.index = tweet_text.index
data_dtm.head()
tweet_words = data_dtm.transpose()
tweet_words['Total']=tweet_words.sum(axis=1)
tweet_words.sort_values('Total', ascending=False).head(10)
We now have a sorted list of word frequencies, from most-used down to least-used. Let's write a function which turns a pandas series of tweets into a wordcloud, then take a look at a couple of wordcloud outputs: one for positive tweets and one for negative tweets only.
#function takes tweets in a pandas series and outputs a word cloud
def tweetcloud(tweets):
    # clean the tweets
    tweet_text = pd.DataFrame(tweets.apply(clean))
    # put all tweets into a single string in preparation for the wordcloud
    all_text = " ".join(list(tweet_text['text']))
    # instantiate a word cloud object
    wc = WordCloud(background_color='white', max_words=2000, stopwords=stop_words)
    # generate the word cloud
    wc.generate(all_text)
    # display the word cloud
    plt.imshow(wc, interpolation='bilinear')
    plt.axis('off')
    plt.show()
print('Top words from positive tweets:')
tweetcloud(df[df['positive']==1].text)
print('Top words from negative tweets:')
tweetcloud(df[df['negative']==1].text)