Dataist Dogma

Reflections and projects in Data Science, Machine Learning and AI

"A critical examination of the Dataist dogma is likely to be not only the greatest scientific challenge of the twenty-first century, but also the most urgent political and economic project." - Yuval Noah Harari, Homo Deus: A Brief History of the Future (2016)

Social media sentiment analysis


Customer sentiment towards Australia's 'big 4' banks is arguably at an all-time low. The Banking Royal Commission put the spotlight on banking practices that fell well short of customer expectations and regulatory requirements, including charging financial advice fees to customers who never received the service and charging insurance premiums to dead people. So when I wanted to develop an application to measure customer sentiment, it made sense to use one of these banks as the source of my data.

The application below analyses a week's worth of tweets relating to one of these banks (NAB) to get an understanding of changing sentiment over time, as well as the topics that are driving that sentiment. It uses the Tweepy library to collect tweets via the Twitter API, the NLTK Natural Language Toolkit to process the text and the VADER library to score sentiment.

The first step is to install and import the required libraries and initialise credentials for the Tweepy API:

In [1]:
!pip install tweepy nltk google-cloud-language python-telegram-bot vaderSentiment
Collecting tweepy
Collecting google-cloud-language
Collecting python-telegram-bot
Collecting vaderSentiment
...
Successfully installed cachetools-3.1.1 google-api-core-1.14.2 google-auth-1.6.3 google-cloud-language-1.3.0 googleapis-common-protos-1.6.0 oauthlib-3.1.0 pyasn1-0.4.6 pyasn1-modules-0.2.6 python-telegram-bot-11.1.0 requests-oauthlib-1.2.0 rsa-4.0 tweepy-3.8.0 vaderSentiment-3.2.1
In [ ]:
from tweepy import OAuthHandler
import tweepy
import pandas as pd
import re
from nltk.tokenize import WordPunctTokenizer
from sklearn.feature_extraction.text import CountVectorizer
from datetime import datetime, timedelta
In [2]:
ACCESS_TOKEN = "<your-access-token>"
ACCESS_TOKEN_SECRET = "<your-access-token-secret>"
CONSUMER_KEY = "<your-consumer-key>"
CONSUMER_SECRET = "<your-consumer-secret>"

Pull tweets into a dataframe

Next we call the API with the keyword @NAB to search for all tweets relating to the bank. Twitter only allows a week or two of tweets to be extracted without a premium account, so we'll settle for a start date of 7 days prior to today and a maximum of 10,000 tweets.

In [4]:
auth=OAuthHandler(CONSUMER_KEY,CONSUMER_SECRET)
auth.set_access_token(ACCESS_TOKEN,ACCESS_TOKEN_SECRET)
api=tweepy.API(auth)
In [36]:
keyword = '@NAB -filter:retweets'
total_tweets = 10000

def search_tweets(keyword, total_tweets):
    today_datetime = datetime.now()
    start_datetime = today_datetime - timedelta(days=7)
    today_date = today_datetime.strftime('%Y-%m-%d')
    start_date = start_datetime.strftime('%Y-%m-%d')
    search_result = tweepy.Cursor(api.search, 
                                  tweet_mode='extended',
                                  q=keyword, 
                                  since=start_date, 
                                  result_type='recent', 
                                  lang='en').items(total_tweets)
    return search_result

tweets=search_tweets(keyword,total_tweets)
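
As an aside, the standard search endpoint is rate limited (around 180 requests per 15-minute window), so a pull of up to 10,000 tweets can stall part-way through. Tweepy can back off automatically if the API object is constructed with its rate-limit flags; a minimal sketch, using keyword arguments available in tweepy 3.x:

api = tweepy.API(auth,
                 wait_on_rate_limit=True,         # sleep until the rate-limit window resets
                 wait_on_rate_limit_notify=True)  # print a notice while waiting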

We create a list of all attributes of each tweet:

tweetlist = [[tweet.id, tweet.full_text, tweet.retweet_count, tweet.favorite_count,
              tweet.source, tweet.created_at, tweet.user.id, tweet.user.screen_name,
              tweet.user.name, tweet.user.created_at, tweet.user.description,
              tweet.user.followers_count, tweet.user.friends_count,
              tweet.user.location, tweet.user.time_zone] for tweet in tweets]

Then pull that into a dataframe for analysis:

In [91]:
df = pd.DataFrame(data=tweetlist,
                  columns=['id', 'text', 'retweets', 'favorite_count', 'source',
                           'created_at', 'userid', 'username', 'name', 'user_joined',
                           'user_desc', 'user_followers', 'user_friends',
                           'user_location', 'user_timezone'])
In [92]:
print(df.shape)
df.head()
(414, 15)
Out[92]:
id text retweets favorite_count source created_at userid username name user_joined user_desc user_followers user_friends user_location user_timezone
0 1164515820254351360 AFL 2019: Sydney Swans likely to re-sign Qatar... 0 0 Twitter Web Client 2019-08-22 12:33:01 192467705 mj_lynch Martin Lynch 2010-09-19 06:50:56 Go Parra #AFLinCrisis 341 1352 Australia None
1 1164515778583912448 @anita_bonitanz @Bargey @TheBachelorAU @NAB No... 0 1 Twitter for iPhone 2019-08-22 12:32:51 17430829 pdub pdub 2008-11-17 00:49:48 [UL] MarketingMgr @LocalCoinSwap_ #AmoLixoElec... 1287 592 World Citizen None
2 1164515452359303168 AFL 2019 concussion: SCAT 5 test flawed says P... 0 0 Twitter Web Client 2019-08-22 12:31:34 192467705 mj_lynch Martin Lynch 2010-09-19 06:50:56 Go Parra #AFLinCrisis 341 1352 Australia None
3 1164510825794510850 @NAB And NAB your money while we do it ... 0 0 Twitter for iPhone 2019-08-22 12:13:11 419410058 SerenaGuild Serenaz 2011-11-23 09:46:05 Typical Scorpio so beware my sting ... semi re... 291 413 New South Wales, Australia None
4 1164500134123753473 @davidduffycybg @clydesdalebank @PhilipChronic... 1 1 Twitter for iPhone 2019-08-22 11:30:41 1127245088864731136 winwin91518639 winwin 2019-05-11 16:12:27 Rise like Lions after slumber In unvanquishabl... 116 647 None

So we have 414 tweets in our data set.

Add sentiment rating for each tweet using VADER

We've used the VADER library for sentiment analysis. It's simple to use and performs fairly well on the shorthand text you get from Twitter.

In [93]:
#Initialise the analyser object:
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
analyser = SentimentIntensityAnalyzer()

Let's see what the sentiment is for one random tweet:

In [94]:
print(df.text[26])
print('Sentiment =',analyser.polarity_scores(df.text[26]).get('compound'))
The @CSIRO Natural Capital Survey is giving you the chance to win prizes from @RydaDotCom and @AgDataSoftware: https://t.co/6EH04FLBPr @NAB #qldag #agchatoz
Sentiment = 0.9136

0.9136 is very positive. Run the sentiment analyser over the whole dataframe, then look at a few results:

In [95]:
sentiment_score = df.apply(lambda row: analyser.polarity_scores(row['text']).get('compound'), axis=1)
sentiment_score.head()
Out[95]:
0    0.0000
1   -0.0258
2    0.0000
3    0.0000
4   -0.2144
dtype: float64
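
As a side note, the compound value used here is VADER's normalised summary score in the range [-1, 1]; the analyser also returns the raw negative/neutral/positive proportions, which can be inspected directly:

# polarity_scores returns a dict with 'neg', 'neu', 'pos' and 'compound' keys
print(analyser.polarity_scores(df.text[26]))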

Looking at the first five scores, we have two negative tweets and three neutral ones. We'll now add the sentiment as a column to the dataframe:

In [96]:
df['sentiment_score'] = sentiment_score
df.head()
Out[96]:
id text retweets favorite_count source created_at userid username name user_joined user_desc user_followers user_friends user_location user_timezone sentiment_score
0 1164515820254351360 AFL 2019: Sydney Swans likely to re-sign Qatar... 0 0 Twitter Web Client 2019-08-22 12:33:01 192467705 mj_lynch Martin Lynch 2010-09-19 06:50:56 Go Parra #AFLinCrisis 341 1352 Australia None 0.0000
1 1164515778583912448 @anita_bonitanz @Bargey @TheBachelorAU @NAB No... 0 1 Twitter for iPhone 2019-08-22 12:32:51 17430829 pdub pdub 2008-11-17 00:49:48 [UL] MarketingMgr @LocalCoinSwap_ #AmoLixoElec... 1287 592 World Citizen None -0.0258
2 1164515452359303168 AFL 2019 concussion: SCAT 5 test flawed says P... 0 0 Twitter Web Client 2019-08-22 12:31:34 192467705 mj_lynch Martin Lynch 2010-09-19 06:50:56 Go Parra #AFLinCrisis 341 1352 Australia None 0.0000
3 1164510825794510850 @NAB And NAB your money while we do it ... 0 0 Twitter for iPhone 2019-08-22 12:13:11 419410058 SerenaGuild Serenaz 2011-11-23 09:46:05 Typical Scorpio so beware my sting ... semi re... 291 413 New South Wales, Australia None 0.0000
4 1164500134123753473 @davidduffycybg @clydesdalebank @PhilipChronic... 1 1 Twitter for iPhone 2019-08-22 11:30:41 1127245088864731136 winwin91518639 winwin 2019-05-11 16:12:27 Rise like Lions after slumber In unvanquishabl... 116 647 None -0.2144

It will be useful later, when we graph sentiment for each day, to be able to retrieve the hashtags for that day. Then when we see a drop or a spike in sentiment we can look at the topics that drove it via the hashtags.

In [97]:
#Function which takes a date and returns hashtags for that date
def gethashtags(date):
    
    #convert the date string to datetime in order to add one day, then convert back to string
    dt = datetime.strptime(date, "%Y-%m-%d")
    dt2 = dt + timedelta(days=1)
    date2 = datetime.strftime(dt2, "%Y-%m-%d")
    
    text = df[(df['created_at'] > date) & (df['created_at'] < date2)].text.to_string()
    hashtags = [word for word in text.split() if word[0] == '#']
    return hashtags

Let's see what the hashtags are for yesterday as an example:

In [98]:
gethashtags('2019-08-21')
Out[98]:
['#Australia', '#saving', '#money', '#DUBAILAGOON']
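
If we wanted to rank topics rather than just list them, hashtag frequency could be tallied with collections.Counter. A small sketch built on gethashtags (the helper name top_hashtags is my own):

from collections import Counter

def top_hashtags(date, n=5):
    # count hashtag occurrences for the given day, most common first
    return Counter(tag.lower() for tag in gethashtags(date)).most_common(n)

top_hashtags('2019-08-21')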

Graph mean daily sentiment

Let's now graph how mean sentiment changed over the past week. First we'll group the tweets by day:

daily_df = df.resample('D', on='created_at').mean()
daily_df.head()

In [99]:
import matplotlib.pyplot as plt
import numpy as np

Then we can plot the graph using the Matplotlib library, adding a label showing the hashtags relevant to each day:

In [100]:
plt.style.use('seaborn-whitegrid')
plt.figure(figsize=(20, 10))
plt.plot(daily_df.index, daily_df.sentiment_score)
plt.title("Tweet Sentiment - Mean daily sentiment")

#plot the hashtags associated with each day on the graph
for day in daily_df.index:
    daystring = day.strftime('%Y-%m-%d')
    plt.text(daystring, daily_df.loc[day]['sentiment_score'],gethashtags(daystring))

The hashtags #FinTech and #Mortgagestress relate to positive tweets about potential fintech solutions to the problem of mortgage stress.

The major negative tweets related to the Clydesdale Bank 'Tailored Business Loan scam', which has attracted negative media attention.

Graph total, positive and negative tweet counts

In addition to mean daily tweet sentiment, the ratio of positive to negative tweets may be interesting, as it gives a sense of the volume of activity.

In [101]:
def sentiment_category(sentiment_score):
    # standard VADER thresholds: compound >= 0.05 is positive, <= -0.05 is negative
    if sentiment_score >= 0.05:
        return "positive"
    elif sentiment_score <= -0.05:
        return "negative"
    else:
        return "neutral"
    
In [102]:
sentiment = df.apply(lambda row: sentiment_category(row['sentiment_score']),axis=1)
df['sentiment']=sentiment
df.head()
Out[102]:
id text retweets favorite_count source created_at userid username name user_joined user_desc user_followers user_friends user_location user_timezone sentiment_score sentiment
0 1164515820254351360 AFL 2019: Sydney Swans likely to re-sign Qatar... 0 0 Twitter Web Client 2019-08-22 12:33:01 192467705 mj_lynch Martin Lynch 2010-09-19 06:50:56 Go Parra #AFLinCrisis 341 1352 Australia None 0.0000 neutral
1 1164515778583912448 @anita_bonitanz @Bargey @TheBachelorAU @NAB No... 0 1 Twitter for iPhone 2019-08-22 12:32:51 17430829 pdub pdub 2008-11-17 00:49:48 [UL] MarketingMgr @LocalCoinSwap_ #AmoLixoElec... 1287 592 World Citizen None -0.0258 neutral
2 1164515452359303168 AFL 2019 concussion: SCAT 5 test flawed says P... 0 0 Twitter Web Client 2019-08-22 12:31:34 192467705 mj_lynch Martin Lynch 2010-09-19 06:50:56 Go Parra #AFLinCrisis 341 1352 Australia None 0.0000 neutral
3 1164510825794510850 @NAB And NAB your money while we do it ... 0 0 Twitter for iPhone 2019-08-22 12:13:11 419410058 SerenaGuild Serenaz 2011-11-23 09:46:05 Typical Scorpio so beware my sting ... semi re... 291 413 New South Wales, Australia None 0.0000 neutral
4 1164500134123753473 @davidduffycybg @clydesdalebank @PhilipChronic... 1 1 Twitter for iPhone 2019-08-22 11:30:41 1127245088864731136 winwin91518639 winwin 2019-05-11 16:12:27 Rise like Lions after slumber In unvanquishabl... 116 647 None -0.2144 negative
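
As an alternative to the row-wise apply above, the same three-way bucketing can be done in a single vectorised call with pandas.cut; a sketch (note the boundary handling at exactly ±0.05 differs marginally from the function above):

# equivalent vectorised bucketing of compound scores into sentiment categories
df['sentiment'] = pd.cut(df['sentiment_score'],
                         bins=[-1.0, -0.05, 0.05, 1.0],
                         labels=['negative', 'neutral', 'positive'])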
In [103]:
df_sentiment_onehot = pd.get_dummies(df['sentiment'])
df=df.join(df_sentiment_onehot)
In [104]:
daily_count_df = df[['created_at','negative','neutral','positive']].resample('D',on='created_at').sum()
In [105]:
plt.style.use('seaborn-whitegrid')
plt.figure(figsize=(20, 3))
plt.plot(daily_count_df.index, daily_count_df.negative, color='red', label='negative')
plt.plot(daily_count_df.index, daily_count_df.positive, color='green', label='positive')
plt.plot(daily_count_df.index, daily_count_df.neutral, color='blue', label='neutral')
plt.title("Tweet Sentiment - Count of sentiment type")
plt.legend()
Out[105]:
<matplotlib.legend.Legend at 0x7fb19bbee208>

The graph above doesn't tell us a lot over a one-week time period. However, it would likely be more compelling over a longer timeframe, such as six months: the relative impact of major events on Twitter traffic could be measured this way.
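
With a longer history, one option would be to smooth the daily counts with a rolling window and annotate known event dates. A hypothetical sketch (the event date is a placeholder, not a real event from the data):

# hypothetical: 7-day rolling mean of daily counts, with a placeholder event marker
smoothed = daily_count_df.rolling(window=7, min_periods=1).mean()
plt.figure(figsize=(20, 3))
plt.plot(smoothed.index, smoothed.negative, color='red', label='negative (7-day mean)')
plt.plot(smoothed.index, smoothed.positive, color='green', label='positive (7-day mean)')
plt.axvline(pd.Timestamp('2019-08-19'), linestyle='--', color='grey')  # placeholder event date
plt.legend()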

Summarise most frequent words used

Above we've used hashtags as a way of summarising the topics that are driving positive or negative sentiment. A more general approach is to see which words appear most frequently. The NLTK library helps to structure the text and the WordCloud package provides a nice visual representation.

Preprocessing is required to remove 'stop words' (words too common to provide any meaningful information) as well as punctuation, numbers and links.

In [ ]:
!conda install -c conda-forge wordcloud==1.4.1 --yes
from wordcloud import WordCloud, STOPWORDS
In [106]:
from sklearn.feature_extraction import text 
my_additional_stop_word_list = ["amp", "bank","banking", "nab"]
stop_words = text.ENGLISH_STOP_WORDS.union(my_additional_stop_word_list)
In [56]:
def clean_tweets(tweet):
    # strip @mentions, links, numbers and punctuation, then lower-case and tokenise
    user_removed = re.sub(r'(?<=^|(?<=[^a-zA-Z0-9-_\.]))@([A-Za-z]+[A-Za-z0-9-_]+)', '', tweet)
    link_removed = re.sub('https?://[A-Za-z0-9./]+', '', user_removed)
    number_removed = re.sub('[^a-zA-Z]', ' ', link_removed)
    lower_case_tweet = number_removed.lower()
    tok = WordPunctTokenizer()
    words = tok.tokenize(lower_case_tweet)
    clean_tweet = (' '.join(words)).strip()
    return clean_tweet

clean = clean_tweets
In [107]:
tweet_text = pd.DataFrame(df.text.apply(clean))
tweet_text.head()
Out[107]:
text
0 afl sydney swans likely to re sign qatar airwa...
1 no i m certain on the footage they showed they...
2 afl concussion scat test flawed says peter jes...
3 and nab your money while we do it
4 stealing customers assets estates is not busin...

Now that we have a dataframe of cleaned tweet text, we can count the occurrences of each word using the CountVectorizer class from the scikit-learn library:

In [109]:
cv = CountVectorizer(stop_words=stop_words)
data_cv = cv.fit_transform(tweet_text.text)
data_dtm = pd.DataFrame(data_cv.toarray(), columns=cv.get_feature_names())
data_dtm.index = tweet_text.index
data_dtm.head()
Out[109]:
aag aap abandon abb abc abide ability able absolve abuse ... yields youl young youngleaders yourwealth youthias zealand zelman zero zulfi
0 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
1 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
2 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
3 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
4 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0

5 rows × 2205 columns

In [114]:
tweet_words = data_dtm.transpose()
tweet_words['Total']=tweet_words.sum(axis=1)
tweet_words.sort_values('Total', ascending=False).head(10)
Out[114]:
0 1 2 3 4 5 6 7 8 9 ... 405 406 407 408 409 410 411 412 413 Total
just 0 1 0 0 0 0 0 0 0 0 ... 1 0 1 1 0 1 1 0 1 25
afl 2 0 2 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 24
account 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 1 1 0 22
money 0 0 0 1 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 20
app 0 0 0 0 0 0 0 2 0 0 ... 0 0 0 0 0 0 0 0 0 18
don 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 18
like 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 17
going 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 17
fraud 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 17
banks 0 0 0 0 0 0 1 0 0 0 ... 0 0 0 0 0 0 0 0 0 17

10 rows × 415 columns
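
Incidentally, the same top-ten list can be read straight off the document-term matrix with a column sum, avoiding the transpose:

# each column of data_dtm is a word, so summing columns gives total counts
data_dtm.sum().sort_values(ascending=False).head(10)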

We now have a sorted list of word frequencies, from most-used down to least-used. Let's write a function which turns a pandas series of tweets into a wordcloud, then take a look at two wordcloud outputs: one for positive tweets only and one for negative tweets only:

In [122]:
#function takes tweets in pandas series and outputs a word cloud
def tweetcloud(tweets):
    #clean the tweets
    tweet_text = pd.DataFrame(tweets.apply(clean))
    #put all tweets into a string in preparation for wordcloud
    all_text = " ".join(list(tweet_text['text']))
    
    
    # instantiate a word cloud object
    wc = WordCloud(background_color='white', max_words=2000, stopwords = stop_words)

    # generate the word cloud
    wc.generate(all_text)
    
    # display the word cloud
    plt.imshow(wc, interpolation='bilinear')
    plt.axis('off')
    plt.show()
In [125]:
print('Top words from positive tweets:')
tweetcloud(df[df['positive']==1].text)
Top words from positive tweets:
In [126]:
print('Top words from negative tweets:')
tweetcloud(df[df['negative']==1].text)
Top words from negative tweets:
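
Finally, to build towards the six-month view suggested earlier, each weekly run could be persisted and appended to a single file; a minimal sketch (the filename is arbitrary):

import os

csvfile = 'nab_tweets.csv'  # arbitrary filename
# append this run's scored tweets; write the header only if the file doesn't exist yet
df.to_csv(csvfile, mode='a', header=not os.path.exists(csvfile), index=False)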