Creating a Wordcloud with the Twitter API in RStudio

May 14, 2018 By Pascal Schmidt Other R

In this blog post, we show you how to create a Twitter wordcloud in R.

 

Connecting with the Twitter API

In order to get data from Twitter into R, we need the API key, the API secret, the access token, and the access token secret. So first, sign up for a Twitter account if you haven’t already and make sure that your mobile phone number is associated with your account. Then go to https://apps.twitter.com/ and sign in. Afterwards, click on “Create New App” and fill out the form.

In order to fill out the form, you need your own website. You can create a free WordPress website, for example. Click on the “Yes, I agree” box and then click on “Create your Twitter application”. Click on the “Permissions” tab and change the permission to “Read, Write and Access direct messages”. Click on the “Keys and Access Tokens” tab to generate your Consumer Key (API key), Consumer Secret (API secret), Access Token, and Access Token Secret…

… and you are done. Now back to R.

library(twitteR)
library(ROAuth)
library(stringr)
library(wordcloud)

consumer_key = "your consumer key" 
consumer_secret = "your consumer secret" 
access_token = "your access token" 
access_secret = "your access secret" 
setup_twitter_oauth(consumer_key, 
                    consumer_secret, 
                    access_token, 
                    access_secret) 

## [1] "Using direct authentication"
# If R asks whether to cache the OAuth credentials in a local file, enter 2 (No) in the R console.

Put in your own consumer key, consumer secret, access token, and access secret, and you are ready to analyse Twitter data!
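If you would rather not keep the keys in the script itself, one option is to read them from environment variables. The following is only a minimal sketch and not part of the original workflow: the helper get_key and the variable names such as TWITTER_CONSUMER_KEY are assumptions made for illustration.

```r
# Hypothetical helper: read one credential from an environment variable
# and fail early with a clear message if it is not set.
get_key = function(name) {
  value = Sys.getenv(name, unset = "")
  if (value == "") {
    stop("Environment variable ", name, " is not set")
  }
  value
}

# Usage (the variable names are assumptions; set them e.g. in ~/.Renviron):
# setup_twitter_oauth(get_key("TWITTER_CONSUMER_KEY"),
#                     get_key("TWITTER_CONSUMER_SECRET"),
#                     get_key("TWITTER_ACCESS_TOKEN"),
#                     get_key("TWITTER_ACCESS_SECRET"))
```

This way the script can be shared without exposing the credentials.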

 

Pulling Data From Twitter

Tweets = searchTwitter(searchString = "worldcup -filter:retweets",
                       n = 2000,
                       lang = "en")

In the code above, we search for tweets that include the word “worldcup”. The -filter:retweets operator excludes retweets from our data, n = 2000 requests up to 2000 tweets, and lang = "en" restricts the results to English.

TweetsDF = twListToDF(Tweets)

The line above converts the list of tweets we got into a data frame.

text = gsub("http[s]?://[[:alnum:].\\/]+", "", TweetsDF$text)
# remove URLs

text = gsub("(?!(#|@))[[:punct:]]", "", text, perl = TRUE)
# remove all punctuation except # and @

text = gsub("[[:cntrl:]]", "", text)
# remove control characters
words = unlist(strsplit(text, " "))
# split into words

hashtags = grep("^#\\w+", unlist(strsplit(text, " ")), value = TRUE)
# ^ matches the start of a string
# \\w+ matches one or more word characters

handles = grep("^@\\w+", unlist(strsplit(text, " ")), value = TRUE)

hashtags.freq = table(hashtags)
handle.freq = table(handles)

After this data cleaning, we are ready to create one wordcloud with only hashtags and one with only handles.
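To see what the cleaning steps do, it can help to run them on a couple of made-up tweets first. The sketch below uses the same regular expressions as above; the two sample tweets and the names sample_tweets, clean, tokens, sample_hashtags, and sample_handles are invented for illustration.

```r
# Two made-up tweets, used only for illustration
sample_tweets = c("Can't wait for the #WorldCup! https://t.co/abc123 @FIFAcom",
                  "#WorldCup 2018: Go @England, go! #ThreeLions")

# Same cleaning steps as above
clean = gsub("http[s]?://[[:alnum:].\\/]+", "", sample_tweets)   # remove URLs
clean = gsub("(?!(#|@))[[:punct:]]", "", clean, perl = TRUE)     # keep # and @
clean = gsub("[[:cntrl:]]", "", clean)                           # control characters

# Split into tokens and keep those that start with # or @
tokens = unlist(strsplit(clean, " "))
sample_hashtags = grep("^#\\w+", tokens, value = TRUE)
sample_handles = grep("^@\\w+", tokens, value = TRUE)

table(sample_hashtags)
table(sample_handles)
```

Here #WorldCup is counted twice and #ThreeLions once, and each of the two handles is counted once. Note that an underscore inside a handle or hashtag counts as punctuation in this step and is stripped as well.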

 

Creating Wordclouds

wordcloud(names(hashtags.freq), 
          hashtags.freq, min.freq = 4, 
          colors = rainbow(8), random.order = FALSE)

Twitter wordcloud from worldcup tweets (hashtags)

wordcloud(names(handle.freq), 
          handle.freq, 
          min.freq = 4, 
          colors = rainbow(8), 
          random.order = FALSE)

Worldcup Twitter wordcloud (handles)

text = gsub("@\\w+", " ", TweetsDF$text)
# removes all the handles in the tweets

text = gsub("(?!')[[:punct:]]", "", text, perl = T)
# removes all the punctuation except apostrophe

text = gsub("[[:cntrl:]]", "", text)
# removes all the control characters, like \n or \r

text = gsub("[[:digit:]]", "", text)
# removes numbers

text = gsub("http\\w+", "", text)
# removes URL links (with the punctuation already stripped, a link is a single run of characters starting with http)

text = gsub("[ \t]{2,}", " ", text)
text = gsub("^\\s+|\\s+$", "", text)
# remove unnecessary spaces

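words = strsplit(text, " ")
# split into words
words = unlist(words)
words = tolower(words)
# lower-case the words, because the stop word list is lower case
words = words[!words %in% tm::stopwords(kind = "english")]
# remove English stop words
wordcloud(names(table(words)), table(words), min.freq = 15, colors = rainbow(8))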

Twitter wordcloud


We used the tm package to exclude stop words from our tweets before we created the wordcloud.
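Before plotting, it can be useful to look at the word counts directly, for example to pick a sensible min.freq. The following sketch bundles the cleaning steps from this post into one function and lists the most frequent words. The function name clean_tweet_text, the two sample tweets, and the tiny stand-in stop word list are assumptions made for illustration; in the real analysis you would use tm::stopwords as above. The sketch also lower-cases the words before filtering, so that capitalised words such as The match the lower-case stop word list.

```r
# Hypothetical helper that bundles the cleaning steps from this post
clean_tweet_text = function(x) {
  x = gsub("@\\w+", " ", x)                          # handles
  x = gsub("(?!')[[:punct:]]", "", x, perl = TRUE)   # punctuation except '
  x = gsub("[[:cntrl:]]", "", x)                     # control characters
  x = gsub("[[:digit:]]", "", x)                     # numbers
  x = gsub("http\\w+", "", x)                        # what is left of URLs
  x = gsub("[ \t]{2,}", " ", x)                      # repeated spaces
  gsub("^\\s+|\\s+$", "", x)                         # leading/trailing spaces
}

# Two made-up tweets and a tiny stand-in stop word list (assumptions)
sample_tweets = c("@FIFAcom The #WorldCup starts in 3 days! https://t.co/abc123",
                  "The worldcup final is in Moscow")
sample_stopwords = c("the", "in", "is")

# Clean, split into words, lower-case, and drop the stop words
sample_words = tolower(unlist(strsplit(clean_tweet_text(sample_tweets), " ")))
sample_words = sample_words[!sample_words %in% sample_stopwords]

# Most frequent words first
word_freq = sort(table(sample_words), decreasing = TRUE)
head(word_freq, 10)
```

With the two sample tweets, worldcup comes out on top with a count of 2. On the real data, the same table shows which words would pass a given min.freq threshold.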
