We have done a popularity comparison using tweets from Twitter. Now we will try something interesting: we will create a word cloud from those tweets. This is also an example of text mining using R.
As with the popularity comparison, we need to follow the same steps to connect R with Twitter and fetch the data.
Refer to that post and follow it as-is up to step 3, because in this task too we take tweets as input and create a word cloud by analyzing them.
Packages to be used:
Install and include these libraries in your workspace, as we will need all of them to create the word cloud:
“bitops”, “RCurl”, “RJSONIO”, “twitteR”, “ROAuth”, “tm”, “stringr”, “wordcloud”
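A quick way to install and load all of these in one go (a minimal sketch; the package names are exactly those listed above):

```r
# Install (once) and load every package needed for the word cloud task
packages <- c("bitops", "RCurl", "RJSONIO", "twitteR", "ROAuth",
              "tm", "stringr", "wordcloud")
install.packages(packages)                        # one-time installation
lapply(packages, library, character.only = TRUE)  # load into the workspace
```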
Next Steps to be followed:
Test your connectivity:
We have already created the TweetFrame() function, which takes a search term and a maximum tweet limit and returns a dataframe sorted by arrival time.
Use this function to get tweets for the desired hashtag. For example, I have done this for Samsung Mobiles.
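For reference, a minimal sketch of how this step might look (the actual TweetFrame() definition is in the earlier post; the hashtag here is just an illustrative example):

```r
library(twitteR)

# Sketch of TweetFrame(), as described in the earlier post: fetch up to
# maxTweets tweets matching searchTerm and return a dataframe sorted by
# arrival time.
TweetFrame <- function(searchTerm, maxTweets) {
  tweetList <- searchTwitter(searchTerm, n = maxTweets)
  tweetDF <- do.call("rbind", lapply(tweetList, as.data.frame))
  # order by the "created" timestamp, earliest first
  tweetDF[order(as.integer(tweetDF$created), decreasing = FALSE), ]
}

tweets <- TweetFrame("#SamsungMobile", 100)  # example hashtag; use your own
```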
But these tweets will contain a lot of junk that we do not want to show in the word cloud, so we need to filter the data. To accomplish this, we have authored another function that strips out extra spaces, removes all URL strings, takes out the retweet header if one exists, removes hashtags, and eliminates references to other people’s tweet handles. All of these transformations require string-handling functions that reside in the stringr package. I have created one function, cleanTweets(), to perform all of these tasks.
The str_replace_all() function takes three arguments: the first is the input, a vector of character strings (we use the temporary data object “tweets” as both the source of the text data and its destination); the second is the regular expression to match; and the third is the string to use to replace the matches, in this case the empty string, signified by two double quotes with nothing between them. The regular expression in this case is the at sign, “@”, followed by zero or more uppercase and lowercase letters.
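Putting these transformations together, cleanTweets() might look like the following sketch. The exact regular expressions are assumptions on my part (only the @-handle pattern is spelled out above); adjust them to your own data:

```r
library(stringr)

# Sketch of cleanTweets(): strip the junk described above from a
# character vector of raw tweet text
cleanTweets <- function(tweets) {
  tweets <- str_replace_all(tweets, "http\\S+", "")     # URL strings
  tweets <- str_replace(tweets, "RT @[a-zA-Z]*: ", "")  # retweet header
  tweets <- str_replace_all(tweets, "#[a-zA-Z]*", "")   # hashtags
  tweets <- str_replace_all(tweets, "@[a-zA-Z]*", "")   # other handles
  tweets <- str_replace_all(tweets, " {2,}", " ")       # extra spaces
  tweets
}
```

Note that the retweet header is removed before the standalone handles, so that the “RT @user: ” prefix is matched as a whole.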
Use this cleanTweets() function on only the text part of the tweets.
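That is, pass only the text column of the tweet dataframe (assuming the dataframe is named tweets, as in the fetch step):

```r
# Clean only the tweet text, leaving the rest of the dataframe untouched
clean_tweet <- cleanTweets(tweets$text)
```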
Now we have the set of clean tweets in the clean_tweet vector, so we can start with text mining. Text mining refers to the practice of extracting useful analytic information from corpora of text (corpora is the plural of corpus). Again, I have created a function for this.
We “coerce” our clean_tweet vector into a custom “Class” provided by the tm package, called a “Corpus,” storing the result in a new data object called “tweetCorpus.” The last four statements use the tm_map() function: we provide tweetCorpus as the input data, along with a command that performs a transformation on the corpus. We apply four transformations here: first making all of the letters lowercase, then removing the punctuation, then taking out the so-called “stop” words, and finally converting the result to a plain text document.
Now, what are stop words? Words such as “the,” “a,” and “at” appear so commonly in so many different parts of the text that they are useless for differentiating between documents, so we need to filter out these commonly used words, which are known as stop words.
Use this function on the clean_tweet vector:
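A sketch of that function and its use (I am calling it make_corpus() here purely for illustration; note that newer versions of tm may require wrapping tolower in content_transformer()):

```r
library(tm)

# Coerce the clean tweet vector into a tm Corpus and apply the four
# transformations: lowercase, strip punctuation, drop stop words, and
# convert back to plain text documents
make_corpus <- function(clean_tweet) {
  tweetCorpus <- Corpus(VectorSource(clean_tweet))
  tweetCorpus <- tm_map(tweetCorpus, tolower)
  tweetCorpus <- tm_map(tweetCorpus, removePunctuation)
  tweetCorpus <- tm_map(tweetCorpus, removeWords, stopwords("english"))
  tweetCorpus <- tm_map(tweetCorpus, PlainTextDocument)
  tweetCorpus
}

tweetCorpus <- make_corpus(clean_tweet)
```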
The final step is the creation of the word cloud.
At this point we have processed our corpus into a nice uniform “bag of words” that contains no capital letters, punctuation, or stop words. We now need to conduct a statistical analysis of the corpus. For this we create what is known as a “term-document matrix.” A term-document matrix, also sometimes called a document-term matrix, is a rectangular data structure with terms as the rows and documents as the columns (in other uses you may instead make the terms the columns and the documents the rows). The process of determining whether words go together in a compound word can be accomplished statistically, by seeing which words commonly go together, or it can be done with a dictionary. The statistics reported when we print tweetTDM at the command line give us an overview of the results: the TermDocumentMatrix() function extracted 105 different terms from the 100 tweets.
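Creating the matrix itself is a single call (the term and document counts you see will depend on your own tweets):

```r
library(tm)

# Terms become rows, individual tweets become columns
tweetTDM <- TermDocumentMatrix(tweetCorpus)
tweetTDM  # printing the object summarizes terms, documents, and sparsity
```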
The list of terms and frequencies must be sorted with the most frequent terms appearing first. The second statement of the make_cloud() function coerces our tweet data back into a plain data matrix so that we can sort it by frequency. The third statement calculates the sums across each row, which gives us the total frequency of a term across all of the different tweets/documents; we also sort the resulting values with the highest frequencies first. The result is a named list: each item of the list has a frequency, and the name of each item is the term to which that frequency applies. In the second-to-last command we extract the names from the named list and bind them together with the frequencies into a dataframe. The final statement creates the cloud, taking the term list and frequencies as input, and displays a nice graphic:
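A sketch of make_cloud() following the steps just described (the first statement, building the term-document matrix, is my assumption about what precedes the coercion step; the rest mirrors the description above):

```r
library(tm)
library(wordcloud)

# Sketch of make_cloud(): build the term-document matrix, sort terms by
# total frequency across all tweets, and draw the word cloud
make_cloud <- function(tweetCorpus) {
  tweetTDM <- TermDocumentMatrix(tweetCorpus)           # 1: build the TDM
  tdMatrix <- as.matrix(tweetTDM)                       # 2: plain data matrix
  sortedMatrix <- sort(rowSums(tdMatrix),
                       decreasing = TRUE)               # 3: row sums, highest first
  cloudFrame <- data.frame(word = names(sortedMatrix),  # 4: bind names and
                           freq = sortedMatrix)         #    frequencies together
  wordcloud(cloudFrame$word, cloudFrame$freq)           # 5: draw the cloud
}

make_cloud(tweetCorpus)
```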
It’s Diwali time, so why not do it on “Diwali”?