My last post covered web crawling and the extraction of reviews and ratings from Flipkart for the MOTO G (2nd generation) phone. I hope you have the files and R code saved on your system. If not, you can go through that post again; here is the link.
Creation of Term-Document matrix:
A document-term matrix or term-document matrix is a mathematical matrix that describes the frequency of terms that occur in a collection of documents. In a document-term matrix, rows correspond to documents in the collection and columns correspond to terms. A term-document matrix is its transpose: rows correspond to terms and columns to documents, which is the layout used by the R code in this post. There are various schemes for determining the value that each entry in the matrix should take. One such scheme is tf-idf.
What is TF-IDF?
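tf-idf, short for term frequency-inverse document frequency, is a numerical statistic intended to reflect how important a word is to a document in a collection or corpus. The score rises with the number of times a word appears in a document and is scaled down by the number of documents in the corpus that contain the word, so words that appear everywhere get a low weight.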
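To make the definition concrete, here is a minimal sketch in base R. It assumes one common variant of the formula (term count divided by document length, times log2 of the number of documents over the number of documents containing the term); other variants exist. The toy documents and all variable names are hypothetical and are not taken from the Flipkart data.

```r
# Hypothetical toy documents, already split into words (not the real reviews)
docs <- list(
  d1 = c("good", "phone", "good", "battery"),
  d2 = c("camera", "average"),
  d3 = c("battery", "camera", "good")
)

terms <- sort(unique(unlist(docs)))

# Term counts: one row per term, one column per document
counts <- sapply(docs, function(words) {
  tabulate(match(words, terms), nbins = length(terms))
})
rownames(counts) <- terms

# tf: count divided by the number of words in the document
tf <- sweep(counts, 2, colSums(counts), "/")

# idf: log2(number of documents / number of documents containing the term)
idf <- log2(length(docs) / rowSums(counts > 0))

# Each row of tf is multiplied by the idf of that row's term
tfidf <- tf * idf
print(round(tfidf, 3))
```

In this toy example "good" occurs twice in d1 and "phone" only once, yet "phone" gets the higher tf-idf score in d1 because it appears in no other document.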
I have created two term-document matrices (TDMs): one with the frequency of each term in each document, and the other with the tf-idf score of each term within each document.
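As a rough sketch of how two such matrices can be built with the tm package, the snippet below uses `TermDocumentMatrix` once with its default term-frequency weighting and once with `weightTfIdf`. The toy reviews and variable names are hypothetical stand-ins for the scraped reviews, and this is not necessarily the exact code used for the results in this post.

```r
library(tm)

# Hypothetical toy reviews standing in for the scraped Flipkart text
toy_reviews <- c("good phone good battery",
                 "camera is average",
                 "battery drains fast but camera is good")

toy_corpus <- VCorpus(VectorSource(toy_reviews))

# Raw term counts: terms as rows, documents as columns
tdm_freq <- TermDocumentMatrix(toy_corpus)

# Same layout, but each entry weighted by tf-idf
tdm_tfidf <- TermDocumentMatrix(toy_corpus,
                                control = list(weighting = weightTfIdf))

inspect(tdm_freq)
inspect(tdm_tfidf)
```

Note that by default tm drops words shorter than three characters, so "is" does not appear as a term in either matrix.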
Word Cloud Formation:
A word cloud is an image composed of the words used in a particular text or subject, in which the size of each word indicates its frequency or importance.
I have created a word cloud based on the total frequency of each term across the documents, and it comes out to be:
Words like “phone”, “battery”, “moto”, “good” and “camera” stand out in this cloud, so we can say that people used these words most often in their reviews. They talked a lot about the camera, battery, screen and display, that is, the hardware specifications. Words such as “performance”, “price”, “time” and “better” also appear, which suggests that reviewers expressed views on the phone's performance and value. The word “flipkart” stands out too; since this phone is only available on Flipkart, reviewers may have been talking about Flipkart's delivery process. The remaining words, displayed in the smallest font size, are also relevant, but their frequency counts are lower. From these words we can say that people compared this phone with similar Xiaomi and Samsung products.
R Code to do this:
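    # Creating the word cloud
    # Packages required: tm, wordcloud (brewer.pal comes from RColorBrewer, which wordcloud loads)
    library(tm)
    library(wordcloud)

    # `reviews` is the vector of review texts from the previous post
    corpus <- Corpus(VectorSource(reviews[1:100]))

    # Clean the text; tolower must be wrapped in content_transformer() in current tm versions
    corpus <- tm_map(corpus, content_transformer(tolower))
    corpus <- tm_map(corpus, removePunctuation)
    corpus <- tm_map(corpus, removeNumbers)
    corpus <- tm_map(corpus, removeWords, stopwords("en"))

    # Term-document matrix: terms as rows, documents as columns
    tdm <- TermDocumentMatrix(corpus)
    m <- as.matrix(tdm)

    # Total frequency of each term across all documents, highest first
    v <- sort(rowSums(m), decreasing = TRUE)
    d <- data.frame(words = names(v), freq = v)

    # The Dark2 palette has at most 8 colours
    wordcloud(d$words, d$freq, max.words = 300, colors = brewer.pal(8, "Dark2"),
              scale = c(3, 0.5), random.order = FALSE)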
This is only the chunk of code for this specific task; I will share the full code in a subsequent post.
In my next post, I will continue working on the reviews and do dimension reduction using LSA. Keep following: Text Analytics Part III.