Text Analytics Part I – Web Crawling using R

Flipkart, one of India’s most popular shopping sites, is widely used for checking the specifications of electronic goods, especially cell phones. Before buying a phone, people generally visit the site and look for reviews of the product they are planning to buy.

That’s why I have chosen this site for my analysis: I have tried to analyse the reviews and ratings given by customers of the black MOTO G (2nd gen), a product which is available ONLY on Flipkart.

Constraints:

I have considered reviews from the first 10 pages. Each page contains 10 reviews, so a total of 100 reviews were considered for the analysis. I haven’t ignored short reviews (fewer than 200 characters), as people may express their views in a single line as well.

Everything is done using RStudio.

Web Crawling:

A crawler is a program that retrieves and stores pages from the Web, generally used by Web search engines to index pages in their systems. A crawler often has to download hundreds of millions of pages in a short period of time and must constantly monitor and refresh the downloaded pages. In addition, a crawler should avoid putting too much pressure on the visited websites and on its own local network, because both are shared resources.
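The crawl code later in this post finds the next pages to visit by keeping only the anchors whose href contains the paging parameter. A minimal sketch of that filtering step, using hypothetical hrefs standing in for the anchors a parsed page would yield:

```r
# Hypothetical hrefs, standing in for anchors found on a parsed page;
# on Flipkart's review listing the paging links carry a "start=" parameter.
anchors <- c("/moto-g-2nd-gen/product-reviews/XYZ?start=10",
             "/moto-g-2nd-gen/p/xyz",
             "/moto-g-2nd-gen/product-reviews/XYZ?start=20")

# Keep only the pagination candidates, exactly as the crawl loop does
next_pages <- anchors[grep("start=", anchors, fixed = TRUE)]
length(next_pages)  # 2 of the 3 anchors are paging links
```

Deduplicating this list (as the crawl loop does with `unique`) prevents visiting the same page twice.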

Extracted Reviews:

Using this web crawling technique, I have extracted the reviews for analysis. Not all reviews are of the same length: as we observed on the site, some people write detailed reviews while others write a one-liner to express their views. Both are given the same weightage here.

Extracted Ratings:

Along with the reviews, I have captured the rating (out of 5) given by each customer who wrote a review. The ratings are analysed after the documents and terms extracted from the reviews have been analysed.
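The ratings are captured as the title strings of the star widgets, so they need to be converted to numbers before any numeric analysis. A small sketch, assuming each title looks like "4 stars" (this format is an assumption; inspect the scraped values first):

```r
# Hypothetical rating titles as they might be scraped from the fk-stars divs;
# the "N stars" format is an assumption, so check your real data first.
ratings <- c("5 stars", "3 stars", "1 star")

# Pull out the leading digit and convert it to numeric
ratings_num <- as.numeric(sub("^([0-9]).*", "\\1", ratings))
mean(ratings_num)  # average rating: 3
```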

[Screenshot: extracted reviews and ratings]

R Code to do this:

#web-crawling (requires the RCurl and XML packages)
library(RCurl)
library(XML)

init="http://www.flipkart.com/moto-g-2nd-gen/product-reviews/ITME3H4V4HKCFFCS?pid=MOBDYGZ6SHNB7RFC&type=top"
crawlcandidate="start="   #pagination links contain "start="
base="http://www.flipkart.com"
num=10                    #number of review pages to crawl

doclist=list()            #raw HTML of each fetched page
anchorlist=vector()       #pagination links found so far

j=0

while(j < num){
  
  if(j==0){
    doclist[[j+1]]=getURL(init)
  }else{
    doclist[[j+1]]=getURL(paste(base,anchorlist[j+1],sep=""))
  }
  doc=htmlParse(doclist[[j+1]])
  #collect all <a> hrefs and keep only the pagination candidates
  anchor=getNodeSet(doc,"//a")
  anchor=sapply(anchor,function(x)xmlGetAttr(x,"href"))
  anchor=anchor[grep(crawlcandidate,anchor)]
  anchorlist=c(anchorlist,anchor)
  anchorlist=unique(anchorlist)
  j=j+1
}

#extract only the review text and the star ratings from each page
reviews=c()
ratings=c()
for(i in 1:10){
  doc=htmlParse(doclist[[i]])
  #review text lives in <span class="review-text"> nodes
  reviewNodes=getNodeSet(doc,"//div/p/span[@class='review-text']")
  reviewText=sapply(reviewNodes,xmlValue)
  #the star rating is stored in the title attribute of the fk-stars div
  rateNodes=getNodeSet(doc,"//div[@class='fk-stars']")
  rates=sapply(rateNodes,function(x)xmlGetAttr(x,'title'))
  ratings=c(ratings,rates)
  reviews=c(reviews,reviewText)
}
View(reviews)
View(ratings)

#saving files
save(reviews,file="F:\\TextAnalytics\\Wordcloud\\MOTOG_Reviews.RData")
save(ratings,file="F:\\TextAnalytics\\Wordcloud\\MOTOG_Ratings.RData")

This is only the chunk of code for this specific task; I will share the full code in a subsequent post.

We now have our reviews and ratings files saved on our system. In my next post, I will build a Term Document Matrix and create a word cloud from the extracted terms. Till then, goodbye and keep learning!

Roma
