Text Analytics Part IV – Cluster Analysis on Terms and Documents using R

Cluster analysis or clustering is the task of grouping a set of objects in such a way that objects in the same group (called a cluster) are more similar (in some sense or another) to each other than to those in other groups (clusters).
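As a toy illustration of that idea (synthetic 2-D points, not the review data), k-means groups points so that each one sits closer to its own cluster's centroid than to the other's:

```r
#two well-separated synthetic groups of 20 points each
set.seed(42)
x=rbind(matrix(rnorm(40,mean=0),ncol=2),
        matrix(rnorm(40,mean=5),ncol=2))
km=kmeans(x,centers=2,nstart=20)
km$size   #both groups recovered: 20 and 20
```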

We perform cluster analysis separately for both terms and documents, after the dimension reduction (LSA) carried out in Text Analytics Part III – Dimension Reduction using R.

Cluster Analysis for “Terms”:

I started by finding the optimal number of clusters using hierarchical clustering with Ward's method. The dendrogram below suggests four candidate counts: 3, 5, 6 and 7. To choose among them, I ran k-means for each candidate and compared the sizes of the resulting clusters, which pointed to 6 as the best choice.

Cluster Dendrogram
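The selection step just described (run k-means at each candidate count from the dendrogram and compare the resulting cluster sizes) can be sketched as below; `lsa_m_tk` here is a synthetic stand-in, whereas in the real analysis it is the term matrix produced by the LSA step:

```r
set.seed(1)
lsa_m_tk=matrix(rnorm(300),ncol=3)   #stand-in for the real LSA term matrix
for(k in c(3,5,6,7)){                #candidate counts read off the dendrogram
  km=kmeans(scale(lsa_m_tk),centers=k,nstart=20)
  cat("k =",k,"| sizes:",paste(sort(km$size,decreasing=TRUE),collapse=" "),
      "| total within-SS:",round(km$tot.withinss,1),"\n")
}
#prefer the k whose clusters are reasonably balanced (6 in this analysis)
```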

Below is the cluster plot, showing which terms belong to which cluster. Except for cluster 4, the clusters overlap heavily, so it is hard to infer much from this plot.

Clustering Terms

The size of each cluster is:

Cluster Size

To look inside the clusters, we created a word cloud for each one:

Cluster WordCloud

Now the terms within each cluster are clearer to interpret.

Cluster 1: This cluster appears to compare the MOTO-G's features with phones like the Xiaomi Redmi and Nexus on aspects such as “touch”, “design”, “memory card slot” and “application updates”.

Cluster 2: This cluster is purely about the MOTO-G (2nd gen) phone: its specifications, battery backup, performance, availability on Flipkart, and so on.

Cluster 5: The MOTO-G compared with the Asus Zenfone on hardware aspects such as “touch”, “buttons”, “colors” and “models”.

Clusters 3, 4 and 6: nothing much can be inferred.

Cluster Analysis for “Documents”:

For documents, too, I started by finding the optimal number of clusters using hierarchical clustering with Ward's method. The dendrogram below suggests three candidate counts: 3, 4 and 5. Running k-means for each candidate and comparing the resulting cluster sizes pointed to 3 as the best choice.

Cluster Dendrogram Documents
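Beyond eyeballing cluster sizes, the average silhouette width from the `cluster` package offers a more formal check (higher is better). A sketch, again with a synthetic stand-in for `lsa_m_dk`:

```r
library(cluster)                     #for silhouette()
set.seed(2)
lsa_m_dk=matrix(rnorm(300),ncol=3)   #stand-in for the real LSA document matrix
for(k in c(3,4,5)){                  #candidate counts read off the dendrogram
  km=kmeans(scale(lsa_m_dk),centers=k,nstart=20)
  sil=silhouette(km$cluster,dist(scale(lsa_m_dk)))
  cat("k =",k,"| avg silhouette width:",round(mean(sil[,3]),3),"\n")
}
```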

Below is the cluster plot, showing which document belongs to which cluster. Unlike the term plot, this one is clear: it shows plainly which documents make up each cluster.

Clustering Documents

The size of each cluster is:

Cluster Size Documents

R Code to do this:

#LSA using SVD
#packages: RTextTools, tm, lsa, cluster, wordcloud, RColorBrewer
library(RTextTools);library(tm);library(lsa)
library(cluster);library(wordcloud);library(RColorBrewer)
tdm=create_matrix(reviews,removeNumbers=TRUE)
tdm_tfidf=weightTfIdf(tdm)
m=as.matrix(tdm)
m_tfidf=as.matrix(tdm_tfidf)

lsa_m=lsa(t(m),dims=dimcalc_share(share=0.8))
lsa_m_tk=as.data.frame(lsa_m$tk)
lsa_m_dk=as.data.frame(lsa_m$dk)
lsa_m_sk=as.data.frame(lsa_m$sk)

#pre-clustering: many small k-means clusters, whose centroids feed the hclust step
k150_m_tk=kmeans(scale(lsa_m_tk),centers=150,nstart=20)
c150_m_tk=aggregate(cbind(V1,V2,V3)~k150_m_tk$cluster,data=lsa_m_tk,FUN=mean)

#for documents, only 50 centers (there are fewer documents); k150_ prefix kept for symmetry
k150_m_dk=kmeans(scale(lsa_m_dk),centers=50,nstart=20)
c150_m_dk=aggregate(cbind(V1,V2,V3)~k150_m_dk$cluster,data=lsa_m_dk,FUN=mean)

#hierarchical clustering on the term pre-cluster centroids (c150_m_tk) to find the optimal number of clusters
d=dist(scale(c150_m_tk[,-1]))
h=hclust(d,method='ward.D')
plot(h,hang=-1)
rect.hclust(h,h=20,border="blue")
rect.hclust(h,h=12,border="cyan")
rect.hclust(h,h=15,border="red")
#3,5,6
#6
k6_m_tk=kmeans(scale(lsa_m_tk),centers=6,nstart=20)
c6_m_tk=aggregate(cbind(V1,V2,V3)~k6_m_tk$cluster,data=lsa_m_tk,FUN=mean)

#hierarchical clustering on the document pre-cluster centroids (c150_m_dk) to find the optimal number of clusters
d=dist(scale(c150_m_dk[,-1]))
h=hclust(d,method='ward.D')
plot(h,hang=-1)
rect.hclust(h,h=5,border="blue")
rect.hclust(h,h=15,border="red")
rect.hclust(h,h=8,border="green")
#3,4,5
#3

k3_m_dk=kmeans(scale(lsa_m_dk),centers=3,nstart=20)
c3_m_dk=aggregate(cbind(V1,V2,V3)~k3_m_dk$cluster,data=lsa_m_dk,FUN=mean)

clusplot(lsa_m_dk, k3_m_dk$cluster, color=TRUE, shade=TRUE, labels=2, lines=0)

#Result of clustering on lsa_m_tk
#keep term order aligned with k6_m_tk$cluster (do not sort here)
v=colSums(m)
wordFreq=data.frame(words=names(v),freq=v)
k6_1_m_tk=wordFreq[k6_m_tk$cluster==1,]
k6_2_m_tk=wordFreq[k6_m_tk$cluster==2,]
k6_3_m_tk=wordFreq[k6_m_tk$cluster==3,]
k6_4_m_tk=wordFreq[k6_m_tk$cluster==4,]
k6_5_m_tk=wordFreq[k6_m_tk$cluster==5,]
k6_6_m_tk=wordFreq[k6_m_tk$cluster==6,]

wordcloud(k6_1_m_tk$words,k6_1_m_tk$freq,max.words=154,colors=brewer.pal(8,"Dark2"),scale=c(3,0.5),random.order=F)
wordcloud(k6_2_m_tk$words,k6_2_m_tk$freq,max.words=300,colors=brewer.pal(8,"Dark2"),scale=c(3,0.5),random.order=F)
wordcloud(k6_3_m_tk$words,k6_3_m_tk$freq,max.words=39,colors=brewer.pal(8,"Dark2"),scale=c(3,0.5),random.order=F)
wordcloud(k6_4_m_tk$words,k6_4_m_tk$freq,max.words=3,colors=brewer.pal(8,"Dark2"),scale=c(3,0.5),random.order=F)
wordcloud(k6_5_m_tk$words,k6_5_m_tk$freq,max.words=99,colors=brewer.pal(8,"Dark2"),scale=c(3,0.5),random.order=F)
wordcloud(k6_6_m_tk$words,k6_6_m_tk$freq,max.words=32,colors=brewer.pal(8,"Dark2"),scale=c(3,0.5),random.order=F)

clusplot(lsa_m_tk, k6_m_tk$cluster, color=TRUE, shade=TRUE, labels=2, lines=0)

#lsa_m_tk
lsa_m_tk3=data.frame(words=rownames(lsa_m_tk),lsa_m_tk[,1:3])
plot(lsa_m_tk3$V1,lsa_m_tk3$V2)
text(lsa_m_tk3$V1,lsa_m_tk3$V2,label=lsa_m_tk3$words)

plot(lsa_m_tk3$V2,lsa_m_tk3$V3)
text(lsa_m_tk3$V2,lsa_m_tk3$V3,label=lsa_m_tk3$words)

plot(lsa_m_tk3$V1,lsa_m_tk3$V3)
text(lsa_m_tk3$V1,lsa_m_tk3$V3,label=lsa_m_tk3$words)

#Result of clustering on lsa_m_dk
#prepend a document id column (named before subsetting so the subsets carry it too)
lsa_m_dk=cbind(doc=1:nrow(lsa_m_dk),lsa_m_dk)
k3_1_m_dk=lsa_m_dk[k3_m_dk$cluster==1,]
k3_2_m_dk=lsa_m_dk[k3_m_dk$cluster==2,]
k3_3_m_dk=lsa_m_dk[k3_m_dk$cluster==3,]

plot(lsa_m_dk$V1,lsa_m_dk$V2)
text(lsa_m_dk$V1,lsa_m_dk$V2,label=lsa_m_dk$doc)

plot(lsa_m_dk$V2,lsa_m_dk$V3)
text(lsa_m_dk$V2,lsa_m_dk$V3,label=lsa_m_dk$doc)

plot(lsa_m_dk$V1,lsa_m_dk$V3)
text(lsa_m_dk$V1,lsa_m_dk$V3,label=lsa_m_dk$doc)

This is a chunk of the code for doing LSA and clustering on terms and documents separately; the full R code for this analysis is here.

So far I have worked with the reviews only; let's also look at the rating associated with each review. I discuss this in my next post, Text Analytics Part V.
