Cluster analysis or clustering is the task of grouping a set of objects in such a way that objects in the same group (called a cluster) are more similar (in some sense or another) to each other than to those in other groups (clusters).

We ran cluster analysis separately on terms and on documents, after the dimension reduction (LSA) described in Text Analytics Part III – Dimension Reduction using R.

#### Cluster Analysis for “Terms”:

I started by finding the optimal number of clusters using “hierarchical clustering” with Ward's method. The dendrogram is built on the centroids of a preliminary k-means run with 150 clusters, as in the R code at the end of this post. From the dendrogram below, we got 4 options: “3”, “5”, “6” and “7”. From these four options we need to select the one that gives the most sensible number of clusters. By running k-means for each of the four options and comparing the sizes of the clusters formed, we found “6” to be the best case.
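The two-stage selection described above can be sketched roughly as follows. This is only an illustration on toy data: the `toy` data frame, the 30 preliminary clusters and the seed are assumptions standing in for the LSA term coordinates and the 150 clusters used in the post. The post does not spell out the exact rule for picking one option from the cluster sizes, so the sketch just prints the sizes for comparison.

```r
# Toy stand-in for the LSA term coordinates: three loose groups in 3 dimensions.
set.seed(42)
centers <- rbind(c(0, 0, 0), c(4, 4, 0), c(0, 4, 4))
toy <- do.call(rbind, lapply(1:3, function(i) {
  noise <- matrix(rnorm(100 * 3, sd = 0.7), ncol = 3)
  noise + matrix(centers[i, ], nrow = 100, ncol = 3, byrow = TRUE)
}))
toy <- as.data.frame(toy)  # columns are named V1, V2, V3

# Stage 1: compress the points into many small k-means clusters, keep the centroids.
pre <- kmeans(scale(toy), centers = 30, nstart = 20, iter.max = 50)
centroids <- aggregate(toy, by = list(cluster = pre$cluster), FUN = mean)

# Stage 2: Ward hierarchical clustering on the centroids.
h <- hclust(dist(scale(centroids[, -1])), method = "ward.D")
# plot(h, hang = -1)  # inspect the dendrogram to read off candidate cluster counts

# Run k-means for each candidate count and compare the cluster sizes.
candidates <- c(3, 5, 6, 7)
sizes <- lapply(candidates, function(k) {
  sort(kmeans(scale(toy), centers = k, nstart = 20, iter.max = 50)$size,
       decreasing = TRUE)
})
names(sizes) <- paste0("k=", candidates)
print(sizes)
```

Clustering the centroids instead of every point keeps the dendrogram readable when there are many terms.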

Below is the cluster plot, showing which terms belong to which cluster. All clusters except cluster 4 overlap, so it is hard to infer anything from this plot.

Size for each cluster is:

To look inside the clusters, we created a word cloud for each cluster, shown below:

The terms within each cluster are now clearer.
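As a small illustration of how terms can be grouped by cluster before drawing the word clouds, here is a minimal sketch. The terms, frequencies and cluster labels are made up for the example and are not taken from the review data. The important point is that the frequency table and the cluster labels must be in the same term order.

```r
# Hypothetical term frequencies and cluster labels (not from the review data).
freq <- c(battery = 40, screen = 25, camera = 30, price = 18, touch = 12, design = 9)
cluster <- c(1, 2, 1, 3, 2, 2)  # one label per term, in the same order as freq

word_freq <- data.frame(words = names(freq), freq = unname(freq),
                        stringsAsFactors = FALSE)

# Split the terms by cluster label.
by_cluster <- split(word_freq, cluster)

# Within each cluster, list the most frequent terms first.
top_terms <- lapply(by_cluster, function(d) d$words[order(d$freq, decreasing = TRUE)])
print(top_terms)

# With the wordcloud package installed, one cloud per cluster could be drawn as:
# for (d in by_cluster) wordcloud::wordcloud(d$words, d$freq, random.order = FALSE)
```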

**Cluster1:** This cluster seems to compare MOTO-G’s features with phones like Xiaomi Redmi and Nexus on aspects like “touch”, “design”, “memory card slot”, “application updates” etc.

**Cluster2:** This cluster is purely about the MOTO-G (2nd gen) phone: its specifications, its battery backup, its performance, its availability on Flipkart etc.

**Cluster5:** MOTO-G compared with Asus Zenfone on hardware parts like “touch”, “buttons”, “colors”, “models” etc.

**Cluster3, Cluster4, Cluster6:** Nothing much can be inferred.

#### Cluster Analysis for “Documents”:

For documents too, I started by finding the optimal number of clusters using “hierarchical clustering” with Ward's method. From the dendrogram below, we got 3 options: “3”, “4” and “5”. From these three options we need to select the one that gives the most sensible number of clusters. By running k-means for each option and comparing the sizes of the clusters formed, we found “3” to be the best case.

Below is the cluster plot, showing which document belongs to which cluster. For documents, the cluster plot is clear: it shows which documents make up each cluster.
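For readers who want to see roughly what such a cluster plot contains, here is a minimal sketch in base R. It assumes toy data in place of the document coordinates (`docs`, the group sizes and the seed are all invented). The points are projected onto the first two principal components and coloured by k-means cluster, which is close to what `clusplot` from the `cluster` package draws, minus the ellipses.

```r
set.seed(7)
# Toy "documents": 100 points in 5 dimensions, forming three shifted groups.
grp <- rep(1:3, times = c(40, 35, 25))
shift <- c(0, 4, 8)[grp]
docs <- matrix(rnorm(100 * 5), ncol = 5) + shift  # shift is added row by row

# k-means with 3 clusters on the scaled coordinates.
k3 <- kmeans(scale(docs), centers = 3, nstart = 20)

# Project onto the first two principal components for plotting.
pc <- prcomp(scale(docs))$x[, 1:2]

plot(pc, col = k3$cluster, pch = 19, xlab = "Component 1", ylab = "Component 2")
text(pc, labels = seq_len(nrow(pc)), pos = 3, cex = 0.6)  # label points by document number
```

When the groups are well separated, as in this toy data, each colour forms its own blob, which matches the clear picture seen for documents.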

Size for each cluster is:

#### R Code to do this:

```
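#LSA using SVD
library(RTextTools)   #create_matrix
library(tm)           #weightTfIdf
library(lsa)          #lsa, dimcalc_share
library(cluster)      #clusplot
library(wordcloud)    #wordcloud
library(RColorBrewer) #brewer.pal
#reviews holds the review texts prepared in the earlier parts of this series
#create_matrix returns a document-term matrix (documents in rows, terms in columns)
tdm=create_matrix(reviews,removeNumbers=TRUE)
tdm_tfidf=weightTfIdf(tdm)
m=as.matrix(tdm)
m_tfidf=as.matrix(tdm_tfidf)
#lsa expects terms in rows, hence the transpose; dimcalc_share picks the number of dimensions from a 0.8 share of the singular values
lsa_m=lsa(t(m),dimcalc_share(share=0.8))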
lsa_m_tk=as.data.frame(lsa_m$tk)
lsa_m_dk=as.data.frame(lsa_m$dk)
lsa_m_sk=as.data.frame(lsa_m$sk)
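#pre-cluster with k-means (random starts) and keep the cluster centroids on the first 3 LSA dimensions
set.seed(123) #fix the seed so that cluster numbering is reproducible (the original run was unseeded)
k150_m_tk=kmeans(scale(lsa_m_tk),centers=150,nstart=20)
c150_m_tk=aggregate(cbind(V1,V2,V3)~k150_m_tk$cluster,data=lsa_m_tk,FUN=mean)
#there are only 100 documents, so 50 centers are used here (the "150" in the names is kept for consistency)
k150_m_dk=kmeans(scale(lsa_m_dk),centers=50,nstart=20)
c150_m_dk=aggregate(cbind(V1,V2,V3)~k150_m_dk$cluster,data=lsa_m_dk,FUN=mean)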
#hierarchical clustering to find optimal no of clusters for c150_m_tk
d=dist(scale(c150_m_tk[,-1]))
h=hclust(d,method='ward.D')
plot(h,hang=-1)
rect.hclust(h,h=20,border="blue")
rect.hclust(h,h=12,border="cyan")
rect.hclust(h,h=15,border="red")
#candidate numbers of clusters read off the dendrogram: 3,5,6
#chosen number of clusters: 6
k6_m_tk=kmeans(scale(lsa_m_tk),centers=6,nstart=20)
c6_m_tk=aggregate(cbind(V1,V2,V3)~k6_m_tk$cluster,data=lsa_m_tk,FUN=mean)
#hierarchical clustering to find optimal no of clusters for c150_m_dk
d=dist(scale(c150_m_dk[,-1]))
h=hclust(d,method='ward.D')
plot(h,hang=-1)
rect.hclust(h,h=5,border="blue")
rect.hclust(h,h=15,border="red")
rect.hclust(h,h=8,border="green")
#candidate numbers of clusters read off the dendrogram: 3,4,5
#chosen number of clusters: 3
k3_m_dk=kmeans(scale(lsa_m_dk),centers=3,nstart=20)
c3_m_dk=aggregate(cbind(V1,V2,V3)~k3_m_dk$cluster,data=lsa_m_dk,FUN=mean)
clusplot(lsa_m_dk, k3_m_dk$cluster, color=TRUE, shade=TRUE, labels=2, lines=0)
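#Result of clustering on lsa_m_tk
#keep term frequencies in the column order of m so that they line up with k6_m_tk$cluster
#(sorting by frequency here would assign words to the wrong clusters)
v=colSums(m)
wordFreq=data.frame(words=names(v),freq=v)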
k6_1_m_tk=wordFreq[k6_m_tk$cluster==1,]
k6_2_m_tk=wordFreq[k6_m_tk$cluster==2,]
k6_3_m_tk=wordFreq[k6_m_tk$cluster==3,]
k6_4_m_tk=wordFreq[k6_m_tk$cluster==4,]
k6_5_m_tk=wordFreq[k6_m_tk$cluster==5,]
k6_6_m_tk=wordFreq[k6_m_tk$cluster==6,]
wordcloud(k6_1_m_tk$words,k6_1_m_tk$freq,max.words=154,colors=brewer.pal(8,"Dark2"),scale=c(3,0.5),random.order=F)
wordcloud(k6_2_m_tk$words,k6_2_m_tk$freq,max.words=300,colors=brewer.pal(8,"Dark2"),scale=c(3,0.5),random.order=F)
wordcloud(k6_3_m_tk$words,k6_3_m_tk$freq,max.words=39,colors=brewer.pal(8,"Dark2"),scale=c(3,0.5),random.order=F)
wordcloud(k6_4_m_tk$words,k6_4_m_tk$freq,max.words=3,colors=brewer.pal(8,"Dark2"),scale=c(3,0.5),random.order=F)
wordcloud(k6_5_m_tk$words,k6_5_m_tk$freq,max.words=99,colors=brewer.pal(8,"Dark2"),scale=c(3,0.5),random.order=F)
wordcloud(k6_6_m_tk$words,k6_6_m_tk$freq,max.words=32,colors=brewer.pal(8,"Dark2"),scale=c(3,0.5),random.order=F)
clusplot(lsa_m_tk, k6_m_tk$cluster, color=TRUE, shade=TRUE, labels=2, lines=0)
#lsa_m_tk
lsa_m_tk3=data.frame(words=rownames(lsa_m_tk),lsa_m_tk[,1:3])
plot(lsa_m_tk3$V1,lsa_m_tk3$V2)
text(lsa_m_tk3$V1,lsa_m_tk3$V2,label=lsa_m_tk3$words)
plot(lsa_m_tk3$V2,lsa_m_tk3$V3)
text(lsa_m_tk3$V2,lsa_m_tk3$V3,label=lsa_m_tk3$words)
plot(lsa_m_tk3$V1,lsa_m_tk3$V3)
text(lsa_m_tk3$V1,lsa_m_tk3$V3,label=lsa_m_tk3$words)
#Result of clustering on lsa_m_dk
#add a document id column, named before subsetting so that the subsets inherit the name
lsa_m_dk=cbind(doc=seq_len(nrow(lsa_m_dk)),lsa_m_dk)
k3_1_m_dk=lsa_m_dk[k3_m_dk$cluster==1,]
k3_2_m_dk=lsa_m_dk[k3_m_dk$cluster==2,]
k3_3_m_dk=lsa_m_dk[k3_m_dk$cluster==3,]
plot(lsa_m_dk$V1,lsa_m_dk$V2)
text(lsa_m_dk$V1,lsa_m_dk$V2,label=lsa_m_dk$doc)
plot(lsa_m_dk$V2,lsa_m_dk$V3)
text(lsa_m_dk$V2,lsa_m_dk$V3,label=lsa_m_dk$doc)
plot(lsa_m_dk$V1,lsa_m_dk$V3)
text(lsa_m_dk$V1,lsa_m_dk$V3,label=lsa_m_dk$doc)
```

This is a chunk of the code for doing LSA and clustering on terms and documents separately. The full R code for this analysis is **here**.
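To make the `tk`, `dk` and `sk` pieces used in the code a little more concrete, here is a minimal sketch of the same idea using base R's `svd()` on a toy term-document matrix. The terms and counts are invented. The rule for choosing the number of dimensions is a simplified stand-in that is similar in spirit to `dimcalc_share(share=0.8)`, and may not match the `lsa` package exactly.

```r
# Toy term-document matrix: terms in rows, documents in columns (hypothetical counts).
tdm <- matrix(c(2, 0, 1, 0,
                1, 0, 2, 0,
                0, 3, 0, 1,
                0, 1, 0, 2,
                1, 1, 1, 1),
              nrow = 5, byrow = TRUE,
              dimnames = list(c("battery", "charge", "screen", "touch", "phone"),
                              paste0("doc", 1:4)))

s <- svd(tdm)

# Keep the smallest number of dimensions whose singular values reach 80% of the total.
k <- which(cumsum(s$d) / sum(s$d) >= 0.8)[1]

tk <- s$u[, 1:k, drop = FALSE]  # term coordinates (one row per term)
dk <- s$v[, 1:k, drop = FALSE]  # document coordinates (one row per document)
sk <- s$d[1:k]                  # retained singular values
rownames(tk) <- rownames(tdm)
rownames(dk) <- colnames(tdm)

# Rank-k approximation of the original matrix.
approx_tdm <- tk %*% diag(sk, nrow = k) %*% t(dk)
print(round(approx_tdm, 2))
```

The rows of `tk` are what gets clustered for terms, and the rows of `dk` are what gets clustered for documents.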

So far I have worked only with the reviews. Let's also look at the rating associated with each review. I discuss this in my next post, Text Analytics Part V.