Text Analytics Part III – Dimension Reduction using R

So far, we have covered the basics in Text Analytics Part I and Text Analytics Part II.

Now let's do something that is a little harder to grasp: working with multiple dimensions. For this, I have used dimension reduction, i.e. Latent Semantic Analysis (LSA) via singular value decomposition (SVD), for the following two reasons:

  1. With 100 dimensions (one per document) and 2282 terms, it is difficult to analyze all of them at the same time.
  2. A TDM is essentially a very sparse matrix (99% sparseness is very common), so LSA is used to remove this sparseness (see the sketch below).
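
As a minimal sketch of where this sparseness shows up (assuming a cleaned corpus named corpus carried over from Parts I and II; the variable name is hypothetical):

```r
library(tm)

# Build the term-document matrix from the cleaned corpus.
tdm <- TermDocumentMatrix(corpus)
tdm                  # printing a TDM also reports its sparsity, e.g. "Sparsity: 99%"

m <- as.matrix(tdm)  # here: 2282 terms x 100 documents
dim(m)
```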

What are LSA and SVD?

Latent semantic analysis (LSA) is a technique in natural language processing used for analyzing relationships between a set of documents and the terms they contain, by producing a set of concepts related to the documents and terms. LSA assumes that words that are close in meaning will occur in similar pieces of text. A matrix containing word counts per paragraph (rows represent unique words and columns represent each paragraph) is constructed from a large piece of text, and a mathematical technique called singular value decomposition (SVD) is used to reduce the number of rows while preserving the similarity structure among columns.

In simpler words, LSA gives us a way of comparing documents at a higher level than individual terms by introducing a concept called a feature, and SVD is a way of extracting those features from documents.
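
To make this concrete, here is what the decomposition looks like in base R, applied to the matrix m from the sketch above (svd() is standard R; the variable names are mine):

```r
# Full SVD of the term-document matrix.
s <- svd(m)

# s$d holds the singular values in descending order;
# s$u maps terms to dimensions, s$v maps documents to dimensions.
k  <- 50                # number of dimensions to keep after reduction
Tk <- s$u[, 1:k]
Sk <- diag(s$d[1:k])
Dk <- s$v[, 1:k]
```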

The 3 matrices generated: Tk, Sk, Dk

The diagonal matrix Sk contains the singular values of the TDM in descending order; the ith singular value indicates the amount of variation along the ith axis. Tk contains the terms and their values along each dimension, and Dk contains the documents and their values along each dimension.
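
In practice, the lsa package wraps this truncated SVD in a single call; a sketch, again assuming the matrix m from above:

```r
library(lsa)

# lsa() performs the truncated SVD directly; dims can be a fixed number
# of dimensions or a helper function such as dimcalc_share().
space <- lsa(m, dims = 50)
str(space)  # an LSA space with components $tk, $sk and $dk
```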

We can find the best rank-k approximation of the TDM as Tk * Sk * Dk^T (where Dk^T is the transpose of Dk).
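
In R, this reconstruction is a single matrix product; the lsa package also provides as.textmatrix() for the same purpose (a sketch based on the space object from above):

```r
# Rank-k approximation of the TDM: Tk %*% Sk %*% t(Dk).
tdm_approx <- space$tk %*% diag(space$sk) %*% t(space$dk)

# Equivalent convenience function from the lsa package:
tdm_approx <- as.textmatrix(space)
```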

For MOTO-G, I obtained the three matrices below, with 50 dimensions after reduction.

[Figure: the three LSA matrices (Tk, Sk, Dk)]

Still, 50 dimensions are too many for analysis, so I have chosen 3 dimensions to start our analysis with. As we can see from the Sk matrix, dimensions V1, V2, and V3 have the highest singular values, which means the highest variation lies along these 3 dimensions; that is why I selected them.
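
A quick way to check this is to look at the singular values themselves (a sketch using the space object from above):

```r
# The first few singular values dominate, so most of the variation
# lies along the first few dimensions.
head(space$sk, 10)
barplot(space$sk, main = "Singular values of the TDM")
```
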
When terms were plotted against these 3 dimensions (using the Tk matrix), I got the graphs below:

[Figure: lsa_m_tk_v1_v2 — terms plotted on dimensions V1 vs V2]

The graph above shows the position of each term in a two-dimensional vector space. To compare two terms, we compare the cosine of the angle between the vectors representing them. For example, the term “phone” leans more towards dimension V2, while “moto” leans more towards dimension V1.
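
A sketch of how such a plot and the cosine comparison can be produced, assuming the space object from above with terms as the row names of tk:

```r
library(lsa)

# Plot terms on dimensions V1 and V2 of the Tk matrix.
plot(space$tk[, 1], space$tk[, 2], type = "n", xlab = "V1", ylab = "V2")
text(space$tk[, 1], space$tk[, 2], labels = rownames(space$tk), cex = 0.7)

# Cosine of the angle between two term vectors (lsa::cosine).
cosine(space$tk["phone", ], space$tk["moto", ])
```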

[Figure: lsa_m_tk_v1_v3 — terms plotted on dimensions V1 vs V3]

Similarly, this graph shows the placement of terms between the V1 and V3 dimensions. With the help of terms like “battery”, “great”, “games”, “android”, etc., we can say that dimension V1 represents the specifications and features of this phone.

[Figure: lsa_m_tk_v2_v3 — terms plotted on dimensions V2 vs V3]

In this graph, the cluster of words seems to be equally aligned with both dimensions.

When documents were plotted against these 3 dimensions (using the Dk matrix), I got the graphs below:

[Figure: lsa_m_dk_v1_v2 — documents plotted on dimensions V1 vs V2]

From this graph, we can say that the documents aligned more towards dimension V1 are talking about the specifications of the phone, since we saw above that dimension V1 is made up of terms describing the phone's features.
Documents 48, 49, 90, 99, etc. are aligned more towards dimension V2 than V1.
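
The document plots follow the same pattern as the term plots, only using the Dk matrix instead (again a sketch based on the space object from above):

```r
# Plot documents on dimensions V1 and V2 of the Dk matrix.
plot(space$dk[, 1], space$dk[, 2], type = "n", xlab = "V1", ylab = "V2")
text(space$dk[, 1], space$dk[, 2], labels = rownames(space$dk), cex = 0.7)
```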

[Figure: lsa_m_dk_v1_v3 — documents plotted on dimensions V1 vs V3]

[Figure: lsa_m_dk_v2_v3 — documents plotted on dimensions V2 vs V3]

The same reading applies to these two graphs.

R Code to do this:

Before you start writing code for LSA, refer to Text Analytics Part IV on clustering of terms and documents; it will help you understand this better. I am sharing the full code written during this analysis work; please find it here.
