Index Words Selection with ICA

We propose a method that uses independent component analysis (ICA) to select index words for constructing document vectors from a corpus. Selecting index words is useful because the dimension of a document vector is otherwise large. ICA is one of the methods for analyzing the latent semantics of documents. It has been reported that the independent components obtained by ICA represent the topics in the documents, and the words in an independent component are considered to be the key words of its topic. The proposed method selects the key words that have high weight in each independent component and adds them to a set of index words. In addition, it selects other words related to the key words, according to a chi-squared measure relating the co-occurrence of the key words with each word to the appearance of the key words, and adds these words to the set of index words as well. Finally, the index words obtained are evaluated.
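
The two selection steps described above can be illustrated with a minimal sketch. It is not the implementation evaluated here, and it rests on several assumptions: the independent-component weights are taken as already computed by some ICA routine, "high weight" is read as largest absolute weight (the sign of an independent component is arbitrary), and the chi-squared measure is read as a Pearson test on a 2x2 document-level contingency table for each pair of key word and candidate word. The function names, the toy data, and the threshold of 3.84 (the 5% critical value for one degree of freedom) are all hypothetical.

```python
from typing import List, Sequence, Set


def top_key_words(components: Sequence[Sequence[float]],
                  vocab: Sequence[str], k: int) -> Set[str]:
    """Pick the k words with the largest absolute weight in each component."""
    selected: Set[str] = set()
    for weights in components:
        ranked = sorted(range(len(vocab)),
                        key=lambda j: abs(weights[j]), reverse=True)
        selected.update(vocab[j] for j in ranked[:k])
    return selected


def chi_squared(a: int, b: int, c: int, d: int) -> float:
    """Pearson chi-squared statistic for the 2x2 table [[a, b], [c, d]]."""
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        return 0.0
    return n * (a * d - b * c) ** 2 / denom


def related_words(docs: List[Set[str]], key_words: Set[str],
                  threshold: float) -> Set[str]:
    """Words positively associated with at least one key word (assumed reading)."""
    vocabulary = set().union(*docs)
    related: Set[str] = set()
    for key in key_words:
        for word in vocabulary - key_words:
            a = b = c = d = 0
            for doc in docs:
                if key in doc:
                    if word in doc:
                        a += 1  # key word and candidate co-occur
                    else:
                        b += 1  # key word appears without the candidate
                elif word in doc:
                    c += 1      # candidate appears without the key word
                else:
                    d += 1      # neither appears
            # Keep only positive associations that pass the threshold.
            if a * d > b * c and chi_squared(a, b, c, d) >= threshold:
                related.add(word)
    return related


# Toy data (hypothetical): two components over a six-word vocabulary.
vocab = ["stock", "market", "goal", "match", "the", "weather"]
components = [
    [0.9, 0.7, 0.0, 0.1, 0.2, 0.0],
    [0.1, 0.0, -0.8, 0.6, 0.2, 0.1],
]
docs = [
    {"stock", "market", "the"},
    {"stock", "market"},
    {"goal", "match", "the"},
    {"goal", "match"},
    {"the", "weather"},
    {"weather"},
]

key_words = top_key_words(components, vocab, k=1)
index_words = key_words | related_words(docs, key_words, threshold=3.84)
print(sorted(index_words))
```

In this toy corpus the frequent word "the" is not added, because its occurrence is independent of both key words, while words that occur only alongside a key word pass the threshold.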