Information Filtering using Index Word Selection based on the Topics

We have proposed an information filtering system using index word selection from a document set based on the topics included in a set of documents. This method narrows down the particularly characteristic words in a document set and the topics are obtained by Sparse Non-negative Matrix Factorization. In information filtering, a document is often represented with the vector in which the elements correspond to the weight of the index words, and the dimension of the vector becomes larger as the number of documents is increased. Therefore, it is possible that useless words as index words for the information filtering are included. In order to address the problem, the dimension needs to be reduced. Our proposal reduces the dimension by selecting index words based on the topics included in a document set. We have applied the Sparse Non-negative Matrix Factorization to the document set to obtain these topics. The filtering is carried out based on a centroid of the learning document set. The centroid is regarded as the user’s interest. In addition, the centroid is represented with a document vector whose elements consist of the weight of the selected index words. Using the English test collection MEDLINE, thus, we confirm the effectiveness of our proposal. Hence, our proposed selection can confirm the improvement of the recommendation accuracy from the other previous methods when selecting the appropriate number of index words. In addition, we discussed the selected index words by our proposal and we found our proposal was able to select the index words covered some minor topics included in the document set. Keywords— Information Filtering, Sparse NMF, Index word Selection, User Profile, Chi-squared Measure

[1]  H. Sebastian Seung,et al.  Learning the parts of objects by non-negative matrix factorization , 1999, Nature.

[2]  Mitsuru Ishizuka,et al.  Keyword extraction from a single document using word co-occurrence statistical information , 2004, Int. J. Artif. Intell. Tools.

[3]  Michael W. Berry,et al.  Information retrieval and filtering using the reimannian svd , 1998 .

[4]  Michael McGill,et al.  Introduction to Modern Information Retrieval , 1983 .

[5]  Michael W. Berry,et al.  Algorithms and applications for approximate nonnegative matrix factorization , 2007, Comput. Stat. Data Anal..

[6]  Yukio Ohsawa,et al.  KeyGraph: automatic indexing by co-occurrence graph based on building construction metaphor , 1998, Proceedings IEEE International Forum on Research and Technology Advances in Digital Libraries -ADL'98-.

[7]  H. Sebastian Seung,et al.  Algorithms for Non-negative Matrix Factorization , 2000, NIPS.

[8]  L. K. Hansen,et al.  Independent Components in Text , 2000 .

[9]  Patrik O. Hoyer,et al.  Non-negative Matrix Factorization with Sparseness Constraints , 2004, J. Mach. Learn. Res..

[10]  Xin Liu,et al.  Document clustering based on non-negative matrix factorization , 2003, SIGIR.

[11]  Nanning Zheng,et al.  Non-negative matrix factorization for visual coding , 2003, 2003 IEEE International Conference on Acoustics, Speech, and Signal Processing, 2003. Proceedings. (ICASSP '03)..

[12]  T. Landauer,et al.  Indexing by Latent Semantic Analysis , 1990 .

[13]  J. J. Rocchio,et al.  Relevance feedback in information retrieval , 1971 .

[14]  Kenji Kita,et al.  Dimensionality reduction using non-negative matrix factorization for information retrieval , 2001, 2001 IEEE International Conference on Systems, Man and Cybernetics. e-Systems and e-Man for Cybernetics in Cyberspace (Cat.No.01CH37236).