Supervised Semantic Indexing Using Sub-spacing

Indexing of textual cases is commonly affected by variation in vocabulary. Semantic indexing addresses this problem by discovering semantic or conceptual relatedness between individual terms and using it to improve textual case representation. However, the representations produced by standard semantic indexing are not optimal for supervised tasks because they do not take the class membership of the textual cases into account. Supervised semantic indexing approaches, such as sprinkled Latent Semantic Indexing (SpLSI) and supervised Latent Dirichlet Allocation (sLDA), have been proposed to address this limitation, but both are computationally expensive and require parameter tuning. In this work, we present Supervised Sub-Spacing (S3), an approach for supervised semantic indexing of documents. S3 creates a separate sub-space for each class, within which class-specific term relations and term weights are extracted. The power of S3 lies in its ability to modify document representations so that documents belonging to the same class become more similar to one another while, at the same time, their similarity to documents of other classes is reduced. In addition, S3 is flexible enough to work with a variety of semantic relatedness metrics, yet powerful enough to yield significant improvements in text classification accuracy. We evaluate our approach on a number of supervised datasets, and the results show that classification on S3-based representations significantly outperforms both sprinkled LSI and supervised LDA.
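
The following is a minimal sketch of the per-class sub-spacing idea described above, assuming documents arrive as TF-IDF vectors, that term relatedness within a class sub-space is approximated by a cosine-normalised term co-occurrence matrix over that class's documents, and that class-specific term weights are approximated by mean term weight within the class. The function name, the relatedness metric, and the exact re-representation step are illustrative assumptions, not the paper's precise formulation.

```python
import numpy as np

def s3_transform(X, y):
    """Re-represent each document in the sub-space built from the
    documents of its own class (illustrative sketch of S3)."""
    X_new = np.zeros_like(X, dtype=float)
    for c in np.unique(y):
        idx = np.where(y == c)[0]
        Xc = X[idx]                        # documents of class c only
        # class-specific term-term relatedness from co-occurrence in the sub-space
        R = Xc.T @ Xc
        norms = np.linalg.norm(Xc, axis=0)
        norms[norms == 0] = 1.0
        R = R / np.outer(norms, norms)     # cosine-normalised relatedness
        # class-specific term weights: mean weight of each term in the class
        w = Xc.mean(axis=0)
        # spread each document's weight to related terms, then emphasise
        # terms that matter for this class
        X_new[idx] = (Xc @ R) * w
    return X_new

# usage: X is an (n_docs, n_terms) TF-IDF matrix, y holds class labels;
# a classifier is then trained on s3_transform(X, y)
```

Because the term relations and weights are computed only from same-class documents, documents of one class are pulled toward a shared vocabulary profile, which is what makes them more similar to each other than to documents of other classes.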
