A Study for Evaluating the Importance of Various Parts of Speech (POS) for Information Retrieval (IR)

Traditionally, in the vector space model of document representation for various Information Retrieval (IR) tasks, all content words are used without considering their individual significance in the language. Such methods treat a document as a bag of words and do not exploit any language-related information. Considering such information when representing documents can clearly help improve the performance of various IR tasks, but obtaining it is considered difficult. One piece of information that can be important is knowledge of the role played by the various parts of speech (POS). Although the importance of a POS is subjective and depends on the application as well as the domain under consideration, it is still useful to evaluate this importance in a general setup. In this paper we present a study to understand it. We first generate document vectors using a particular POS. We then evaluate how good this representation is by measuring the information content provided by the document vectors. This information is then used to reconstruct the document vectors. To show that these document vectors are better than those generated by traditional methods, we consider a text classification application. We show some improvement in classification accuracy, but, more importantly, we demonstrate the consistency of the results and a step toward a new and promising direction for using semantics in IR tasks.
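
The following is a minimal illustrative sketch, not the paper's exact method: it only shows how one might restrict the bag-of-words representation to a single POS class (here, nouns) and compare a text classifier trained on the resulting vectors against the usual all-content-words baseline. The information-content-based reconstruction step described above is not implemented here, and the helper names (keep_pos, evaluate, load_your_corpus) as well as the choice of NLTK for tagging and scikit-learn for classification are assumptions made for illustration.

    # Sketch: POS-filtered document vectors vs. a plain bag-of-words baseline.
    # Assumes NLTK data ('punkt', 'averaged_perceptron_tagger') is available.
    import nltk
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import accuracy_score

    def keep_pos(text, prefixes=("NN",)):
        """Keep only tokens whose Penn Treebank tag starts with one of `prefixes`."""
        tagged = nltk.pos_tag(nltk.word_tokenize(text))
        return " ".join(tok for tok, tag in tagged if tag.startswith(prefixes))

    def evaluate(docs, labels, pos_prefixes=None):
        """Train/test a TF-IDF + logistic regression classifier, optionally POS-filtered."""
        texts = [keep_pos(d, pos_prefixes) for d in docs] if pos_prefixes else docs
        X = TfidfVectorizer(stop_words="english").fit_transform(texts)
        X_tr, X_te, y_tr, y_te = train_test_split(X, labels, test_size=0.3, random_state=0)
        clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
        return accuracy_score(y_te, clf.predict(X_te))

    # docs, labels = load_your_corpus()   # hypothetical corpus loader
    # print("all words :", evaluate(docs, labels))
    # print("nouns only:", evaluate(docs, labels, ("NN",)))
    # print("verbs only:", evaluate(docs, labels, ("VB",)))

Comparing the accuracies obtained for different POS classes against the all-words baseline gives a rough, application-specific sense of how much each POS contributes, which is the kind of comparison the study above carries out in a more principled, information-theoretic way.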
