Partial orders for document representation: a new methodology for combining document features

This paper describes a novel paradigm for representing many types of information about documents in a form particularly suited to text categorization by a trivial empirical rule induction system. It may also apply to full-text retrieval. The paradigm allows many different types of document predicates to be combined while controlling for the logical dependencies among them. This approach is shown to be justified under any reasonable model of descriptor inference, and the effect of increasing representational sophistication is demonstrated on two corpora.
