Rich document representation and classification: An analysis

Three factors are involved in text classification: the classification model, the similarity measure, and the document representation model. In this paper, we focus on document representation and demonstrate that the choice of representation has a profound impact on the quality of the classifier. In our experiments, we use the centroid-based text classifier, a simple and robust text classification scheme. We compare four types of document representation: N-grams, single terms, phrases, and RDR, a logic-based document representation. The N-gram representation is string-based and involves no linguistic processing. The single-term approach is based on words, with minimal linguistic processing. The phrase approach is based on linguistically formed phrases together with single words. RDR relies on linguistic processing and represents each document as a set of logical predicates. We have experimented with many text collections and obtained similar results; here, we base our arguments on experiments conducted on Reuters-21578. We show that RDR, the most complex representation, produces the most effective classifier on Reuters-21578, followed by the phrase approach.
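
To make the setup concrete, the following is a minimal sketch of a centroid-based classifier that accepts interchangeable document representations. It covers only the two simplest representations named above (character N-grams and single terms); the phrase and RDR representations require linguistic processing that is not reproduced here. The function names, the raw term-frequency weighting, the trigram length, and the toy documents are all assumptions made for illustration and are not taken from the paper's experiments.

```python
import math
from collections import Counter, defaultdict


def char_ngrams(text, n=3):
    """String-based representation: overlapping character n-grams, no linguistic processing."""
    s = " ".join(text.lower().split())
    return Counter(s[i:i + n] for i in range(len(s) - n + 1))


def single_terms(text):
    """Word-based representation with minimal processing (lowercasing, whitespace split)."""
    return Counter(text.lower().split())


def normalize(vec):
    """Scale a sparse vector to unit length; an all-zero vector becomes empty."""
    norm = math.sqrt(sum(v * v for v in vec.values()))
    return {k: v / norm for k, v in vec.items()} if norm else {}


def cosine(a, b):
    """Cosine similarity of two unit-length sparse vectors (their dot product)."""
    if len(a) > len(b):
        a, b = b, a
    return sum(w * b.get(k, 0.0) for k, w in a.items())


def train_centroids(docs, represent):
    """Average the unit-length document vectors of each class into one centroid."""
    sums = defaultdict(Counter)
    counts = Counter()
    for text, label in docs:
        for k, w in normalize(represent(text)).items():
            sums[label][k] += w
        counts[label] += 1
    return {
        label: normalize({k: w / counts[label] for k, w in vec.items()})
        for label, vec in sums.items()
    }


def classify(text, centroids, represent):
    """Assign the class whose centroid is most similar to the document."""
    query = normalize(represent(text))
    return max(centroids, key=lambda label: cosine(query, centroids[label]))


# Hypothetical toy training set, not drawn from Reuters-21578.
docs = [
    ("wheat harvest exports rise", "grain"),
    ("corn and wheat prices fall", "grain"),
    ("bank raises interest rates", "money"),
    ("central bank cuts interest rate", "money"),
]

term_centroids = train_centroids(docs, single_terms)
ngram_centroids = train_centroids(docs, char_ngrams)

print(classify("wheat exports", term_centroids, single_terms))
print(classify("interest rate decision", ngram_centroids, char_ngrams))
```

Because the representation is passed in as a function, the classification model and the similarity measure stay fixed while the representation varies, which mirrors the comparison described above.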
