Taxonomic Dimensionality Reduction in Bayesian Text Classification

Lexical abstraction hierarchies can supply semantic information that characterizes the features of a text corpus as a whole. This information can be used to estimate the classification utility of the dimensions that describe a dataset. This paper presents a new method for preparing a dataset for probabilistic classification: it determines, a priori, the utility of a very small subset of taxonomically related dimensions via a Discriminative Multinomial Naive Bayes process. We show that this method yields significant improvements over both Discriminative Multinomial Naive Bayes and Bayesian network classifiers used alone.
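To make the general idea concrete, the following is a minimal sketch of taxonomic abstraction as a dimensionality-reduction step before Bayesian classification. It is not the procedure evaluated in this paper: the `HYPERNYM` table is a hypothetical hand-written taxonomy standing in for a lexical hierarchy, the helper names (`abstract_tokens`, `train_mnb`, `predict`) are illustrative, and the classifier is a plain multinomial naive Bayes with Laplace smoothing rather than Discriminative Multinomial Naive Bayes.

```python
import math
from collections import Counter, defaultdict

# Hypothetical toy taxonomy (word -> hypernym); an assumption for illustration,
# not taken from the paper or from any real lexical database.
HYPERNYM = {
    "dog": "animal", "cat": "animal", "horse": "animal",
    "car": "vehicle", "truck": "vehicle", "bus": "vehicle",
}


def abstract_tokens(tokens):
    """Replace each token with its hypernym when the taxonomy knows one."""
    return [HYPERNYM.get(t, t) for t in tokens]


def train_mnb(docs, labels, alpha=1.0):
    """Fit a plain multinomial naive Bayes model with Laplace smoothing."""
    vocab = sorted({t for d in docs for t in d})
    class_counts = Counter(labels)
    term_counts = defaultdict(Counter)
    for doc, label in zip(docs, labels):
        term_counts[label].update(doc)
    model = {"log_prior": {}, "log_like": {}}
    for label, n in class_counts.items():
        model["log_prior"][label] = math.log(n / len(labels))
        total = sum(term_counts[label].values()) + alpha * len(vocab)
        model["log_like"][label] = {
            t: math.log((term_counts[label][t] + alpha) / total) for t in vocab
        }
    return model


def predict(model, doc):
    """Return the class with the highest posterior log-score.

    Tokens never seen in training are ignored.
    """
    scores = {}
    for label, prior in model["log_prior"].items():
        like = model["log_like"][label]
        scores[label] = prior + sum(like[t] for t in doc if t in like)
    return max(scores, key=scores.get)


raw_docs = [
    ["dog", "barks"], ["cat", "sleeps"],
    ["car", "stops"], ["truck", "stops"],
]
labels = ["pets", "pets", "traffic", "traffic"]

# Collapse taxonomically related words into a shared dimension.
abstracted = [abstract_tokens(d) for d in raw_docs]
raw_vocab = {t for d in raw_docs for t in d}
abs_vocab = {t for d in abstracted for t in d}

model = train_mnb(abstracted, labels)
# "horse" never occurs in training, but its hypernym "animal" does.
prediction = predict(model, abstract_tokens(["horse", "sleeps"]))
print(len(raw_vocab), len(abs_vocab), prediction)
```

In this toy setting the abstraction merges the animal and vehicle words into two shared dimensions, so an unseen word such as "horse" still contributes evidence through its hypernym. How abstracted dimensions are selected and scored in practice is the subject of the method itself and is not captured by this sketch.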
