Classifying Scientific Publications Using Abstract Features

With the exponential growth in the number of documents available online (e.g., news articles, weblogs, scientific papers), effective and efficient classification methods are needed to deliver the appropriate information to specific users or groups. The performance of a document classifier depends critically, among other things, on the choice of feature representation. The commonly used "bag of words" representation can result in a very large number of features. Feature abstraction reduces the classifier's input size by learning an abstraction hierarchy over the set of words; a cut through the hierarchy specifies a compressed model in which the nodes on the cut serve as abstract features. In this paper, we compare feature abstraction with two other dimensionality reduction methods: feature selection and Latent Dirichlet Allocation (LDA). Experimental results on two data sets of scientific publications show that classifiers trained on abstract features significantly outperform both classifiers trained on the features with the highest average mutual information with the class and classifiers trained on the topic distributions and topic words produced by LDA. Furthermore, we propose an approach to automatically identifying a cut so as to trade off classifier complexity against performance. Our results demonstrate the feasibility of the proposed approach.
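The abstraction idea above can be sketched with a toy example in the spirit of distributional word clustering: words with similar class-conditional distributions are agglomeratively merged, and stopping the merging at k clusters corresponds to taking a cut with k abstract features. The vocabulary, the class-conditional distributions, and the unweighted averaging of merged clusters below are illustrative assumptions, not the exact procedure evaluated in the paper.

```python
import math
from itertools import combinations

# Toy class-conditional distributions P(class | word) for a two-class
# problem (e.g., neuroscience vs. finance). Values are assumptions.
word_class_dist = {
    "neuron":   [0.90, 0.10],
    "synapse":  [0.85, 0.15],
    "market":   [0.10, 0.90],
    "stock":    [0.05, 0.95],
    "learning": [0.60, 0.40],
}

def js_divergence(p, q):
    """Jensen-Shannon divergence between two discrete distributions."""
    m = [(a + b) / 2 for a, b in zip(p, q)]
    def kl(a, b):
        return sum(x * math.log2(x / y) for x, y in zip(a, b) if x > 0)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def abstract_features(dists, k):
    """Agglomeratively merge the two most similar clusters until k remain.

    The sequence of merges defines an abstraction hierarchy; stopping at
    k clusters is a cut through that hierarchy. Returns a mapping from
    each word to the index of its abstract feature.
    """
    clusters = [([w], list(p)) for w, p in dists.items()]
    while len(clusters) > k:
        # Find the closest pair of clusters under JS divergence.
        i, j = min(combinations(range(len(clusters)), 2),
                   key=lambda ij: js_divergence(clusters[ij[0]][1],
                                                clusters[ij[1]][1]))
        wi, pi = clusters[i]
        wj, pj = clusters[j]
        # Merge: pool the words; average the distributions (a weighted
        # average by word frequency would be more faithful in practice).
        merged = (wi + wj, [(a + b) / 2 for a, b in zip(pi, pj)])
        clusters = [c for idx, c in enumerate(clusters)
                    if idx not in (i, j)] + [merged]
    return {w: idx for idx, (words, _) in enumerate(clusters) for w in words}

mapping = abstract_features(word_class_dist, k=2)
print(mapping)  # each word mapped to one of 2 abstract features
```

With a cut at k = 2, the words that predict the same class ("neuron", "synapse", "learning" vs. "market", "stock") collapse into the same abstract feature, so a downstream classifier sees 2 inputs instead of 5.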
