Improving Text Classification by Shrinkage in a Hierarchy of Classes

When documents are organized in a large number of topic categories, the categories are often arranged in a hierarchy. The U.S. patent database and Yahoo are two examples. This paper shows that the accuracy of a naive Bayes text classifier can be significantly improved by taking advantage of a hierarchy of classes. We adopt an established statistical technique called shrinkage that smoothes parameter estimates of a data-sparse child with its parent in order to obtain more robust parameter estimates. The approach is also employed in deleted interpolation, a technique for smoothing n-grams in language modeling for speech recognition. Our method scales well to large data sets, with numerous categories in large hierarchies. Experimental results on three real-world data sets from UseNet, Yahoo, and corporate web pages show improved performance, with a reduction in error up to 29% over the traditional flat classifier.
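The shrinkage idea described above can be illustrated with a minimal sketch. It assumes that the smoothed word probabilities of a leaf class are a weighted mixture of the estimates along its path to the root. The function names (`mle`, `shrink`), the toy vocabulary and token lists, the extra uniform estimate, and the fixed mixture weights are all hypothetical choices made for illustration; the abstract does not state how the weights are set.

```python
from collections import Counter


def mle(tokens, vocab):
    """Maximum-likelihood word probabilities from a list of tokens."""
    counts = Counter(tokens)
    total = sum(counts[w] for w in vocab)
    return {w: (counts[w] / total if total else 0.0) for w in vocab}


def shrink(estimates, weights):
    """Mix the estimates on the path child -> ancestors.

    `weights` are assumed fixed here and must sum to 1.
    """
    assert abs(sum(weights) - 1.0) < 1e-9
    vocab = estimates[0].keys()
    return {
        w: sum(lam * est[w] for lam, est in zip(weights, estimates))
        for w in vocab
    }


# Toy data (made up): a data-sparse child class and its better-populated parent.
vocab = ["ball", "goal", "puck", "stock"]
child_tokens = ["puck", "goal"]
parent_tokens = ["ball", "goal", "goal", "puck", "ball", "ball"]

child = mle(child_tokens, vocab)
parent = mle(parent_tokens, vocab)
# Assumption: a uniform estimate acts as a final fallback above the parent.
uniform = {w: 1.0 / len(vocab) for w in vocab}

smoothed = shrink([child, parent, uniform], [0.5, 0.4, 0.1])
```

In this toy case the child never saw "ball", so its raw estimate for that word is zero, while the smoothed estimate borrows probability mass from the parent and stays a valid distribution.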
