Exploiting structural information for semi-structured document categorization

This paper examines several different approaches to exploiting structural information in semi-structured document categorization. The methods under consideration are designed for categorization of documents consisting of a collection of fields, or arbitrary tree-structured documents that can be adequately modeled with such a fiat structure. The approaches range from trivial modifications of text modeling to more elaborate schemes, specifically tailored to structured documents. We combine these methods with three different text classification algorithms and evaluate their performance on four standard datasets containing different types of semi-structured documents. The best results were obtained with stacking, an approach in which predictions based on different structural components are combined by a meta classifier. A further improvement of this method is achieved by including the flat text model in the final prediction.

[1]  Thomas G. Dietterich What is machine learning? , 2020, Archives of Disease in Childhood.

[2]  Thorsten Joachims,et al.  Text Categorization with Support Vector Machines: Learning with Many Relevant Features , 1998, ECML.

[3]  Piotr Indyk,et al.  Enhanced hypertext categorization using hyperlinks , 1998, SIGMOD '98.

[4]  Christopher Meek,et al.  Challenges of the Email Domain for Text Classification , 2000, ICML.

[5]  Ian H. Witten,et al.  Stacked generalization: when does it work? , 1997, IJCAI 1997.

[6]  David D. Lewis,et al.  Naive (Bayes) at Forty: The Independence Assumption in Information Retrieval , 1998, ECML.

[7]  Shi Bing,et al.  Inductive learning algorithms and representations for text categorization , 2006 .

[8]  Hisham M. Haddad,et al.  Proceedings of the 2003 ACM Symposium on Applied Computing (SAC), March 9-12, 2003, Melbourne, FL, USA , 2003, SAC.

[9]  David Madigan,et al.  On the Naive Bayes Model for Text Categorization , 2003, AISTATS.

[10]  Andrew McCallum,et al.  A comparison of event models for naive bayes text classification , 1998, AAAI 1998.

[11]  Vladimir N. Vapnik,et al.  The Nature of Statistical Learning Theory , 2000, Statistics for Engineering and Information Science.

[12]  Fabrizio Sebastiani,et al.  Supervised term weighting for automated text categorization , 2003, SAC '03.

[13]  Thorsten Joachims,et al.  Learning to classify text using support vector machines - methods, theory and algorithms , 2002, The Kluwer international series in engineering and computer science.

[14]  John Platt,et al.  Probabilistic Outputs for Support vector Machines and Comparisons to Regularized Likelihood Methods , 1999 .

[15]  George Forman,et al.  Learning from Little: Comparison of Classifiers Given Little Training , 2004, PKDD.

[16]  Yiming Yang,et al.  A Study of Approaches to Hypertext Categorization , 2002, Journal of Intelligent Information Systems.

[17]  Yiming Yang,et al.  Hypertext Categorization using Hyperlink Patterns and Meta Data , 2001, ICML.

[18]  David M. Pennock,et al.  Using web structure for classifying and describing web pages , 2002, WWW.

[19]  Ran El-Yaniv,et al.  Distributional Word Clusters vs. Words for Text Categorization , 2003, J. Mach. Learn. Res..

[20]  Thorsten Joachims,et al.  Making large-scale support vector machine learning practical , 1999 .

[21]  R. Bekkerman Distributional Word Clusters vs , 2006 .

[22]  Yiming Yang,et al.  The Enron Corpus: A New Dataset for Email Classi(cid:12)cation Research , 2004 .

[23]  Johannes Fürnkranz,et al.  Exploiting Structural Information for Text Classification on the WWW , 1999, IDA.

[24]  Ludovic Denoyer,et al.  Bayesian network model for semi-structured document classification , 2004, Inf. Process. Manag..

[25]  Susan T. Dumais,et al.  The Combination of Text Classifiers Using Reliability Indicators , 2016, Information Retrieval.

[26]  Yiming Yang,et al.  A re-examination of text categorization methods , 1999, SIGIR '99.

[27]  Neel Sundaresan,et al.  A classifier for semi-structured documents , 2000, KDD '00.

[28]  Y. Benjamini,et al.  Controlling the false discovery rate: a practical and powerful approach to multiple testing , 1995 .

[29]  David H. Wolpert,et al.  Stacked generalization , 1992, Neural Networks.