Combining machine learning and hierarchical structures for text categorization

Text categorization is the process of algorithmically analyzing an electronic document to assign a set of categories (or index terms) that succinctly describe the content of the document. This assignment can be used for classification, filtering, or information retrieval purposes. Machine learning methods such as decision trees, inductive learning, neural networks, support vector machines, linear classifiers, k-nearest neighbor, and Bayesian learning have been applied to this problem, but most of these applications ignore the hierarchical structure of the underlying classification vocabulary. This dissertation focuses on the use of hierarchical classification structures, such as the UMLS Metathesaurus or the Yahoo! hierarchy of topics, to build and train machine learning algorithms for text categorization. For this purpose we use a variation of the Hierarchical Mixtures of Experts (HME) model adapted for text categorization. We evaluate the HME model using neural networks and linear classifiers as the nodes of the hierarchy. We explore in detail the use of different feature selection and training set selection methods. Experimental results are reported using a large collection of MEDLINE documents (the OHSUMED collection) to assess the effectiveness of the HME model for text categorization.
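As a rough illustration of how an HME combines the nodes of a hierarchy, the sketch below follows the standard two-level formulation, in which gating nodes produce softmax weights and leaf experts produce scores that are blended by those weights. It is not the adapted variant used in the dissertation; the function names, the sigmoid experts, and the weight layout are all assumptions made for the example.

```python
import math

def softmax(scores):
    # Numerically stable softmax over a list of gate scores.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def dot(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))

def hme_predict(x, top_gate, sub_gates, experts):
    """Two-level HME output: sum_i g_i * sum_j g_{j|i} * expert_ij(x).

    x         : feature vector of the document (hypothetical term weights)
    top_gate  : one weight vector per top-level branch
    sub_gates : sub_gates[i] holds one weight vector per child of branch i
    experts   : experts[i][j] is the weight vector of a linear leaf expert
    """
    g_top = softmax([dot(w, x) for w in top_gate])
    out = 0.0
    for i, g_i in enumerate(g_top):
        g_sub = softmax([dot(w, x) for w in sub_gates[i]])
        for j, g_ji in enumerate(g_sub):
            # Each expert's score is weighted by the gates on its path.
            out += g_i * g_ji * sigmoid(dot(experts[i][j], x))
    return out
```

With all weights set to zero, every gate is uniform and every expert returns 0.5, so the blended output is also 0.5; trained weights would shift mass toward the branches of the hierarchy that match the document.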
