Machine learning in automated text categorization

The automated categorization (or classification) of texts into predefined categories has attracted booming interest over the last 10 years, owing to the increased availability of documents in digital form and the ensuing need to organize them. In the research community, the dominant approach to this problem is based on machine learning techniques: a general inductive process automatically builds a classifier by learning the characteristics of the categories from a set of preclassified documents. The advantages of this approach over the knowledge engineering approach (in which domain experts define a classifier manually) are very good effectiveness, considerable savings in expert labor, and straightforward portability to different domains. This survey discusses the main approaches to text categorization that fall within the machine learning paradigm. We discuss in detail issues pertaining to three different problems: document representation, classifier construction, and classifier evaluation.
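
The three problems named above can be illustrated with a minimal sketch. It is not the survey's method: it assumes a bag-of-words representation, a multinomial naive Bayes learner with add-one smoothing (one of many possible inductive learners), and precision/recall as effectiveness measures. The function names and the toy corpus are hypothetical and exist only for illustration.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    """Document representation: a lowercase bag of words (assumed scheme)."""
    return text.lower().split()


def train(labeled_docs):
    """Classifier construction: collect the counts naive Bayes needs
    from a set of preclassified (text, category) pairs."""
    class_docs = Counter()              # number of documents per category
    word_counts = defaultdict(Counter)  # word frequencies per category
    vocab = set()
    for text, label in labeled_docs:
        class_docs[label] += 1
        for word in tokenize(text):
            word_counts[label][word] += 1
            vocab.add(word)
    return class_docs, word_counts, vocab


def classify(model, text):
    """Return the category with the highest log posterior score."""
    class_docs, word_counts, vocab = model
    total_docs = sum(class_docs.values())
    best_label, best_score = None, -math.inf
    for label in sorted(class_docs):
        total_words = sum(word_counts[label].values())
        score = math.log(class_docs[label] / total_docs)
        for word in tokenize(text):
            if word in vocab:  # words unseen in training are ignored
                # Add-one (Laplace) smoothing avoids zero probabilities.
                score += math.log(
                    (word_counts[label][word] + 1) / (total_words + len(vocab))
                )
        if score > best_score:
            best_label, best_score = label, score
    return best_label


def precision_recall(model, test_docs, category):
    """Classifier evaluation: precision and recall for one category."""
    tp = fp = fn = 0
    for text, gold in test_docs:
        predicted = classify(model, text)
        if predicted == category and gold == category:
            tp += 1
        elif predicted == category:
            fp += 1
        elif gold == category:
            fn += 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Toy preclassified corpus (invented for this sketch).
TRAIN = [
    ("wheat harvest grain exports rise", "grain"),
    ("grain prices fall after wheat surplus", "grain"),
    ("central bank raises interest rates", "finance"),
    ("bank shares fall as interest rates rise", "finance"),
]
TEST = [
    ("wheat exports fall", "grain"),
    ("bank cuts interest rates", "finance"),
]

model = train(TRAIN)
print(classify(model, "wheat exports fall"))
print(precision_recall(model, TEST, "grain"))
```

Moving this sketch to another domain requires only a different set of preclassified documents, not hand-written rules, which is the portability advantage claimed over the knowledge engineering approach.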
