Text categorization: the assignment of subject descriptors to magazine articles

Abstract: Automatic text categorization is an important research area with potential for many text-based applications, including text routing and filtering. Typical text classifiers learn from example texts that have been manually categorized. When categorizing magazine articles with broad subject descriptors, we study three aspects of text classification: (1) effective selection of feature words and proper names that reflect the main topics of the text; (2) learning algorithms; and (3) improving the quality of the learned classifier by selecting examples. The χ² test, which is sometimes used for selecting terms that are highly related to a text class, is applied in a novel way when constructing a category weight vector. Despite a limited number of training examples, combining effective feature selection with the χ² learning algorithm for training the text classifier results in adequate categorization of new magazine articles.
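
To make the role of the χ² test concrete, the sketch below computes the standard 2x2 χ² statistic for the association between a term and a category and collects the scores into a per-category dictionary. It is an illustration only: the abstract does not spell out how the category weight vector is built, so the function names (`chi_square`, `category_term_scores`), the representation of documents as sets of terms, and the toy data are all assumptions, not the authors' method.

```python
from typing import Dict, List, Set


def chi_square(a: int, b: int, c: int, d: int) -> float:
    """Standard chi-square statistic for a 2x2 term/category table.

    a: documents in the category that contain the term
    b: documents outside the category that contain the term
    c: documents in the category that lack the term
    d: documents outside the category that lack the term
    """
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        # A zero marginal means the table carries no association evidence.
        return 0.0
    return n * (a * d - b * c) ** 2 / denom


def category_term_scores(
    docs: List[Set[str]], labels: List[str], category: str
) -> Dict[str, float]:
    """Score every term by its chi-square association with one category.

    Hypothetical helper: treats each document as a set of terms and uses
    the raw statistic as the term's score for the category.
    """
    vocabulary = set().union(*docs) if docs else set()
    scores: Dict[str, float] = {}
    for term in vocabulary:
        a = b = c = d = 0
        for doc, label in zip(docs, labels):
            in_category = label == category
            has_term = term in doc
            if has_term and in_category:
                a += 1
            elif has_term:
                b += 1
            elif in_category:
                c += 1
            else:
                d += 1
        scores[term] = chi_square(a, b, c, d)
    return scores


# Toy data, invented for illustration.
docs = [
    {"goal", "match"},
    {"match", "team"},
    {"vote", "party"},
    {"party", "election"},
]
labels = ["sport", "sport", "politics", "politics"]
scores = category_term_scores(docs, labels, "sport")
```

Note that the statistic is symmetric: a term that is strongly absent from a category scores as high as one that is strongly present, so a practical classifier would also need to check the direction of the association (for example, whether `a * d > b * c`) before using a term as positive evidence.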
