A mining-based category evolution approach to managing online document categories

With the rapid expansion in the number and size of text repositories and improvements in global connectivity, the quantity of information available online as free-format text is growing exponentially. Many large organizations create and maintain huge volumes of textual information online, and there is a pressing need to support efficient and effective information retrieval, filtering, and management. Text categorization, the assignment of textual documents to one or more pre-defined categories based on their content, is an essential component of efficient document management and retrieval. Prior research has focused predominantly on developing or adopting statistical classification or inductive learning methods that automatically discover text categorization patterns for a pre-defined set of categories. However, as documents accumulate, such categories may no longer capture the documents' characteristics correctly. In this study, we propose a mining-based category evolution (MiCE) technique that adjusts document categories based on the existing categories and their associated documents. Empirical evaluation results indicate that MiCE was more effective than the category discovery approach and was insensitive to the quality of the original categories.
