An example-based mapping method for text categorization and retrieval

A unified model for text categorization and text retrieval is introduced. We use a training set of manually categorized documents to learn word-category associations, and use these associations to predict the categories of arbitrary documents. Similarly, we use a training set of queries and their related documents to obtain empirical associations between query words and indexing terms of documents, and use these associations to predict the related documents of arbitrary queries. A Linear Least Squares Fit (LLSF) technique is employed to estimate the likelihood of these associations. Document collections from the MEDLINE database and Mayo patient records are used to study the effectiveness of our approach and how much that effectiveness depends on the choice of training data, indexing language, word-weighting scheme, and morphological canonicalization. Alternative methods are also tested on these collections for comparison. It is evident that the LLSF approach makes effective use of the relevance information contained in human decisions of categorization and retrieval, and achieves a semantic mapping of free texts to their representations in an indexing language. Such a semantic mapping leads to a significant improvement in categorization and retrieval compared to alternative approaches.
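To make the idea of learning word-category associations by a least squares fit concrete, here is a minimal sketch on toy data. It is an illustration under stated assumptions, not the authors' implementation: the vocabulary, categories, training documents, and the function names `fit_llsf` and `predict` are all hypothetical, and the fit is solved through the normal equations with Gauss-Jordan elimination, which stands in for whatever numerical solver the paper actually uses. The optional `ridge` term is likewise an addition for numerical safety and is off by default.

```python
from typing import List

Matrix = List[List[float]]


def transpose(m: Matrix) -> Matrix:
    return [list(row) for row in zip(*m)]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    bt = transpose(b)
    return [[sum(x * y for x, y in zip(row, col)) for col in bt] for row in a]


def solve(m: Matrix, rhs: Matrix) -> Matrix:
    """Solve m X = rhs by Gauss-Jordan elimination with partial pivoting."""
    n = len(m)
    aug = [list(m[i]) + list(rhs[i]) for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("singular system: add training data or a ridge term")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(n):
            if r != col:
                f = aug[r][col]
                aug[r] = [v - f * w for v, w in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def fit_llsf(docs: Matrix, cats: Matrix, ridge: float = 0.0) -> Matrix:
    """Return W (words x categories) minimizing ||docs W - cats||^2.

    docs: one row per training document, one column per word weight.
    cats: one row per training document, one column per category.
    ridge: assumed optional regularizer, not part of the plain fit.
    """
    dt = transpose(docs)
    gram = matmul(dt, docs)  # words x words
    for i in range(len(gram)):
        gram[i][i] += ridge
    return solve(gram, matmul(dt, cats))


def predict(weights: Matrix, doc: List[float]) -> List[float]:
    """Map a document's word vector to one score per category."""
    return matmul([doc], weights)[0]


# Hypothetical toy data. Words: heart, attack, lung, cancer.
train_docs = [
    [1.0, 1.0, 0.0, 0.0],
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 1.0],
    [0.0, 0.0, 0.0, 1.0],
]
# Categories: cardiology, oncology (manually assigned in this toy set).
train_cats = [
    [1.0, 0.0],
    [1.0, 0.0],
    [0.0, 1.0],
    [0.0, 1.0],
]

weights = fit_llsf(train_docs, train_cats)
scores = predict(weights, [1.0, 0.0, 1.0, 0.0])  # unseen document: "heart lung"
```

On this toy set the unseen document scores higher for cardiology than for oncology, because the fit assigns the discriminating weight to "heart" and "cancer" rather than to "lung". The retrieval case described in the abstract has the same shape: rows of `train_docs` would hold query word vectors and rows of `train_cats` the indexing terms of the related documents.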
