A Linear Least Squares Fit Mapping Method for Information Retrieval From Natural Language Texts

This paper describes a unique method for mapping natural language texts to canonical terms that identify the contents of the texts. The method learns empirical associations between free-form texts and canonical terms from human-assigned matches and determines a Linear Least Squares Fit (LLSF) mapping function that represents weighted connections between words in the texts and the canonical terms. The mapping function lets us project an arbitrary text into the canonical term space, where the "transformed" text is compared with the terms and similarity scores are obtained that quantify the relevance between the text and the terms. This approach has superior power to discover synonyms or related terms and to preserve the context sensitivity of the mapping. We achieved a rate of 84% in both recall and precision on a testing set of 6,913 texts, outperforming other techniques including string matching (15%), morphological parsing (17%) and statistical weighting (21%).
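The pipeline described above (learn a mapping from human-assigned text/term pairs, project a new text into the term space, score it against the canonical terms) can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: the toy training pairs, the raw word-count weighting, the use of `numpy.linalg.lstsq` as the solver, cosine as the similarity score, and the helper names `build_matrix`, `fit_llsf`, and `best_term`.

```python
import numpy as np

# Hypothetical toy training pairs (free-form text, human-assigned canonical term).
PAIRS = [
    ("heart attack", "myocardial infarction"),
    ("stomach ache", "abdominal pain"),
    ("severe heart attack", "myocardial infarction"),
]

def build_matrix(docs, vocab):
    """Word-by-document count matrix: one row per word, one column per document."""
    index = {w: i for i, w in enumerate(vocab)}
    m = np.zeros((len(vocab), len(docs)))
    for j, doc in enumerate(docs):
        for w in doc.lower().split():
            if w in index:
                m[index[w], j] += 1.0
    return m

def fit_llsf(A, B):
    """Return W minimizing ||W A - B||_F (solved as A^T W^T = B^T)."""
    Wt, *_ = np.linalg.lstsq(A.T, B.T, rcond=None)
    return Wt.T

texts = [t for t, _ in PAIRS]
terms = [c for _, c in PAIRS]
text_vocab = sorted({w for t in texts for w in t.split()})
term_vocab = sorted({w for c in terms for w in c.split()})

A = build_matrix(texts, text_vocab)   # words in texts x training pairs
B = build_matrix(terms, term_vocab)   # words in terms x training pairs
W = fit_llsf(A, B)                    # weighted word-to-term-word connections

CANDIDATES = sorted(set(terms))
C = build_matrix(CANDIDATES, term_vocab)

def best_term(query):
    """Project the query into the term space and return the closest canonical term."""
    x = build_matrix([query], text_vocab)[:, 0]
    y = W @ x                                        # the "transformed" text
    denom = np.linalg.norm(C, axis=0) * np.linalg.norm(y)
    scores = (C.T @ y) / np.where(denom == 0, 1.0, denom)  # cosine similarity
    return CANDIDATES[int(np.argmax(scores))]

print(best_term("heart attack"))
```

Note that the query and the returned term need not share any words: the link between "heart attack" and "myocardial infarction" comes entirely from the learned weights in `W`, which is the kind of synonym discovery the abstract refers to.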
