Research Paper: Corpus-based Statistical Screening for Phrase Identification

Purpose: The authors study the extraction of useful phrases from a natural language database by statistical methods. The aim is to leverage human effort by providing preprocessed phrase lists with a high percentage of useful material.

Method: The authors develop six scoring methods, each based on a different aspect of phrase occurrence. The emphasis is not on lexical information or syntactic structure but on the statistical properties of word pairs and triples that can be obtained from a large database.

Measurements: The Unified Medical Language System (UMLS) incorporates a large list of humanly acceptable phrases in the medical field as part of its structure. The authors use this list as a gold standard for validating their methods. A good method is one that ranks the UMLS phrases high among all phrases studied. Measurements are 11-point average precision values and precision-recall curves based on the rankings.

Result: The authors find that each of the six scoring methods is effective in identifying UMLS-quality phrases in a large subset of MEDLINE. The methods are applicable to both word pairs and word triples. All six methods are optimally combined to produce composite scoring methods that are more effective than any single method. The quality of the composite methods appears sufficient to support the automatic placement of hyperlinks in text at the site of highly ranked phrases.

Conclusion: Statistical scoring methods provide a promising approach to the extraction of useful phrases from a natural language database for the purpose of indexing or providing hyperlinks in text.
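The 11-point average precision named under Measurements can be illustrated with a minimal sketch. It assumes the common interpolated definition (the best precision reached at or beyond each of the recall levels 0.0, 0.1, ..., 1.0, averaged over the 11 levels); the abstract does not state which variant the authors used. The function name, the toy phrases, and the choice to measure recall against the gold phrases that actually appear in the ranking are all assumptions made for illustration.

```python
def eleven_point_average_precision(ranked, gold):
    """Interpolated 11-point average precision of a ranked phrase list.

    ranked: candidate phrases, best score first (assumed free of duplicates).
    gold:   set of phrases accepted by the gold standard.
    Recall is measured against the gold phrases present in `ranked`
    (an assumption of this sketch).
    """
    relevant_total = sum(1 for phrase in ranked if phrase in gold)
    if relevant_total == 0:
        return 0.0

    # Record (hits so far, precision) each time a gold phrase is encountered.
    points = []
    hits = 0
    for rank, phrase in enumerate(ranked, start=1):
        if phrase in gold:
            hits += 1
            points.append((hits, hits / rank))

    total = 0.0
    for i in range(11):
        # recall >= i/10, written with integers to avoid float comparisons
        reachable = [prec for h, prec in points if h * 10 >= i * relevant_total]
        total += max(reachable) if reachable else 0.0
    return total / 11


if __name__ == "__main__":
    # Hypothetical ranking: "a" and "c" stand in for gold-standard phrases.
    print(eleven_point_average_precision(["a", "b", "c", "d"], {"a", "c"}))
```

A ranking that places every gold phrase ahead of every other candidate scores 1.0 under this definition, which matches the abstract's criterion that a good method ranks the UMLS phrases high among all phrases studied.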
