FASTR: A Unification-Based Front-End to Automatic Indexing

Most natural language processing approaches to full-text information retrieval index documents by the occurrences of controlled terms they contain. An important problem with this approach is that terms accept numerous variations, so many relevant documents are not retrieved. For example, "myeloid leukaemia cells" and "myeloid and erythroid cell" are two occurrences of "myeloid cell" that cannot be detected without an account of local morpho-syntactic variations. In this paper, we present a linguistic analysis of the observed variations and a three-tier constraint-based formalism for representing them. This technique is implemented in FASTR, a natural language processing tool that extracts terms and their variants from full-text documents. We justify the choice of a unification-based formalism by its expressivity and by the addition of conceptual and computational devices that make the parser computationally tractable. Contrary to common belief, high-quality natural language processing through unification can meet industrial requirements, provided that the application is carefully designed to control and minimize data accesses and computation times. The effectiveness of FASTR for extracting correct occurrences is supported by experiments on two English corpora of scientific abstracts and a list of 71,623 controlled terms. We report that accounting for three kinds of variants (insertions, permutations and coordinations) increases recall by 16.7% without altering precision.
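To make the three kinds of variants concrete, the following is a minimal sketch of how a two-word term such as "myeloid cell" might be matched against its insertion, coordination and permutation variants. It is not FASTR's unification-based formalism: it is a simple token-window heuristic, and every name in it (`stem`, `tokenize`, `classify_variant`, `max_gap`, the word lists) is a hypothetical choice made for illustration only.

```python
import re

# Assumed, illustrative word lists; a real system would rely on a lexicon.
COORDINATORS = {"and", "or"}
PREPOSITIONS = {"of", "in", "from", "for"}


def stem(word):
    """Crude plural stripping, a stand-in for real morphological analysis."""
    w = word.lower()
    if len(w) > 3 and w.endswith("s") and not w.endswith("ss"):
        return w[:-1]
    return w


def tokenize(text):
    """Split text into lowercased, crudely stemmed alphabetic tokens."""
    return [stem(t) for t in re.findall(r"[A-Za-z]+", text)]


def classify_variant(term, text, max_gap=2):
    """Classify how a two-word term (modifier + head) occurs in text.

    Returns 'exact', 'insertion', 'coordination', 'permutation' or None.
    At most max_gap tokens may separate the two term words.
    """
    modifier, head = tokenize(term)  # assumes the term has exactly two words
    tokens = tokenize(text)
    for i, tok in enumerate(tokens):
        window_end = min(i + 2 + max_gap, len(tokens))
        if tok == modifier:
            # Look for the head to the right of the modifier.
            for j in range(i + 1, window_end):
                if tokens[j] == head:
                    gap = tokens[i + 1:j]
                    if not gap:
                        return "exact"
                    if gap[0] in COORDINATORS:
                        return "coordination"
                    return "insertion"
        if tok == head:
            # Look for "head <preposition> ... modifier" to the right.
            for j in range(i + 1, window_end):
                if tokens[j] == modifier:
                    gap = tokens[i + 1:j]
                    if gap and gap[0] in PREPOSITIONS:
                        return "permutation"
    return None


if __name__ == "__main__":
    for phrase in ("myeloid leukaemia cells",
                   "myeloid and erythroid cell",
                   "cells of myeloid origin"):
        print(phrase, "->", classify_variant("myeloid cell", phrase))
```

A heuristic of this kind overgenerates, because it checks word order and distance but not syntactic structure; the constraint-based representation described in the abstract is what the paper relies on to keep precision unchanged.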
