Reducing Information Variation in Text

We discuss the nature and scope of linguistic variation of terms (morphological, syntactic, and semantic) and its impact on two information retrieval tasks: term acquisition and automatic indexing. We review existing natural language processing techniques in these two areas and present in depth FASTR, a corpus processor for the recognition, normalization, and acquisition of multi-word terms.
