Application of finite-state transducers to the unification of term variants (Aplicación de transductores de estado-finito a los procesos de unificación de términos)

A finite-state approach has been applied to the unification of terms in Spanish. Conflation algorithms are computational procedures used in some Information Retrieval (IR) systems to unify semantically equivalent term variants into a normalized form. The programs that usually carry out this process are called stemmers and lemmatizers. The objective of this work is to evaluate the deficiencies and errors of lemmatizers in the conflation of terms. The lemmatizer was built by implementing a linguistic tool that allows electronic dictionaries to be built and represented internally as Finite-State Transducers (FST). The lexical resources developed were applied to a verification corpus to evaluate the performance of these lexical parsers. The evaluation metric was an adaptation of the coverage and precision measures. The results show that the main limitation of unifying term variants with finite-state technology is under-analysis: some word forms receive no analysis at all.
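As an illustration only, the idea of a dictionary stored as a finite-state transducer and scored with coverage and precision can be sketched as follows. This is a minimal toy, not the tool or the metric used in the work: the class `DictionaryTransducer`, the function `coverage_and_precision`, the sample Spanish entries, and the exact definitions of coverage (share of forms that receive any analysis) and precision (share of analysed forms whose lemma matches a reference lemma) are all assumptions made for the example.

```python
from typing import Dict, List, Optional, Tuple


class DictionaryTransducer:
    """Toy transducer (hypothetical): reads a surface form, emits a lemma."""

    def __init__(self) -> None:
        # transitions[state] maps an input character to the next state id.
        self.transitions: List[Dict[str, int]] = [{}]
        # Final states carry the output string (the lemma).
        self.outputs: Dict[int, str] = {}

    def add(self, surface: str, lemma: str) -> None:
        """Add one dictionary entry, sharing prefixes with existing entries."""
        state = 0
        for ch in surface:
            nxt = self.transitions[state].get(ch)
            if nxt is None:
                nxt = len(self.transitions)
                self.transitions.append({})
                self.transitions[state][ch] = nxt
            state = nxt
        self.outputs[state] = lemma

    def lemmatize(self, surface: str) -> Optional[str]:
        """Return the lemma, or None when the form has no complete path."""
        state = 0
        for ch in surface:
            nxt = self.transitions[state].get(ch)
            if nxt is None:
                return None  # no transition: the form is left unanalysed
            state = nxt
        return self.outputs.get(state)  # None if the state is not final


def coverage_and_precision(
    fst: DictionaryTransducer, gold: List[Tuple[str, str]]
) -> Tuple[float, float]:
    """Assumed definitions: coverage = analysed / total,
    precision = correct / analysed."""
    analysed = 0
    correct = 0
    for surface, gold_lemma in gold:
        lemma = fst.lemmatize(surface)
        if lemma is None:
            continue
        analysed += 1
        if lemma == gold_lemma:
            correct += 1
    coverage = analysed / len(gold) if gold else 0.0
    precision = correct / analysed if analysed else 0.0
    return coverage, precision


# Hypothetical dictionary entries (surface form, lemma).
fst = DictionaryTransducer()
for form, lemma in [("canto", "cantar"), ("cantamos", "cantar"),
                    ("casas", "casa"), ("casa", "casa")]:
    fst.add(form, lemma)

# Hypothetical verification corpus with reference lemmas.
gold = [("cantamos", "cantar"), ("casas", "casa"),
        ("cantaron", "cantar"), ("canto", "canto")]

coverage, precision = coverage_and_precision(fst, gold)
print(f"coverage={coverage:.2f} precision={precision:.2f}")
```

In this toy corpus, "cantaron" is missing from the dictionary and so receives no analysis, which lowers coverage; "canto" is analysed as a form of "cantar" although the reference treats it as a noun, which lowers precision. A real system would also need to return several analyses for ambiguous forms, which this sketch does not attempt.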
