Comparing corpora and lexical ambiguity

In this paper we compare two types of corpus, focusing on the lexical ambiguity of each. The first corpus consists mainly of newspaper articles and literature excerpts, while the second belongs to the medical domain. To conduct the study, we use two different disambiguation tools. We first verify the performance of each system in its respective application domain. We then use these systems to assess and compare both the general ambiguity rate and the particularities of each domain. Quantitative results show that medical documents are lexically less ambiguous than unrestricted documents. Our conclusions highlight the importance of the application area in the design of NLP tools.
