Web-Based Sources for an Annotated Corpus Building and Composite Proper Name Identification

Text collections annotated on several levels are valuable resources, but developing them requires considerable effort for languages such as Spanish. In this work, we present the initial step, lexical-level annotation, in compiling an annotated Mexican corpus from Web-based sources. We also describe a method, based on heterogeneous knowledge and simple Web-based sources, for the proper name identification that this annotation requires. We focus on composite entities (names with coordinated constituents, names with several prepositional phrases, and names of songs, books, movies, etc.). Preliminary results are presented.
