The American National Corpus: A Standardized Resource for American English

Linguistic research has become heavily reliant on text corpora over the past ten years. Such resources are becoming increasingly available through efforts such as the Linguistic Data Consortium (LDC) in the US and the European Language Resources Association (ELRA) in Europe. However, the corpora gathered and distributed through these and other channels consist mainly of texts that can be acquired easily and redistributed without undue copyright problems. This practice has resulted in a vast over-representation of certain genres among available corpora, in particular newspaper samples, which make up the largest share of texts currently available from, for example, the LDC, and which also dominate the training data available for speech recognition. Other available corpora typically consist of technical reports, transcriptions of parliamentary and other proceedings, short telephone conversations, and the like. As a result, corpus-based natural language processing has relied heavily on language samples representative of usage in a handful of limited and linguistically specialized domains.
