The Manually Annotated Sub-Corpus: A Community Resource for and by the People

The Manually Annotated Sub-Corpus (MASC) project provides data and annotations to serve as the base for a community-wide effort to annotate a subset of the American National Corpus. The MASC infrastructure enables contributed annotations to be incorporated into a single, usable format that can then be analyzed as is or ported to any of a variety of other formats. MASC includes data from a much wider variety of genres than existing multiply-annotated corpora of English, and the project is committed to a fully open model of distribution, without restriction, for all data and annotations produced or contributed. As such, MASC is the first large-scale, open, community-based effort to create much-needed language resources for NLP. This paper describes the MASC project, its corpus, and its annotations, and serves as a call for contributions of data and annotations from the language processing community.
