docrep: A lightweight and efficient document representation framework

Modelling linguistic phenomena requires highly structured and complex data representations. Document representation frameworks (DRFs) provide an interface to store and retrieve multiple annotation layers over a document. Researchers face a difficult choice: use a heavyweight DRF or implement a custom one. Either way the cost is substantial: learning a new, complex system, or continually adding features to a home-grown system that risks overrunning its original scope. We introduce DOCREP, a lightweight and efficient DRF, and compare it against existing DRFs. We discuss our design goals and our implementations in C++, Python, and Java. We transform the OntoNotes 5 corpus using both DOCREP and UIMA, providing a quantitative comparison as well as a discussion of modelling trade-offs. We conclude with qualitative feedback from researchers who have used DOCREP for their own projects. Ultimately, we hope DOCREP is useful for the busy researcher who wants the benefits of a DRF but has better things to do than write one.
