Quo Vadis: A Corpus of Entities and Relations

This chapter describes a collective effort to build a corpus annotated with semantic relations on a text belonging to the belletristic genre. It presents annotation conventions for four categories of semantic relations and describes how the corpus was built collaboratively. Part of the annotation, such as the token/part-of-speech/lemma layer, is produced automatically during a preprocessing phase. An entity layer (in which entities of type person are marked) and a relation layer (marking binary relations between entities) are then added manually by a team of trained annotators, resulting in a heavily annotated file. A number of methods used to ensure annotation accuracy are detailed. Finally, some statistics over the corpus are reported. The language under investigation is Romanian, but the proposed annotation conventions and methodological hints are applicable to any language and text genre.
