A Comparable Corpus Based on Aligned Multilingual Ontologies

In this paper we present a methodology for building comparable corpora using multilingual ontologies of a specific domain. This resource can be exploited to foster research on multilingual corpus-based ontology learning, population, and matching. The resource-building process is exemplified by the construction of annotated comparable corpora in English, Portuguese, and French. The corpora, from the conference organization domain, are built using the multilingual ontology concept labels as seeds for crawling relevant documents from the web through a search engine. Using ontologies as the source of seed terms allows better coverage of the domain. The main goal of this paper is to describe the design methodology followed in creating the corpora. We present a preliminary evaluation of the corpora and discuss their characteristics and potential applications.
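
The seed-generation step can be illustrated with a minimal sketch. It assumes that concept labels are available per language and that seed queries are formed by combining labels into small tuples of quoted phrases; the labels, the tuple size, and the function name `build_seed_queries` are hypothetical and not taken from the paper. Submitting the queries to a search engine and downloading the results is left out.

```python
from itertools import combinations

# Hypothetical concept labels for the conference organization domain.
# These are illustrative only, not the labels of the ontologies used in the paper.
LABELS = {
    "en": ["conference", "program committee", "reviewer", "accepted paper"],
    "pt": ["conferência", "comitê de programa", "revisor", "artigo aceito"],
    "fr": ["conférence", "comité de programme", "relecteur", "article accepté"],
}


def build_seed_queries(labels, tuple_size=2):
    """Combine concept labels into quoted search queries, one list per language.

    Assumption: each query is a tuple of `tuple_size` labels, each quoted so
    that a search engine would treat multiword labels as phrases.
    """
    queries = {}
    for lang, terms in labels.items():
        queries[lang] = [
            " ".join(f'"{term}"' for term in combo)
            for combo in combinations(terms, tuple_size)
        ]
    return queries
```

Because the same concepts supply the seeds in every language, the query sets are aligned by construction, which is what would make the retrieved document collections comparable across languages.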
