The Per-Fide Corpus: A New Resource for Corpus-Based Terminology, Contrastive Linguistics and Translation Studies

The Per-Fide project is a collaboration between researchers at the Department of Informatics and the Institute of Arts and Humanities at the University of Minho, Portugal. The acronym Per-Fide stands for Portuguese (P) in parallel with six languages: English (E), Russian (R), French (F), Italian (I), German/Deutsch (D) and Spanish/Español (E). First, we discuss the role of the Per-Fide project within the context of existing corpora that include the Portuguese language in its different variants, namely European Portuguese, Brazilian Portuguese and the Portuguese spoken in African countries (Angola, Mozambique, Guinea-Bissau, Cape Verde, São Tomé and Príncipe). The idea of creating a multilingual parallel corpus project in which Portuguese assumes a pivotal role arose primarily because the majority of online corpora that include Portuguese are either monolingual or bilingual. Furthermore, these corpora focus mainly on one specific text type. The few multilingual parallel corpora that do include Portuguese consist of a relatively small Portuguese subcorpus and provide limited search facilities, mainly because the Portuguese texts have not been morphologically tagged and/or syntactically annotated. Our second goal in this chapter is to provide an overview of the design criteria for the development of tools and resources in the various stages of the Per-Fide corpora construction process, focusing particularly on automation, validation, generalization and resource sharing. This overview includes a brief description of the workflow components involved in the pre- and post-alignment phases. Finally, we draw attention to several practical applications of the current features of the Per-Fide corpus in translation practice and contrastive linguistic studies, focusing on the use and potential of probabilistic translation dictionaries and the role that parallel corpora can play in translating idioms.
