Merging Comparable Data Sources for the Discrimination of Similar Languages: The DSL Corpus Collection

This paper presents the compilation of the DSL corpus collection, created for the DSL (Discriminating Similar Languages) shared task to be held at the VarDial workshop at COLING 2014. The DSL corpus collection was merged from three comparable corpora to provide a suitable dataset for automatic classification that discriminates between similar languages and language varieties. Along with the description of the DSL corpus collection, we present the results of baseline discrimination experiments, which reach up to 87.4% accuracy.
