Proceedings of the Joint Workshop on Language Technology for Closely Related Languages, Varieties and Dialects

We present the results of the 2nd edition of the Discriminating between Similar Languages (DSL) shared task, which was organized as part of the LT4VarDial’2015 workshop and focused on the identification of very similar languages and language varieties. Unlike the 2014 edition, the 2015 edition included an Others category containing languages that were not seen in training. Moreover, we provided two test datasets: one using the original texts (test set A), and one with named entities replaced by placeholders (test set B). Ten teams participated in the task, and the best-performing system achieved 95.54% average accuracy on test set A and 94.01% on test set B.
