Proceedings of the 13th Conference on Natural Language Processing, KONVENS 2016, Bochum, Germany, September 19-21, 2016

The most important reasons for examining “non-standard data” with CL methods are that this data represents a great deal of language behavior and that it serves as an object of scientific study in linguistics as a whole. This is true of the syntax of second-language learners, the accents of non-native speakers, and the vocabularies of speakers of different dialects. Computational linguists have a good deal to offer to the various subfields of linguistics that study non-standard data. By automating steps in an analysis we make it replicable and modifiable, we improve opportunities for comparing similar analyses, and, perhaps most importantly, we enable the analysis of large amounts of data, providing more comprehensive views. The data itself can be tricky to work with, however. Scientists in other fields are often specialized in a single language or language pair, which means that their data will not be varied enough to support all the research questions one would like to ask, e.g., how general a technique is for a particular purpose. In other cases, the data simply won’t have been collected with an eye to answering some interesting questions, which may mean that important parameters haven’t been recorded. Finally, we note that non-automated analyses do not require data to be commensurate to the same strict degree as automated ones do, meaning that surprises can be in store even in well-studied data sets. This paper provides some concrete examples and discussion of these potential pitfalls. One can protect oneself from some of these risks by seeking collaboration with domain experts, which is to be recommended in any case as a way of making the work richer and better informed. Further, it makes sense to approach novel sorts of data (and even novel sources of data of a sort one believes to be familiar) with a broad range of potential research questions.
There is an awful lot of interesting work still to be done!
