Paper information: Proceedings of the 13th Conference on Natural Language Processing, KONVENS 2016, Bochum, Germany, September 19-21, 2016


The main reasons for examining “non-standard data” with computational linguistics (CL) methods are that such data represents a great deal of language behavior and that it serves as an object of scientific study in linguistics as a whole. This is true of the syntax of second-language learners, the accents of non-native speakers, and the vocabularies of speakers of different dialects. Computational linguists have a good deal to offer to the various subfields of linguistics studying non-standard data. By automating steps in analysis, we make the analyses replicable and modifiable, we improve opportunities for comparing similar analyses, and, perhaps most importantly, we enable the analysis of large amounts of data, providing more comprehensive views. The data itself can be tricky to work with, however. Scientists in other fields are often specialized in a single language or language pair, which means that their data will not be varied enough to support all the research questions one would like to ask, e.g., how general a technique is for a particular purpose. In other cases, the data simply won’t have been collected with an eye to answering some interesting questions, which may mean that important parameters haven’t been recorded. Finally, non-automated analyses do not require data to be commensurate to the same strict degree as automated ones do, meaning that surprises can be in store even in well-studied data sets. This paper provides some concrete examples and discussion of these potential pitfalls. One can protect oneself from some of these risks by seeking collaboration with domain experts, which is to be recommended in any case as a way of making the work richer and better informed. Further, it makes sense to approach novel sorts of data, and even novel sources of data of a sort one believes to be familiar, with a broad range of potential research questions. There is an awful lot of interesting work still to be done!

John Nerbonne
