Paper information - Mining corpora of computer-mediated communication: analysis of linguistic features in Wikipedia talk pages using machine learning methods

Machine learning methods offer great potential for automatically investigating large amounts of data in the humanities. Our contribution to the workshop reports on ongoing work in the BMBF project KobRA (http://www.kobra.tu-dortmund.de), in which we apply machine learning methods to the analysis of large corpora in language-focused research on computer-mediated communication (CMC). At the workshop, we will discuss first results from training a Support Vector Machine (SVM) to classify selected linguistic features in talk pages of the German Wikipedia corpus in DeReKo, provided by the IDS Mannheim. We will investigate different representations of the data in order to make complex syntactic and semantic information available to the SVM. The results are intended to support both corpus-based research on CMC and the annotation of linguistic features in CMC corpora.
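The abstract does not say which data representations are compared. As one purely illustrative possibility, the sketch below computes a normalized character n-gram (spectrum) kernel between talk-page postings; the resulting Gram matrix is the kind of input an SVM implementation that accepts precomputed kernels could be trained on. The function names, the choice of n = 3, and the example postings are assumptions made for illustration and are not taken from the KobRA project.

```python
from collections import Counter
from math import sqrt


def ngram_counts(text: str, n: int = 3) -> Counter:
    """Count overlapping character n-grams in a lowercased string."""
    text = text.lower()
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def spectrum_kernel(a: str, b: str, n: int = 3) -> float:
    """Normalized n-spectrum kernel: cosine similarity of n-gram count vectors."""
    ca, cb = ngram_counts(a, n), ngram_counts(b, n)
    dot = sum(count * cb[gram] for gram, count in ca.items())
    norm = sqrt(sum(v * v for v in ca.values())) * sqrt(sum(v * v for v in cb.values()))
    # Strings shorter than n have no n-grams; define their similarity as 0.
    return dot / norm if norm else 0.0


def gram_matrix(texts, n: int = 3):
    """Pairwise kernel values for a list of strings (symmetric matrix)."""
    return [[spectrum_kernel(s, t, n) for t in texts] for s in texts]


# Hypothetical postings, invented for illustration (not corpus data).
postings = [
    "Das stimmt nicht, weil die Quelle fehlt.",
    "Das stimmt, weil die Quelle das belegt.",
    "Bitte den Artikel verschieben.",
]
K = gram_matrix(postings)
```

In this toy setting the two postings containing "weil" clauses receive a higher kernel value with each other than either does with the third posting. A surface string kernel like this captures no syntactic structure, so it only stands in for the richer syntactic and semantic representations the abstract has in mind.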

Eliza Margaretha | Harald Lüngen | Christian Pölitz | Michael Beißwenger
