Building Linguistic Corpora from Wikipedia Articles and Discussions

Wikipedia is a valuable resource, useful as a linguistic corpus or as a dataset for many kinds of research. We built corpora from Wikipedia articles and talk pages in the I5 format, a TEI customisation used in the German Reference Corpus (Deutsches Referenzkorpus, DeReKo). Our approach is a two-stage conversion that combines parsing with the Sweble parser and transformation with XSLT stylesheets. The conversion generates rich and valid corpora regardless of language. We also introduce a method to segment user contributions in talk pages into postings.
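
The abstract does not spell out how talk pages are segmented into postings, so the following is only a minimal illustrative sketch of one common heuristic, not the authors' method. It assumes English-Wikipedia-style signatures (a [[User:...]] link followed by a UTC timestamp) and treats each signature as the end of a posting; the names SIGNATURE, Posting, and segment_postings are hypothetical, and section headings and unsigned contributions are not handled.

```python
import re
from dataclasses import dataclass
from typing import List

# Assumption: a signature is a [[User:...]] link followed by a UTC timestamp,
# as commonly seen on English Wikipedia talk pages.
SIGNATURE = re.compile(
    r"\[\[User:(?P<user>[^|\]]+)[^\]]*\]\].*?"
    r"(?P<time>\d{2}:\d{2}, \d{1,2} \w+ \d{4} \(UTC\))"
)


@dataclass
class Posting:
    author: str
    timestamp: str
    depth: int  # number of leading ':' on the first line (reply indentation)
    text: str


def segment_postings(wikitext: str) -> List[Posting]:
    """Split talk-page wikitext into postings, closing one at each signature."""
    postings: List[Posting] = []
    buffer: List[str] = []
    for line in wikitext.splitlines():
        if not line.strip():
            continue  # skip blank lines
        match = SIGNATURE.search(line)
        if match is None:
            buffer.append(line)  # posting continues on the next line
            continue
        buffer.append(line[: match.start()])  # keep the text before the signature
        depth = len(buffer[0]) - len(buffer[0].lstrip(":"))
        text = " ".join(part.lstrip(":").strip() for part in buffer).strip()
        postings.append(Posting(match["user"].strip(), match["time"], depth, text))
        buffer = []
    # Any unsigned trailing text left in the buffer is dropped in this sketch.
    return postings


sample = (
    "I think the lead section is too long. "
    "[[User:Alice|Alice]] ([[User talk:Alice|talk]]) 10:15, 3 May 2012 (UTC)\n"
    ":Agreed, I will shorten it.\n"
    ":It should take a day. "
    "[[User:Bob|Bob]] ([[User talk:Bob|talk]]) 11:02, 3 May 2012 (UTC)\n"
)
postings = segment_postings(sample)
for posting in postings:
    print(posting.depth, posting.author, posting.timestamp, posting.text)
```

Signature formats vary across language editions and users, so a real pipeline would likely need language-specific patterns and fallbacks for unsigned text.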
