Characterizing discussions in the Spanish Wikipedia

Wikipedia, the largest online encyclopedia, is edited collaboratively by hundreds of users. The content of some articles is disputed, giving rise to discussions that are recorded in the corresponding talk pages. In this paper, we propose an annotation schema for Spanish Wikipedia talk pages in order to determine the types of opinions expressed in them. We apply the annotation schema to a corpus of discussions on 148 topics drawn from 25 Spanish Wikipedia talk pages. We make the resulting dataset publicly available for download on GitHub. Furthermore, we train and evaluate supervised machine learning models to automatically identify the annotation labels. A linear support vector classifier (LinearSVC) outperforms the other baseline models and achieves an F1 score of 0.71 in our experiments.
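For readers unfamiliar with the reported metric, the following is a minimal sketch of how an F1 score can be computed for a multi-class labeling task. It is illustrative only: the abstract does not say how F1 was averaged across labels, so macro-averaging is an assumption here, and the labels `A`, `B`, `C` and the function names are hypothetical placeholders rather than the paper's annotation schema or code.

```python
from collections import Counter


def f1_per_label(gold, predicted):
    """Return a dict mapping each label to its F1 score."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, predicted):
        if g == p:
            tp[g] += 1          # correct prediction for label g
        else:
            fp[p] += 1          # p was predicted but is wrong
            fn[g] += 1          # g was the true label but was missed
    scores = {}
    for label in sorted(set(gold) | set(predicted)):
        # F1 = 2TP / (2TP + FP + FN); defined as 0.0 when the denominator is 0
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores[label] = 2 * tp[label] / denom if denom else 0.0
    return scores


def macro_f1(gold, predicted):
    """Unweighted mean of per-label F1 (assumed averaging scheme)."""
    scores = f1_per_label(gold, predicted)
    return sum(scores.values()) / len(scores)


# Hypothetical toy labels, not the paper's annotation schema.
gold = ["A", "B", "A", "C"]
pred = ["A", "A", "A", "C"]
per_label = f1_per_label(gold, pred)
overall = macro_f1(gold, pred)
```

Macro-averaging weights every label equally regardless of frequency; a weighted or micro average would give different values on imbalanced data, so a reported score should be read together with the averaging scheme actually used.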
