Controversy detection in Wikipedia using semantic dissimilarity

Abstract The advent of search engines and wikis has made access to information easy and almost free. Wikipedia is the product of an enormous collaborative effort, and its peer-review-like processes for creating, maintaining, and evolving content ensure high quality and reliability. However, Wikipedia's “anyone-can-edit” policy has also led to problems such as trolling, vandalism, and controversies, and the involvement of non-experts raises doubts about the content and reliability of the information provided. Earlier work has identified and ranked controversies in Wikipedia articles with techniques based on quantitative data, ignoring the semantic significance of conflicts among authors. In this paper, we address the problem of identifying controversy using natural language processing techniques for the first time. The proposed method detects how new edits change the existing meaning of the text and how those edits relate to the topic of the article. Experimental results (precision 0.901, recall 0.901, accuracy 0.908, and F-measure 0.901) demonstrate the effectiveness of the proposed method. The technique is useful for automatically identifying conflicts newly introduced into existing article content, and could help in deciding whether to include or exclude controversies under the same topic.
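
For readers unfamiliar with the four evaluation measures reported in the abstract, the following minimal sketch shows how they are conventionally computed from a binary confusion matrix (controversial vs. non-controversial). The `Confusion` class, the `scores` function, and the example counts are illustrative assumptions; they are not taken from the paper and do not reproduce its data or its exact evaluation procedure.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Confusion:
    """Counts for a binary classifier (hypothetical container, not from the paper)."""
    tp: int  # controversial edits correctly flagged
    fp: int  # non-controversial edits wrongly flagged
    fn: int  # controversial edits missed
    tn: int  # non-controversial edits correctly passed


def scores(c: Confusion) -> dict[str, float]:
    """Return precision, recall, accuracy, and balanced F-measure (F1)."""
    flagged = c.tp + c.fp
    actual = c.tp + c.fn
    total = c.tp + c.fp + c.fn + c.tn
    precision = c.tp / flagged if flagged else 0.0
    recall = c.tp / actual if actual else 0.0
    accuracy = (c.tp + c.tn) / total if total else 0.0
    denom = precision + recall
    f_measure = 2 * precision * recall / denom if denom else 0.0
    return {
        "precision": precision,
        "recall": recall,
        "accuracy": accuracy,
        "f_measure": f_measure,
    }


if __name__ == "__main__":
    # Made-up counts for illustration only; these are NOT the paper's results.
    print(scores(Confusion(tp=8, fp=2, fn=2, tn=8)))
```

Because the F-measure is the harmonic mean of precision and recall, it equals both whenever the two coincide, which is consistent with the abstract reporting 0.901 for all three.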
