Entropy-Based Model for Estimating Veracity of Topics from Tweets

Micro-blogging sites such as Twitter have grown rapidly in size and importance because they allow users to share experiences and opinions on issues as they occur. Since tweets cover a wide range of domains, many applications analyze them for knowledge extraction and prediction. As the popularity and volume of social media increase, the veracity of the data itself becomes a concern. Applications that process social media data usually assume that all information on social media is truthful and reliable. Data integrity, data authenticity, trusted origin, and trustworthiness are some of the aspects of trustworthy data. This paper proposes an entropy-based model to estimate the veracity of topics in social media from a truthfulness vantage point. The proposed model is compared with two existing big data veracity models: the OTC model (Objectivity, Truthfulness, and Credibility) and the DGS model (Diffusion, Geographic, and Spam indices). The proposed model is a bag-of-words model based on keyword distribution, whereas OTC depends on word sentiment and DGS depends on tweet distribution and content. For the analysis, data from three domains (flu, food poisoning, and politics) were used. Our experiments suggest that the approach used to define a model affects how the resulting measure ranks topics, although all three measures can place the topics on a veracity spectrum.
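
The abstract does not state the model's formula, so the following is only a minimal sketch of one way an entropy over a bag-of-words keyword distribution could be computed. It uses the standard Shannon entropy; the function name `keyword_entropy`, the whitespace tokenization, and the sample tweets and keywords are assumptions made for illustration and are not taken from the paper.

```python
import math
from collections import Counter


def keyword_entropy(tweets, keywords):
    """Shannon entropy (in bits) of the keyword frequency distribution.

    Assumption: tweets are split on whitespace and matched
    case-insensitively against a fixed keyword list. The paper's actual
    preprocessing and entropy definition may differ.
    """
    keyword_set = {k.lower() for k in keywords}
    # Bag-of-words count of each keyword across all tweets.
    counts = Counter(
        token
        for tweet in tweets
        for token in tweet.lower().split()
        if token in keyword_set
    )
    total = sum(counts.values())
    if total == 0:
        return 0.0  # No keyword occurs, so there is no distribution to measure.
    # H = sum over keywords of p * log2(1 / p), with p the relative frequency.
    return sum((c / total) * math.log2(total / c) for c in counts.values())


# Hypothetical sample data for a "flu" topic.
sample_tweets = ["flu fever today", "flu cough again"]
sample_keywords = ["flu", "fever", "cough"]
print(keyword_entropy(sample_tweets, sample_keywords))
```

In this sketch, a topic whose tweets concentrate on a single keyword yields an entropy of zero, while a topic whose keywords are spread evenly yields the maximum value for that keyword list. How such a value would map onto a veracity score is not specified in the abstract and is left open here.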
