Veracity Roadmap: Is Big Data Objective, Truthful and Credible?

This paper argues that big data can possess different characteristics that affect its quality. Depending on its origin, the data processing technologies, and the methodologies used for data collection and scientific discovery, big data can have biases, ambiguities, and inaccuracies, which need to be identified and accounted for to reduce inference errors and improve the accuracy of generated insights. Big data veracity is now being recognized as a necessary property for its utilization, complementing the three previously established quality dimensions (volume, variety, and velocity), but there has been little discussion of the concept of veracity thus far. This paper provides a roadmap for theoretical and empirical definitions of veracity along with its practical implications. We explore veracity across three main dimensions: 1) objectivity/subjectivity, 2) truthfulness/deception, and 3) credibility/implausibility. We propose to operationalize each of these dimensions with either existing or potential computational tools, particularly those relevant to textual data analytics. We combine the measures of the veracity dimensions into one composite index, the big data veracity index. This newly developed index provides a useful way of assessing systematic variations in big data quality across datasets with textual information. The paper contributes to big data research by categorizing the range of existing tools for measuring the suggested dimensions, and to Library and Information Science (LIS) by proposing to account for the heterogeneity of diverse big data and to identify the information quality dimensions important for each big data type.
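
To make the idea of a composite index concrete, the following is a minimal sketch of how three dimension scores could be combined into a single number. The abstract does not specify the aggregation rule, so everything here is an assumption made for illustration: the names `VeracityScores` and `veracity_index` are hypothetical, each dimension is assumed to be already normalized to the range 0 to 1 (with 1 meaning fully objective, truthful, or credible), and the combination is assumed to be a weighted average with equal weights by default.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class VeracityScores:
    """Hypothetical container for per-dataset dimension scores.

    Assumption: each score is normalized to [0, 1], where 1 is the
    high-veracity end of the dimension (objective, truthful, credible).
    """
    objectivity: float
    truthfulness: float
    credibility: float


def veracity_index(scores: VeracityScores,
                   weights: Sequence[float] = (1.0, 1.0, 1.0)) -> float:
    """Combine the three dimension scores into one composite value.

    Assumption: a weighted arithmetic mean. The source only states that
    the dimensions are combined into one index, not how.
    """
    values = (scores.objectivity, scores.truthfulness, scores.credibility)
    if any(not 0.0 <= v <= 1.0 for v in values):
        raise ValueError("each dimension score must lie in [0, 1]")
    if len(weights) != 3 or any(w < 0 for w in weights) or sum(weights) == 0:
        raise ValueError("weights must be three non-negative numbers, not all zero")
    # Weighted mean: the result stays in [0, 1] because every input does.
    return sum(w * v for w, v in zip(weights, values)) / sum(weights)


# Illustrative scores for two hypothetical text collections.
news_corpus = VeracityScores(objectivity=0.9, truthfulness=0.6, credibility=0.3)
review_corpus = VeracityScores(objectivity=0.2, truthfulness=0.5, credibility=0.5)

news_index = veracity_index(news_corpus)
review_index = veracity_index(review_corpus)
```

Under these assumptions, comparing `news_index` with `review_index` gives a rough way to rank datasets by overall veracity, and the `weights` argument lets an analyst emphasize whichever dimension matters most for a given data type.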
