On detecting cherry-picked trendlines

Poorly supported stories can be told from data by cherry-picking which data points to include. Such stories may be technically accurate, yet they are misleading. In this paper, we build a system for detecting cherry-picking, with a focus on trendlines extracted from temporal data. We define a support metric for detecting such trendlines: given a dataset and a statement based on a trendline, we compute a support score that indicates the extent to which the statement is cherry-picked. After formalizing the terms and studying different types of trendlines, we propose efficient and effective algorithms for computing the support measure. We also study the problem of discovering the most supported statements. Besides theoretical analysis, we conduct extensive experiments on real-world data that demonstrate the validity of our proposed techniques.
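
As an illustration of the idea of a support score, the following is a minimal sketch and not the algorithm proposed in the paper. It assumes that a trendline statement compares the value at a start point with the value at an end point, and that support is the fraction of alternative (start, end) pairs, drawn from two neighboring regions, for which the statement still holds. The function name `trendline_support`, the region parameters, and the sample series are all hypothetical.

```python
from itertools import product
from typing import Callable, Sequence


def trendline_support(
    values: Sequence[float],
    start_region: range,
    end_region: range,
    holds: Callable[[float, float], bool],
) -> float:
    """Fraction of (start, end) index pairs, with start before end,
    for which the statement `holds(values[start], values[end])` is true.

    Assumption: this brute-force definition is for illustration only.
    """
    pairs = [(b, e) for b, e in product(start_region, end_region) if b < e]
    if not pairs:
        raise ValueError("no valid (start, end) pairs in the given regions")
    satisfied = sum(1 for b, e in pairs if holds(values[b], values[e]))
    return satisfied / len(pairs)


# Hypothetical measurements, one per time step (made-up data).
series = [5.0, 3.0, 4.0, 6.0, 2.0, 7.0, 3.5, 4.5]


def increasing(start_value: float, end_value: float) -> bool:
    """Statement under test: the value went up between the two points."""
    return end_value > start_value


# A cherry-picked trendline: from index 1 (3.0) to index 5 (7.0).
cherry_picked_holds = increasing(series[1], series[5])

# Support over all start points in the first half and end points in the second.
support = trendline_support(series, range(0, 4), range(4, 8), increasing)
print(cherry_picked_holds, support)
```

In this made-up series the chosen endpoints show an increase, yet fewer than half of the alternative endpoint pairs agree, which is the kind of gap a support score is meant to expose. The brute-force enumeration is quadratic in the region sizes; the paper's contribution lies in computing such a measure efficiently, which this sketch does not attempt.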
