Development of a 'fake news' machine learning classifier and a dataset for its testing

Fabricated news stories that contain false information but are presented as factually accurate (commonly known as 'fake news') generated substantial interest and media attention following the 2016 U.S. presidential election. While the full details of what transpired during the election are still unknown, it appears that multiple groups used social media to spread false information packaged in fabricated news articles presented as truthful. Some have argued that this campaign had a material impact on the election, and the 2016 U.S. presidential election is far from the only campaign in which fake news has played an apparent role. This paper presents work on a counter-fake-news research effort whose long-term goal is to build an indications and warnings system for potentially deceptive false content. As part of this project, a dataset of manually classified legitimate and deceptive news articles was curated. The key criteria for distinguishing legitimate from deceptive articles, identified through the manual classification effort, are presented and discussed; these criteria can be embodied in a natural language processing system to detect illegitimate content. They include the document's source and origin, title, political perspective, and several key content characteristics. The paper evaluates the efficacy of each of these characteristics and its suitability for legitimate-versus-illegitimate classification. It concludes by discussing the use of these characteristics as input to a customized naïve Bayesian probability classifier, the results obtained with this classifier, and future work on its development.
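To make the classification approach concrete, the following is a minimal sketch of a Bernoulli naïve Bayes classifier over binary article characteristics. The feature names (`credible_source`, `sensational_title`) and the class labels are illustrative assumptions for this sketch, not the paper's actual feature set or customized model.

```python
from collections import defaultdict
import math


class NaiveBayesArticleClassifier:
    """Bernoulli naive Bayes over binary article features with
    Laplace smoothing. A simplified stand-in for the customized
    classifier described in the paper."""

    def __init__(self, feature_names):
        self.feature_names = feature_names
        self.class_counts = defaultdict(int)          # label -> # articles
        self.feature_counts = defaultdict(lambda: defaultdict(int))

    def fit(self, samples, labels):
        """samples: list of {feature_name: bool}; labels: list of str."""
        for feats, label in zip(samples, labels):
            self.class_counts[label] += 1
            for name in self.feature_names:
                if feats.get(name, False):
                    self.feature_counts[label][name] += 1

    def predict(self, feats):
        """Return the label with the highest posterior log-probability."""
        total = sum(self.class_counts.values())
        best_label, best_logp = None, float("-inf")
        for label, count in self.class_counts.items():
            logp = math.log(count / total)            # class prior
            for name in self.feature_names:
                # Laplace-smoothed conditional probability of the feature
                p = (self.feature_counts[label][name] + 1) / (count + 2)
                logp += math.log(p if feats.get(name, False) else 1 - p)
            if logp > best_logp:
                best_label, best_logp = label, logp
        return best_label
```

In use, each manually classified article would be reduced to a dictionary of binary characteristics before training, e.g. `clf.fit([{'credible_source': True, 'sensational_title': False}, ...], ['legitimate', ...])`, after which `clf.predict(...)` labels an unseen article.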