Fake news identification: a comparison of parts-of-speech and N-grams with neural networks

The rise of the internet has enabled fake news to reach larger audiences more quickly. As more people turn to social media for news, the accuracy of information on these platforms is especially important. To help classify the accuracy of news articles at scale, machine learning models have been developed and trained to recognize fake articles. Previous linguistic work suggests that part-of-speech and N-gram frequencies often differ between fake and real articles. To compare how these frequencies relate to the accuracy of an article, a dataset of 260 news articles, 130 fake and 130 real, was collected for training neural network classifiers. The first model relies solely on part-of-speech frequencies within the body of the text and consistently achieved 82% accuracy. As the proportion of the dataset used for training grew smaller, accuracy decreased, as expected. The true negative rate, however, remained high. Thus, some aspect of the fake articles was readily identifiable, even when the classifier was trained on a limited number of examples. The second model relies on the most commonly occurring N-gram frequencies. The neural nets were trained on N-grams of different lengths. Interestingly, accuracy was near 61% for each N-gram size. This suggests that some of the same information may be ascertainable across N-grams of different sizes.
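The two feature representations described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the tag inventory, the tokenization, and the choice of "most common" N-gram vocabulary are assumptions, and part-of-speech tagging itself is delegated to a hypothetical off-the-shelf tagger that produces (token, tag) pairs.

```python
from collections import Counter

# Hypothetical coarse tag inventory; the paper does not list its exact tag set.
TAGS = ["NOUN", "VERB", "ADJ", "ADV", "PRON", "DET", "ADP", "CONJ", "NUM", "OTHER"]

def pos_frequency_vector(tagged_tokens, tags=TAGS):
    """Normalized part-of-speech frequencies for one article.

    `tagged_tokens` is a list of (token, tag) pairs, e.g. from an
    off-the-shelf tagger; tagging itself is outside this sketch.
    """
    counts = Counter(tag for _, tag in tagged_tokens)
    total = max(len(tagged_tokens), 1)
    return [counts[t] / total for t in tags]

def ngram_frequency_vector(tokens, n, vocabulary):
    """Frequencies, within one article, of the N-grams in `vocabulary`
    (assumed to be the most common N-grams across the whole corpus)."""
    grams = Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    total = max(sum(grams.values()), 1)
    return [grams[g] / total for g in vocabulary]

# Toy usage: each article becomes one fixed-length vector per representation,
# suitable as input to a small feed-forward neural network classifier.
tagged = [("the", "DET"), ("cat", "NOUN"), ("sat", "VERB"), ("down", "ADV")]
pos_vec = pos_frequency_vector(tagged)

tokens = ["the", "cat", "sat", "on", "the", "mat"]
vocab = [("the", "cat"), ("the", "mat"), ("on", "the")]
ngram_vec = ngram_frequency_vector(tokens, 2, vocab)
```

Either vector gives every article a fixed dimensionality regardless of its length, which is what lets a single neural network consume articles of varying sizes.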