Sentiment crawling: Extremist content collection through a sentiment analysis guided web-crawler

As the volume of data generated on the internet grows exponentially, developing guided data collection methods becomes increasingly essential to the research process. This paper proposes an approach to building a self-guiding web-crawler that collects data specifically from extremist websites. The crawler is guided by sentiment-based classification rules, which allow it to make decisions based on the content of each webpage it downloads. First, content from 2,500 webpages was collected for each of four sentiment-based classes: pro-extremist websites, anti-extremist websites, neutral news sites discussing extremism, and sites with no discussion of extremism. Part-of-speech tagging was then used to find the most frequent keywords in these pages. Sentiment software was used in conjunction with classification software to generate a decision tree that could effectively discern which class a particular page falls into. The resulting tree showed an 80% success rate at differentiating between the four classes and a 92% success rate at classifying extremist pages specifically. This decision tree was then applied to a randomly selected sample of pages for each class. The secondary test produced results similar to those of the primary test, which holds promise for future studies using this framework.
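
To make the guidance idea concrete, the following is a minimal sketch of how a sentiment-based rule could steer a crawler. It is not the decision tree reported above, which was learned with classification software. The lexicon, keyword set, thresholds, class labels, and the in-memory `TOY_WEB` are all hypothetical stand-ins, and the hand-written rules in `classify` only imitate the shape of a small decision tree.

```python
from collections import deque

# Hypothetical toy sentiment lexicon: word -> polarity score.
SENTIMENT = {"praise": 1.0, "hero": 1.0, "condemn": -1.0, "tragic": -1.0}
# Hypothetical topic keywords (the paper derives its keywords from tagged pages).
KEYWORDS = {"militant", "extremist"}


def features(text, window=3):
    """Return (keyword hit count, mean sentiment of words near the keywords)."""
    tokens = text.lower().split()
    hits = [i for i, tok in enumerate(tokens) if tok in KEYWORDS]
    scores = []
    for i in hits:
        # Score sentiment words within `window` tokens of each keyword.
        for tok in tokens[max(0, i - window): i + window + 1]:
            if tok in SENTIMENT:
                scores.append(SENTIMENT[tok])
    mean = sum(scores) / len(scores) if scores else 0.0
    return len(hits), mean


def classify(text):
    """Hand-written rules standing in for a learned decision tree."""
    hits, mean = features(text)
    if hits == 0:
        return "no_discussion"
    if mean > 0.25:   # assumed threshold
        return "pro"
    if mean < -0.25:  # assumed threshold
        return "anti"
    return "neutral_news"


def guided_crawl(seed, fetch, target="pro", max_pages=100):
    """Breadth-first crawl that only follows links from pages in `target`."""
    queue, seen, labels = deque([seed]), {seed}, {}
    while queue and len(labels) < max_pages:
        url = queue.popleft()
        text, links = fetch(url)
        labels[url] = classify(text)
        if labels[url] != target:
            continue  # off-target page: keep its label, do not expand it
        for link in links:
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return labels


# Hypothetical in-memory "web": url -> (page text, outgoing links).
TOY_WEB = {
    "a": ("we praise the militant hero", ["b", "c"]),
    "b": ("officials condemn the extremist attack as tragic", ["d"]),
    "c": ("the militant group released a statement today", []),
    "d": ("recipes for dinner", []),
}

if __name__ == "__main__":
    print(guided_crawl("a", TOY_WEB.__getitem__))
```

In this sketch, page `d` is never fetched because it is reachable only through `b`, which the rules label as off-target. Pruning the frontier in this way is what makes the crawl guided rather than exhaustive.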
