Topic Crawler for Social Networks Monitoring

This paper describes a focused crawler for monitoring social networks, used for information extraction and content analysis. The crawler implements the MapReduce model for distributed computation and is oriented toward large volumes of text data. As a focused crawler, it searches for pages classified as relevant to a specified topic. The classifier is built using a knowledge database that defines words, their classes, and the rules for joining words into phrases. The weights of words and phrases are combined into a text weight that indicates relevance to the topic. The system was used to detect a drug community in the Russian segment of the LiveJournal social network. Official and slang drug terminology was used to develop the knowledge database. Different aspects of the knowledge database and the classifier are studied. A non-homogeneous Poisson process was used to model blog changes, since it permits building a monitoring policy that accounts for both blog update frequency and the time-of-day effect. Evaluation on real data shows a 25% increase in the detection of new posts.
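To illustrate why a non-homogeneous Poisson process suits a monitoring policy, the following minimal sketch computes the expected number of new posts between two crawler visits and the probability that at least one new post has appeared. It is not the implementation from the paper: the piecewise-constant hourly rate, the function names `expected_new_posts` and `prob_new_post`, and the example rates are all assumptions made for illustration. It relies only on the standard property that the number of events in an interval is Poisson distributed with mean equal to the integral of the rate over that interval.

```python
import math


def expected_new_posts(hourly_rate, start_hour, end_hour):
    """Integrate a piecewise-constant rate over [start_hour, end_hour].

    hourly_rate: 24 non-negative rates (posts per hour), indexed by hour of day
                 (an assumed, illustrative parameterization of the time-of-day effect).
    start_hour, end_hour: times in hours since midnight of day 0; values may
                 exceed 24, in which case the daily profile repeats.
    """
    if len(hourly_rate) != 24:
        raise ValueError("hourly_rate must contain 24 values")
    if end_hour < start_hour:
        raise ValueError("end_hour must not precede start_hour")

    total = 0.0
    t = start_hour
    while t < end_hour:
        # Advance to the next whole-hour boundary or to the end of the interval.
        next_boundary = min(math.floor(t) + 1, end_hour)
        total += hourly_rate[int(math.floor(t)) % 24] * (next_boundary - t)
        t = next_boundary
    return total


def prob_new_post(hourly_rate, start_hour, end_hour):
    """Probability of at least one new post in the interval: 1 - exp(-mean)."""
    return 1.0 - math.exp(-expected_new_posts(hourly_rate, start_hour, end_hour))


if __name__ == "__main__":
    # Hypothetical blog that is only active in the evening (18:00 to 23:00).
    rates = [0.0] * 24
    for hour in range(18, 23):
        rates[hour] = 0.4
    print(prob_new_post(rates, 3.0, 9.0))    # night-time revisit window
    print(prob_new_post(rates, 17.0, 23.0))  # evening revisit window
```

Under these assumptions, two revisit intervals of equal length can have very different chances of finding a new post, which is the reason a policy based on a time-varying rate can outperform one that uses only an average update frequency.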
