Focused crawler for events

There is a need for an Integrated Event Focused Crawling system to collect Web data about key events. When a disaster or other significant event occurs, many users try to locate the most up-to-date information about it. Yet event information is rarely collected and archived systematically. We propose intelligent event-focused crawling for automatic event tracking and archiving, ultimately leading to effective access. We developed an event model that captures key event information and incorporated it into a focused crawling algorithm. So that the focused crawler can use the event model to predict webpage relevance, we developed a function that measures the similarity between two event representations. We then conducted two series of experiments to evaluate our system on two recent events: the California shooting and the Brussels attack. The first series evaluated how effectively the proposed event model representation assesses webpage relevance. It outperformed the topic-only baseline in precision, recall, and F1-score, improving F1-score by 20%. The second series evaluated how effectively the event model-based focused crawler collects relevant webpages from the WWW. It outperformed the state-of-the-art best-first baseline focused crawler in harvest ratio, with an average improvement of 40%.
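The interplay between an event similarity function, a priority-ordered crawl frontier, and harvest ratio can be illustrated with a minimal sketch. It is not the authors' implementation. The abstract does not state which facets the event model contains or how they are combined, so the `Event` fields (topic, location, date), the cosine measure, the weights, the relevance threshold, and the `fetch` callback are all assumptions made for illustration.

```python
import heapq
import math
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class Event:
    # Hypothetical event representation: these three facets are an assumption,
    # each stored as a bag of terms.
    topic: Counter = field(default_factory=Counter)
    location: Counter = field(default_factory=Counter)
    date: Counter = field(default_factory=Counter)


def cosine(a, b):
    """Cosine similarity between two term-count vectors (0.0 if either is empty)."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def event_similarity(x, y, weights=(0.5, 0.25, 0.25)):
    """Weighted sum of per-facet cosine similarities (weights are assumed)."""
    parts = (
        cosine(x.topic, y.topic),
        cosine(x.location, y.location),
        cosine(x.date, y.date),
    )
    return sum(w * p for w, p in zip(weights, parts))


def crawl(seeds, target, fetch, budget=100, threshold=0.5):
    """Priority-driven crawl; returns the harvest ratio (relevant / fetched).

    `fetch(url)` is a hypothetical callback returning (Event, outgoing_links).
    Outgoing links inherit the score of the page they were found on.
    """
    frontier = [(-1.0, url) for url in seeds]  # max-heap via negated scores
    heapq.heapify(frontier)
    seen = set(seeds)
    relevant = fetched = 0
    while frontier and fetched < budget:
        _, url = heapq.heappop(frontier)
        page_event, links = fetch(url)
        fetched += 1
        score = event_similarity(page_event, target)
        if score >= threshold:
            relevant += 1
        for link in links:
            if link not in seen:
                seen.add(link)
                heapq.heappush(frontier, (-score, link))
    return relevant / fetched if fetched else 0.0
```

In this sketch a topic-only baseline corresponds to `weights=(1.0, 0.0, 0.0)`, which makes it straightforward to compare the two representations on the same set of pages.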
