Distributed Web2.0 crawling for ontology evolution

Semantic Web technologies in general, and ontology-based approaches in particular, are considered the foundation for the next generation of information services. While ontologies enable software agents to exchange knowledge and information in a standardised, intelligent manner, describing today's vast amount of information in terms of ontological knowledge and tracking the evolution of such ontologies remain a challenge. In this paper we describe Web2.0 crawling for ontology evolution. Owing to its evolutionary properties and social network characteristics, the World Wide Web, or Web for short, is a well-suited data source for evolving an ontology. The decentralised structure of the Internet, the huge amount of data and emerging Web2.0 technologies pose several challenges for a crawling system. We present a distributed crawling system with standard browser integration. The proposed system is a high-performance, site-script-based, noise-reducing crawler that loads content equivalent to what a standard browser would load from Web2.0 resources. Furthermore, we describe the integration of this spider into our ontology evolution framework.
