A framework for focused linked data crawler using context graphs

In this paper, we propose a framework for a focused Linked Data (LD) crawler based on context graphs. A focused crawler searches for a specific subset of the web; in our case, it targets interlinked RDF data stores. The proposed crawler constructs a set of context graphs for the given seed URIs by back-crawling the web, and classifiers are trained to detect documents and assign them to different categories based on content type. These classifiers help the crawler search and update the context graphs automatically. The classifiers are trained using supervised learning. Additionally, an extensive overview of existing LD crawlers is provided, together with their basic requirements, architecture, issues, and challenges.
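
As a rough illustration of the back-crawling step, the sketch below builds the layers of a context graph for one seed URI by walking inbound links breadth-first. It is a minimal sketch under stated assumptions, not the implementation proposed in the paper: the function name `build_context_graph`, the `backlinks` dictionary, and the `ex:` URIs are all hypothetical, and a real crawler would have to obtain inbound links from some external source rather than an in-memory map.

```python
from collections import deque


def build_context_graph(seed, backlinks, max_depth=2):
    """Group URIs into layers by their shortest link distance to the seed.

    `backlinks` is a hypothetical in-memory map from a URI to the URIs that
    link to it; layer i holds URIs that reach the seed in exactly i hops.
    """
    layers = {0: {seed}}
    seen = {seed}
    frontier = deque([(seed, 0)])
    while frontier:
        uri, depth = frontier.popleft()
        if depth == max_depth:
            continue  # do not expand beyond the outermost layer
        for parent in backlinks.get(uri, ()):
            if parent not in seen:
                seen.add(parent)
                layers.setdefault(depth + 1, set()).add(parent)
                frontier.append((parent, depth + 1))
    return layers


# Toy backlink index (assumed data, for illustration only).
BACKLINKS = {
    "ex:seed": ["ex:a", "ex:b"],
    "ex:a": ["ex:c"],
    "ex:b": ["ex:c", "ex:d"],
    "ex:c": ["ex:e"],
}

layers = build_context_graph("ex:seed", BACKLINKS, max_depth=2)
for depth in sorted(layers):
    print(depth, sorted(layers[depth]))
```

In a setup like this, the documents in each layer could serve as labeled training examples for a per-layer classifier, so that the crawler can estimate how many hops a newly fetched document is from a relevant target.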
