A topic-specific crawling strategy based on semantics similarity

With the Internet growing exponentially, search engines are encountering unprecedented challenges. A focused search engine selectively seeks out web pages that are relevant to user topics, and determining the best crawling strategy for focused search is an active research topic. At present, the rank values of unvisited web pages are computed from the hyperlink structure (as in the PageRank algorithm), from a Vector Space Model, or from a combination of the two, but not from the semantic relations between the user topic and the unvisited web pages. In this paper, we propose a concept context graph that stores the knowledge context derived from the user's history of clicked web pages and guides a focused crawler in its next crawl. The concept context graph provides a novel semantic ranking that steers the web crawler toward pages highly relevant to the user's topic. By computing the concept distance and concept similarity among the concepts of the concept context graph, and by matching unvisited web pages against the graph, we compute the rank values of the unvisited web pages and select the relevant hyperlinks. Additionally, we build a focused crawling system and measure the precision, recall, average harvest rate, and F-measure of our proposed approach against four baselines: Breadth First, Cosine Similarity, the Link Context Graph and the Relevancy Context Graph. The results show that our proposed method outperforms these baseline methods.
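The four evaluation measures named above can be illustrated with a minimal sketch. It uses the standard information-retrieval definitions of precision, recall, and F-measure; the abstract does not define average harvest rate, so the version here (the mean of the running fraction of relevant pages, sampled every `step` downloads) is an assumption, as are all function and parameter names. The crawl log is assumed to contain each URL at most once.

```python
from typing import Sequence, Set


def precision(crawled: Sequence[str], relevant: Set[str]) -> float:
    """Fraction of crawled pages that are relevant to the topic."""
    if not crawled:
        return 0.0
    hits = sum(1 for url in crawled if url in relevant)
    return hits / len(crawled)


def recall(crawled: Sequence[str], relevant: Set[str]) -> float:
    """Fraction of all relevant pages that the crawler actually fetched."""
    if not relevant:
        return 0.0
    return len(set(crawled) & relevant) / len(relevant)


def f_measure(p: float, r: float) -> float:
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    return 0.0 if p + r == 0 else 2 * p * r / (p + r)


def average_harvest_rate(
    crawled: Sequence[str], relevant: Set[str], step: int
) -> float:
    """Mean of the running harvest rate, sampled every `step` pages.

    Assumed definition: the paper may sample or average differently.
    """
    rates = []
    hits = 0
    for i, url in enumerate(crawled, start=1):
        if url in relevant:
            hits += 1
        if i % step == 0:
            rates.append(hits / i)
    return sum(rates) / len(rates) if rates else 0.0


if __name__ == "__main__":
    # Hypothetical crawl order and hypothetical ground-truth relevant set.
    crawl_log = ["a", "b", "c", "d"]
    relevant_pages = {"a", "c", "e"}
    p = precision(crawl_log, relevant_pages)
    r = recall(crawl_log, relevant_pages)
    print(p, r, f_measure(p, r))
    print(average_harvest_rate(crawl_log, relevant_pages, step=2))
```

Computing the same four numbers for every crawler from one shared relevant set is what makes a comparison across ranking strategies meaningful.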
