An improved focused crawler based on Semantic Similarity Vector Space Model

This paper presents an improved retrieval model, the Semantic Similarity Vector Space Model (SSVSM). The proposed model predicts the priorities of unvisited URLs with respect to a given topic, and guides focused crawlers to download a large quantity of high-quality web pages.

A focused crawler is topic-specific: it aims to selectively collect web pages from the Internet that are relevant to a given topic. In many studies, the Vector Space Model (VSM) and the Semantic Similarity Retrieval Model (SSRM) use cosine similarity and semantic similarity, respectively, to compute the similarity between a web page and the given topic. However, if a web page and the given topic share no common terms, the VSM will not obtain the proper topical similarity of the web page. Likewise, if all of the terms shared between them are synonyms, the SSRM will not obtain the proper topical similarity. To address these problems, this paper proposes an improved retrieval model, the Semantic Similarity Vector Space Model (SSVSM). The model integrates the TF*IDF values of the terms with the semantic similarities among the terms to construct topic and document semantic vectors that are mapped to the same double-term set. It then computes the cosine similarity between these semantic vectors as the topic-relevant similarity of a document, where the documents include the full texts and the anchor texts of unvisited hyperlinks. Next, the proposed model predicts the priority of each unvisited hyperlink by integrating its full-text and anchor-text topic-relevant similarities. The experimental results demonstrate that this approach improves the performance of focused crawlers and outperforms focused crawlers based on Breadth-First, VSM and SSRM. In conclusion, the method is effective for focused crawling.
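The VSM failure case and the idea of comparing vectors built over term pairs can be illustrated with a minimal Python sketch. It is not the paper's formulation: the abstract does not give the exact construction of the double-term set or the rule for combining full-text and anchor-text similarities. The sketch assumes one vector dimension per (topic term, document term) pair, weights each dimension by the term's TF*IDF value times the pair's semantic similarity, and combines the two similarities with a linear weight. The names `TOY_SIM`, `term_sim`, `semantic_similarity`, `link_priority` and the `weight` parameter are hypothetical, and the similarity table is a toy stand-in for a lexical resource.

```python
import math
from itertools import product

# Hypothetical term-to-term semantic similarities in [0, 1].
# A real system would derive these from a lexical resource; this
# toy table only stands in for one.
TOY_SIM = {
    frozenset(["car", "automobile"]): 1.0,
    frozenset(["car", "vehicle"]): 0.8,
}


def term_sim(a, b):
    """Semantic similarity of two terms (1.0 for identical terms)."""
    if a == b:
        return 1.0
    return TOY_SIM.get(frozenset([a, b]), 0.0)


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(x * x for x in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def vsm_similarity(topic, doc):
    """Plain VSM: cosine over the union of terms, weighted by TF*IDF."""
    terms = sorted(set(topic) | set(doc))
    return cosine([topic.get(t, 0.0) for t in terms],
                  [doc.get(t, 0.0) for t in terms])


def semantic_similarity(topic, doc):
    """Assumed construction: one dimension per (topic term, doc term) pair,
    each weighted by TF*IDF times the pair's semantic similarity."""
    pairs = list(product(sorted(topic), sorted(doc)))
    topic_vec = [topic[t] * term_sim(t, d) for t, d in pairs]
    doc_vec = [doc[d] * term_sim(t, d) for t, d in pairs]
    return cosine(topic_vec, doc_vec)


def link_priority(full_text_sim, anchor_text_sim, weight=0.5):
    """Assumed linear combination of the two topic-relevant similarities."""
    return weight * full_text_sim + (1.0 - weight) * anchor_text_sim


# Topic and page share no literal term, only a synonym pair.
topic = {"car": 1.0}          # term -> TF*IDF weight (toy values)
page = {"automobile": 1.0}
print(vsm_similarity(topic, page))       # no common terms, so plain VSM sees nothing
print(semantic_similarity(topic, page))  # the synonym pair contributes here
```

In this toy case the plain VSM score is zero because the two vectors have no overlapping dimension, while the pair-based vectors still register the synonym relation. A crawler frontier could then order unvisited links by the value returned from `link_priority`.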
