Applying semantic similarity measures to enhance topic-specific web crawling

As the Internet grows rapidly, finding desirable information becomes a tedious and time consuming task. Topic-specific web crawlers, as utopian solutions, tackle this issue through traversing the Web and collecting information related to the topic of interest. In this regard, various methods are proposed. Nevertheless, they hardly consider desired sense of the given topic which would certainly play an important role to find relevant web pages. In this paper, we attempt to improve topic-specific web crawling by disambiguating the sense of the topic. This would avoid crawling irrelevant links interlaced with other senses of the topic. For this purpose, by considering links hypertext semantic, we employ Lin semantic similarity measure in our crawler, named LinCrawler, to distinguish topic sense-related links from the others. Moreover, we compare LinCrawler against TFCrawler which only considers frequency of terms in hypertexts. Experimental results show LinCrawler outperforms TFCrawler to collect more relevant web pages.

[1]  Roy Rada,et al.  Development and application of a metric on semantic nets , 1989, IEEE Trans. Syst. Man Cybern..

[2]  Aysu Betin Can,et al.  MedicoPort: A medical search engine for all , 2007, Comput. Methods Programs Biomed..

[3]  G. Aghila,et al.  Ontology-based Web crawler , 2004, International Conference on Information Technology: Coding and Computing, 2004. Proceedings. ITCC 2004..

[4]  Li Zhang,et al.  Focused crawling using navigational rank , 2010, CIKM '10.

[5]  Sergey Brin,et al.  The Anatomy of a Large-Scale Hypertextual Web Search Engine , 1998, Comput. Networks.

[6]  Yoelle Maarek,et al.  The Shark-Search Algorithm. An Application: Tailored Web Site Mapping , 1998, Comput. Networks.

[7]  Padmini Srinivasan,et al.  Learning to crawl: Comparing classification schemes , 2005, TOIS.

[8]  Philip Resnik,et al.  Using Information Content to Evaluate Semantic Similarity in a Taxonomy , 1995, IJCAI.

[9]  Christiane Fellbaum,et al.  Combining Local Context and Wordnet Similarity for Word Sense Identification , 1998 .

[10]  James J. Cimino,et al.  Towards the development of a conceptual distance metric for the UMLS , 2004, J. Biomed. Informatics.

[11]  Norwati Mustapha,et al.  Term frequency-information content for focused crawling to predict relevant web pages. , 2013 .

[12]  David W. Conrath,et al.  Semantic Similarity Based on Corpus Statistics and Lexical Taxonomy , 1997, ROCLING/IJCLCLP.

[13]  Ioannis Pitas,et al.  Combining text and link analysis for focused crawling - An application for vertical search engines , 2007, Inf. Syst..

[14]  Dekang Lin,et al.  An Information-Theoretic Definition of Similarity , 1998, ICML.

[15]  Martha Palmer,et al.  Verb Semantics and Lexical Selection , 1994, ACL.

[16]  Chunxia Yin,et al.  A Novel Method for Crawler in Domain-specific Search , 2010 .

[17]  Arputharaj Kannan,et al.  LSCrawler: A Framework for an Enhanced Focused Web Crawler Based on Link Semantics , 2006, 2006 IEEE/WIC/ACM International Conference on Web Intelligence (WI 2006 Main Conference Proceedings)(WI'06).

[18]  Rajeev Motwani,et al.  The PageRank Citation Ranking : Bringing Order to the Web , 1999, WWW 1999.

[19]  Anil K. Jain,et al.  Classification of text documents , 1998, Proceedings. Fourteenth International Conference on Pattern Recognition (Cat. No.98EX170).

[20]  Martin Chodorow,et al.  Combining local context and wordnet similarity for word sense identification , 1998 .

[21]  Yong Yu,et al.  Conceptual Graph Matching for Semantic Search , 2002, ICCS.

[22]  Andrew McCallum,et al.  A Machine Learning Approach to Building Domain-Specific Search Engines , 1999, IJCAI.