Crawling the Hidden Web: An Approach to Dynamic Web Indexing

The majority of websites that hold online information are dynamic and therefore too sophisticated for many traditional search engines to index. As the number of such hidden web pages keeps growing, the issue continues to divide researchers and practitioners in the web mining community. The features that enrich these dynamic web pages make indexing them more challenging by the day. In this paper we explain these features and challenges and present a framework for dynamic web indexing. We describe the experimental setup and development process, together with the implementation of the framework and the results obtained from it. We conclude by outlining possible future work: integrating Hadoop MapReduce with the framework to update and maintain the index.
