Development of an intelligent distributed news retrieval system

Currently available web news retrieval systems face a number of problems, because web-based news retrieval requires quickly and accurately processing a very large volume of data that is constantly being updated. In this paper, we present the development of an intelligent distributed web news retrieval system whose goal is to accurately retrieve and organize web news information. It includes: a novel optimized crawler algorithm whose fetching speed is several times that of a traditional crawler; a keen tag-based extraction algorithm that extracts data-rich content with minimal manual effort and classifies data as important or not important, so that the crawler can revisit and update important data; and a modified MapReduce, improved by estimating the execution time of each subtask, which is shown to reduce the number of unusual tasks and shorten the overall job execution time.
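The idea of estimating each subtask's execution time to spot unusual tasks can be illustrated with a minimal sketch. It is not the algorithm of this paper: it assumes a simple linear extrapolation from reported progress, and the names `estimate_total_time`, `find_unusual_tasks`, and the `slack` threshold are hypothetical.

```python
from statistics import median


def estimate_total_time(elapsed_s, progress):
    """Assumed estimator: total time ~ elapsed time / fraction completed."""
    if progress <= 0.0:
        return float("inf")  # no progress yet, so no finite estimate
    return elapsed_s / progress


def find_unusual_tasks(tasks, slack=1.5):
    """Flag tasks whose estimated total time exceeds slack * median estimate.

    tasks maps a task id to (elapsed seconds, progress in [0, 1]).
    The slack factor is an illustrative assumption, not a value from the paper.
    """
    estimates = {
        tid: estimate_total_time(elapsed, progress)
        for tid, (elapsed, progress) in tasks.items()
    }
    finite = [t for t in estimates.values() if t != float("inf")]
    if not finite:
        return []
    typical = median(finite)
    return sorted(tid for tid, t in estimates.items() if t > slack * typical)


# Three map subtasks observed 30 seconds into a job (made-up numbers).
tasks = {
    "map-0": (30.0, 0.60),
    "map-1": (30.0, 0.50),
    "map-2": (30.0, 0.10),
}
unusual = find_unusual_tasks(tasks)
```

In this sketch a scheduler could reassign or speculatively re-run the flagged subtasks; how the system described here actually handles them is not stated in the abstract.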
