Information retrieval from a large number of Web sites

Many Web information retrieval methods are tied to specific Web sites, for example methods based on extraction rules and methods based on training page samples. These methods work well on one Web site but fail on others unless new rules are added or new training pages are supplied manually. Furthermore, if the template of a Web site changes, the extraction rules must be redesigned or the training pages re-entered. Such methods are therefore hard to maintain and hard to apply to a large number of different Web sites. This paper proposes a new method that is based on the keywords of a given topic instead of rules or training pages. Experimental evaluation on a large number of Web pages from different Web sites indicates that the method extracts the information correctly and automatically, regardless of which Web site a page comes from. The method has been successfully applied in an intelligent search and mining system for electronic business.
