Mining data records in Web pages

A large amount of information on the Web is contained in regularly structured objects, which we call data records. Such data records are important because they often present the essential information of their host pages, e.g., lists of products or services. Mining such data records makes it possible to extract the information they contain and to provide value-added services. Existing automatic techniques are not satisfactory because of their poor accuracy. In this paper, we propose a more effective technique for this task. The technique is based on two observations about data records on the Web and a string matching algorithm. It is able to mine both contiguous and non-contiguous data records. Our experimental results show that the proposed technique substantially outperforms existing techniques.
