WISDOM: Web intrapage informative structure mining based on document object model

To increase the commercial value and accessibility of pages, most content sites tend to publish their pages with intrasite redundant information, such as navigation panels, advertisements, and copyright announcements. Such redundant information increases the index size of general search engines and causes page topics to drift. In this paper, we study the problem of mining intrapage informative structure in news Web sites in order to find and eliminate redundant information. Note that intrapage informative structure is a subset of the original Web page and is composed of a set of fine-grained and informative blocks. The intrapage informative structures of pages in a news Web site contain only anchors linking to news pages or bodies of news articles. We propose an intrapage informative structure mining system called WISDOM (Web intrapage informative structure mining based on the document object model) which applies Information Theory to DOM tree knowledge in order to build the structure. WISDOM splits a DOM tree into many small subtrees and applies a top-down informative block searching algorithm to select a set of candidate informative blocks. The structure is built by expanding the set using proposed merging methods. Experiments on several real news Web sites show high precision and recall rates which validates WISDOM'S practical applicability.

[1]  Gerard Salton,et al.  Automatic Text Processing: The Transformation, Analysis, and Retrieval of Information by Computer , 1989 .

[2]  Benjamin W. Wah,et al.  Editorial: Two Named to Editorial Board of IEEE Transactions on Knowledge and Data Engineering , 1996 .

[3]  Jaideep Srivastava,et al.  Web mining: information and pattern discovery on the World Wide Web , 1997, Proceedings Ninth IEEE International Conference on Tools with Artificial Intelligence.

[4]  Geoffrey Zweig,et al.  Syntactic Clustering of the Web , 1997, Comput. Networks.

[5]  Nicholas Kushmerick,et al.  Wrapper Induction for Information Extraction , 1997, IJCAI.

[6]  Brad Adelberg,et al.  NoDoSE—a tool for semi-automatically extracting structured and semistructured data from text documents , 1998, SIGMOD '98.

[7]  Chun-Nan Hsu,et al.  Generating Finite-State Transducers for Semi-Structured Data Extraction from the Web , 1998, Inf. Syst..

[8]  M. KleinbergJon Authoritative sources in a hyperlinked environment , 1999 .

[9]  David W. Embley,et al.  Record-boundary discovery in Web documents , 1999, SIGMOD '99.

[10]  William W. Cohen Recognizing Structure in Web Pages using Similarity Queries , 1999, AAAI/IAAI.

[11]  Wai Lam,et al.  Learning to extract hierarchical information from semi-structured documents , 2000, CIKM '00.

[12]  Tom M. Mitchell,et al.  Learning to construct knowledge bases from the World Wide Web , 2000, Artif. Intell..

[13]  Ke Wang,et al.  Discovering Structural Association of Semistructured Data , 2000, IEEE Trans. Knowl. Data Eng..

[14]  T. Scheffer,et al.  Clipping and Analyzing News Using Machine Learning Techniques , 2001, Discovery Science.

[15]  Soumen Chakrabarti,et al.  Integrating the document object model with hyperlinks for enhanced topic distillation and information extraction , 2001, WWW '01.

[16]  Jan-Ming Ho,et al.  Discovering informative content blocks from Web documents , 2002, KDD.

[17]  Yusuke Suzuki,et al.  Discovery of Frequent Tag Tree Patterns in Semistructured Web Documents , 2002, PAKDD.

[18]  Ming-Syan Chen,et al.  Entropy-based link analysis for mining web informative structures , 2002, CIKM '02.

[19]  Xiaoli Li,et al.  Using micro information units for internet search , 2002, CIKM '02.

[20]  Hiroki Arimura,et al.  Optimized Substructure Discovery for Semi-structured Data , 2002, PKDD.

[21]  Michael Gertz,et al.  Reverse engineering for Web data: from visual to semantic structures , 2002, Proceedings 18th International Conference on Data Engineering.

[22]  Ziv Bar-Yossef,et al.  Template detection via data mining and its applications , 2002, WWW.

[23]  Berthier A. Ribeiro-Neto,et al.  A brief survey of web data extraction tools , 2002, SGMD.

[24]  Ke Wang,et al.  Discovering Frequent Substructures from Hierarchical Semi-structured Data , 2002, SDM.

[25]  Takayoshi Shoudai,et al.  Extracting Characteristic Structures among Words in Semistructured Documents , 2002, PAKDD.

[26]  Wei-Ying Ma,et al.  Detecting web page structure for adaptive viewing on small form factor devices , 2003, WWW '03.

[27]  Hiroki Arimura,et al.  Efficient Substructure Discovery from Large Semi-Structured Data , 2001, IEICE Trans. Inf. Syst..

[28]  Ming-Syan Chen,et al.  Mining Web informative structures and contents based on entropy analysis , 2004, IEEE Transactions on Knowledge and Data Engineering.

[29]  Sang Joon Kim,et al.  A Mathematical Theory of Communication , 2006 .

[30]  Robert Richards,et al.  Document Object Model (DOM) , 2006 .