Discovering informative content blocks from Web documents

In this paper, we propose a new approach to discovering informative content in a set of tabular documents (Web pages) from a Web site. Our system, InfoDiscoverer, first partitions each page into several content blocks according to the HTML tags in the page. Based on the occurrence of features (terms) across the set of pages, it calculates an entropy value for each feature. The entropy value of a content block is then defined from the entropy values of the features it contains. By analyzing this information measure, we propose a method that dynamically selects the entropy threshold used to classify each block as either informative or redundant. Informative content blocks are the distinctive parts of a page, whereas redundant content blocks are the parts common to many pages. Using an answer set built from 13 manually tagged news Web sites with a total of 26,518 Web pages, experiments show that both recall and precision are greater than 0.956. That is, with this approach the informative blocks (news articles) of these sites can be automatically separated from semantically redundant content such as advertisements, banners, navigation panels, and news categories. Adopting InfoDiscoverer as a preprocessor for information retrieval and extraction applications should increase retrieval and extraction precision while reducing index size and extraction complexity.
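The entropy-based classification described above can be illustrated with a minimal sketch. The abstract does not give the exact formulas, so everything below is an assumption made for illustration: a feature's entropy is computed from its term counts normalized across pages (with the logarithm taken to base n, the number of pages, so values fall in [0, 1]), a block's entropy is the mean entropy of its distinct terms, and the threshold is a fixed hypothetical value rather than the dynamically selected one. The function names `feature_entropies`, `block_entropy`, and `classify_blocks` are likewise hypothetical and not part of InfoDiscoverer.

```python
import math


def feature_entropies(pages):
    """Return {term: entropy in [0, 1]} for a list of pages.

    Each page is a list of blocks; each block is a list of terms.
    Assumption: entropy is taken over the term's normalized counts per page.
    """
    n = len(pages)
    if n < 2:
        raise ValueError("need at least two pages to use log base n")
    counts = {}  # term -> list of per-page occurrence counts
    for j, page in enumerate(pages):
        for block in page:
            for term in block:
                counts.setdefault(term, [0] * n)[j] += 1
    entropies = {}
    for term, per_page in counts.items():
        total = sum(per_page)
        h = 0.0
        for c in per_page:
            if c:
                p = c / total
                h -= p * math.log(p, n)  # base n keeps h within [0, 1]
        entropies[term] = h
    return entropies


def block_entropy(block, entropies):
    """Mean entropy of the distinct terms in a block (an assumed definition)."""
    terms = set(block)
    if not terms:
        return 1.0  # assumption: an empty block carries no information
    return sum(entropies[t] for t in terms) / len(terms)


def classify_blocks(pages, threshold=0.5):
    """Mark each block True (informative) if its entropy is <= threshold.

    The fixed threshold is a placeholder; the paper selects it dynamically.
    """
    entropies = feature_entropies(pages)
    return [
        [block_entropy(block, entropies) <= threshold for block in page]
        for page in pages
    ]


pages = [
    [["home", "news", "sports"], ["election", "results", "tonight"]],
    [["home", "news", "sports"], ["storm", "hits", "coast"]],
    [["home", "news", "sports"], ["team", "wins", "final"]],
]
labels = classify_blocks(pages)
```

In this toy input, the navigation terms appear evenly on every page and so receive high entropy, while each article's terms appear on a single page and receive low entropy, which is the intuition behind treating common blocks as redundant and distinctive blocks as informative.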
