Web page title extraction and its application

This paper is concerned with the automatic extraction of titles from the bodies of HTML documents (web pages). Titles of HTML documents should be correctly defined in the title fields by the authors; however, in reality they are often bogus. It is therefore advantageous if we can automatically extract titles from HTML documents. In this paper, we take a supervised machine learning approach to address the problem. We first propose a specification of HTML titles, that is, a 'definition' of what counts as the title of an HTML document. Next, we employ two learning methods to perform the task. In one method, we utilize features extracted from the DOM (Document Object Model) tree; in the other method, we utilize features based on vision. We also combine the two methods to further enhance the extraction accuracy. Our title extraction methods significantly outperform the baseline method of using the lines in the largest font size as the title (22.6-37.4% improvements in terms of F1 score). As an application, we consider web page retrieval. We use the TREC Web Track data for evaluation. We propose a new method for HTML document retrieval using extracted titles. Experimental results indicate that the use of both extracted titles and title fields is almost always better than the use of title fields alone; the use of extracted titles is particularly helpful in the task of named page finding (25.1-30.3% improvements).
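To make the baseline concrete, the following is a minimal sketch of the "largest font size" heuristic and of an F1 computation. It is not the authors' implementation. The `Line` record, the function names, and the assumption that a page has already been rendered into text lines with known font sizes are all hypothetical, and the token-level F1 shown here is only one plausible way to score an extracted title against a reference title.

```python
from collections import Counter
from dataclasses import dataclass
from typing import List


@dataclass
class Line:
    """Hypothetical pre-rendered text line: its text and its font size."""
    text: str
    font_size: float


def largest_font_title(lines: List[Line]) -> str:
    """Baseline: join all non-empty lines that share the largest font size."""
    candidates = [ln for ln in lines if ln.text.strip()]
    if not candidates:
        return ""
    max_size = max(ln.font_size for ln in candidates)
    return " ".join(
        ln.text.strip() for ln in candidates if ln.font_size == max_size
    )


def token_f1(predicted: str, reference: str) -> float:
    """Token-level F1 between two titles (an assumed scoring scheme)."""
    pred_tokens = predicted.lower().split()
    ref_tokens = reference.lower().split()
    # Multiset intersection counts each shared token at most as often
    # as it appears in both titles.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


page = [
    Line("Home | Products", 10.0),
    Line("Annual Report", 24.0),
    Line("2004", 24.0),
    Line("Body text of the page.", 12.0),
]
extracted = largest_font_title(page)
score = token_f1(extracted, "Annual Report")
```

In this toy page the heuristic also picks up the year line because it shares the largest font size, which costs precision against the reference title. Over-selection of this kind is one plausible reason a font-size baseline could lose to a learned extractor.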
