Title extraction from bodies of HTML documents and its application to web page retrieval

This paper is concerned with the automatic extraction of titles from the bodies of HTML documents. Titles of HTML documents should be correctly defined in the title fields; in reality, however, HTML titles are often bogus. It is therefore desirable to extract titles automatically from the bodies of HTML documents, an issue that does not seem to have been investigated previously. In this paper, we take a supervised machine learning approach to the problem. We propose a specification for HTML titles and use format information such as font size, position, and font weight as features in title extraction. Our method significantly outperforms the baseline method of using the lines in the largest font size as the title (20.9%-32.6% improvement in F1 score). As an application, we consider web page retrieval, using the TREC Web Track data for evaluation. We propose a new method for HTML document retrieval using extracted titles. Experimental results indicate that using both extracted titles and title fields is almost always better than using title fields alone; extracted titles are particularly helpful in the task of named page finding (23.1%-29.0% improvement).
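To make the baseline and the format features concrete, the following is a minimal sketch, not the paper's implementation. It assumes the HTML body has already been split into text units with known rendering attributes; the `Unit` class, its fields, and both function names are hypothetical and introduced only for illustration. The first function implements the largest-font-size baseline mentioned above, and the second shows one plausible way to turn font size, font weight, and position into numeric features for a supervised classifier.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Unit:
    """A hypothetical text unit extracted from an HTML body."""
    text: str
    font_size: float  # rendered font size, in any consistent unit
    bold: bool        # True if the unit is rendered in a bold weight
    position: int     # 0-based order of the unit within the body


def largest_font_baseline(units: List[Unit]) -> str:
    """Baseline: return the text of the units set in the largest font size."""
    candidates = [u for u in units if u.text.strip()]
    if not candidates:
        return ""
    max_size = max(u.font_size for u in candidates)
    top = [u for u in candidates if u.font_size == max_size]
    top.sort(key=lambda u: u.position)  # keep document order
    return " ".join(u.text.strip() for u in top)


def format_features(unit: Unit, units: List[Unit]) -> Dict[str, float]:
    """Illustrative format features (assumed encoding, not the paper's)."""
    max_size = max(u.font_size for u in units)
    last_index = max(len(units) - 1, 1)  # avoid division by zero
    return {
        "relative_font_size": unit.font_size / max_size,
        "is_bold": 1.0 if unit.bold else 0.0,
        "relative_position": unit.position / last_index,
    }


if __name__ == "__main__":
    page = [
        Unit("Home | Contact", 10.0, False, 0),
        Unit("Annual Report 2004", 24.0, True, 1),
        Unit("This report summarizes the year.", 12.0, False, 2),
    ]
    print(largest_font_baseline(page))
    print(format_features(page[1], page))
```

The baseline fails whenever the largest text on a page is a banner or logo caption rather than the title, which is the kind of case where learned weights over several format features can help.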
