A General Learning Method for Automatic Title Extraction from HTML Pages

This paper addresses the problem of automatically learning the title metadata from HTML documents. The objective is to help index Web resources that are poorly annotated. Other works have pursued similar objectives, but they considered only titles in text format. In this paper we propose a general learning schema that learns textual titles from style information and image-format titles from image properties. We construct features from automatically annotated pages harvested from the Web; this paper details the corpus creation method as well as the information extraction techniques. Based on these features, learning algorithms such as Decision Trees and Random Forests are applied, achieving good results despite the heterogeneity of our corpus. We also show that combining both methods can yield better performance.
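As a rough illustration of what style-based features for textual title candidates might look like, the following sketch walks an HTML page and records a few style cues for each text node. The feature names (`heading_level`, `is_bold`, `is_centered`, `position`, `n_words`), the tag sets, and the helper `extract_candidates` are assumptions made for this example; the abstract does not list the paper's actual feature set.

```python
from html.parser import HTMLParser

# Hypothetical tag groups; the paper's real style features are not specified here.
HEADING_TAGS = {"h1", "h2", "h3", "h4", "h5", "h6"}
BOLD_TAGS = {"b", "strong"}
SKIP_TAGS = {"script", "style", "head", "title"}  # text that is not rendered in the body


class StyleFeatureExtractor(HTMLParser):
    """Collects one feature dict per visible text node."""

    def __init__(self):
        super().__init__()
        self.stack = []       # currently open tags, outermost first
        self.candidates = []  # feature dicts in document order

    def handle_starttag(self, tag, attrs):
        self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag, tolerating unclosed tags such as <br>.
        if tag in self.stack:
            while self.stack and self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = " ".join(data.split())
        if not text or any(t in SKIP_TAGS for t in self.stack):
            return
        headings = [t for t in self.stack if t in HEADING_TAGS]
        self.candidates.append({
            "text": text,
            "position": len(self.candidates),  # rank among visible text nodes
            "heading_level": int(headings[-1][1]) if headings else 0,
            "is_bold": any(t in BOLD_TAGS for t in self.stack),
            "is_centered": "center" in self.stack,
            "n_words": len(text.split()),
        })


def extract_candidates(html):
    """Return style-feature dicts for every visible text node in `html`."""
    parser = StyleFeatureExtractor()
    parser.feed(html)
    parser.close()
    return parser.candidates


sample = """<html><head><title>Home</title></head><body>
<p>Menu</p><h1><b>Annual Report 2009</b></h1><p>Some body text here.</p>
</body></html>"""

for candidate in extract_candidates(sample):
    print(candidate)
```

Feature dictionaries of this kind could then be turned into vectors and passed to a decision tree or random forest learner, with each text node labeled as title or non-title. How labels are obtained and which learner settings are used are described in the paper, not in this sketch.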
