Extracting Author Meta-Data from Web Using Visual Features

Enriching the author meta-data of a digital library can enable valuable services and applications. This paper addresses the problem of extracting authors' information from their homepages, which can be formulated as a multiclass classification problem: a homepage is treated as a group of information pieces, each of which must be classified into a field such as Name, Title, Affiliation, or Email. In this setting, each information piece can be viewed as a point in a feature space, and certain patterns can also be observed among the different fields on a page. To improve extraction accuracy, this paper argues that the visual features of information pieces on a homepage should be fully exploited. It also proposes an inter-fields probability model that captures the relations among different fields and can be combined with feature-space-based classification. Experimental results demonstrate that using visual features and applying the inter-fields probability model significantly improve extraction accuracy.
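As a rough illustration of how per-piece classification might be combined with a model of relations among fields, the sketch below decodes a sequence of information pieces with a first-order, Viterbi-style search. This is an assumption made for illustration only and not necessarily the formulation used in the paper: the function name `decode`, the three-field label set, the restriction to adjacent pieces, and all probability values are hypothetical.

```python
import math

# Hypothetical label set; the abstract lists Name, Title, Affiliation, Email, etc.
FIELDS = ["Name", "Title", "Email"]


def _log(p):
    """Log with a small floor so zero probabilities do not raise an error."""
    return math.log(max(p, 1e-12))


def decode(piece_probs, transition, fields=FIELDS):
    """Pick the label sequence that maximizes the product of
    P(field | features of piece i) and P(field of piece i | field of piece i-1).

    piece_probs: one dict per information piece, mapping field -> probability
                 (assumed to come from a feature-space classifier).
    transition:  transition[prev][cur], an assumed inter-field probability.
    """
    if not piece_probs:
        return []
    # Best log-score of any labeling that ends in each field, so far.
    score = {f: _log(piece_probs[0][f]) for f in fields}
    backpointers = []
    for probs in piece_probs[1:]:
        new_score, pointers = {}, {}
        for f in fields:
            best_prev = max(fields, key=lambda p: score[p] + _log(transition[p][f]))
            new_score[f] = score[best_prev] + _log(transition[best_prev][f]) + _log(probs[f])
            pointers[f] = best_prev
        score = new_score
        backpointers.append(pointers)
    # Trace the best path backwards from the best final field.
    path = [max(fields, key=lambda f: score[f])]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    return path[::-1]


# Made-up classifier outputs for three pieces, in page order.
piece_probs = [
    {"Name": 0.80, "Title": 0.10, "Email": 0.10},
    {"Name": 0.45, "Title": 0.40, "Email": 0.15},  # ambiguous piece
    {"Name": 0.10, "Title": 0.10, "Email": 0.80},
]
# Made-up inter-field probabilities: a Name is rarely followed by another Name.
transition = {
    "Name": {"Name": 0.05, "Title": 0.70, "Email": 0.25},
    "Title": {"Name": 0.05, "Title": 0.25, "Email": 0.70},
    "Email": {"Name": 0.10, "Title": 0.10, "Email": 0.80},
}

independent = [max(p, key=p.get) for p in piece_probs]
joint = decode(piece_probs, transition)
print(independent)  # ['Name', 'Name', 'Email']
print(joint)        # ['Name', 'Title', 'Email']
```

In this toy input, classifying each piece on its own labels the ambiguous second piece as Name, while the joint decoding relabels it as Title because a second Name directly after a Name is assumed to be unlikely. A real system would estimate both probability tables from labeled homepages.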
