Extracting Author Meta-Data from Web Using Visual Features

Enriching digital library's author meta-data can lead to valuable services and applications. This paper addresses the problem of extracting authors' information from their homepages. This problem is actually a multiclass classification problem. A homepage can be treated as a group of information pieces which need to be classified to different fields, e.g., Name, Title, Affiliation, Email, etc. In this problem, not only each information piece can be viewed as a point in a feature space, but also certain patterns can be observed among different fields on a page. To improve the extraction accuracy, this paper argues that visual features of information pieces on a homepage should be sufficiently utilized. In addition, this paper also proposes an inter-fields probability model to capture the relation among different fields. This model can be combined with feature- space based classification. Experimental results demonstrate that utilizing visual features and applying the inter- fields probability model can significantly improve the extraction accuracy.

[1]  Andrew McCallum,et al.  Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data , 2001, ICML.

[2]  Hong Yu,et al.  Towards Answering Opinion Questions: Separating Facts from Opinions and Identifying the Polarity of Opinion Sentences , 2003, EMNLP.

[3]  Ellen Riloff,et al.  Learning Extraction Patterns for Subjective Expressions , 2003, EMNLP.

[4]  Hector Garcia-Molina,et al.  Extracting structured data from Web pages , 2003, SIGMOD '03.

[5]  Wei-Ying Ma,et al.  Learning block importance models for web pages , 2004, WWW '04.

[6]  Valter Crescenzi,et al.  RoadRunner: Towards Automatic Data Extraction from Large Web Sites , 2001, VLDB.

[7]  Berthier A. Ribeiro-Neto,et al.  A brief survey of web data extraction tools , 2002, SGMD.

[8]  Jane Yung-jen Hsu,et al.  Tree-Structured Template Generation for Web Pages , 2004, IEEE/WIC/ACM International Conference on Web Intelligence (WI'04).

[9]  Valter Crescenzi,et al.  RoadRunner: automatic data extraction from data-intensive web sites , 2002, SIGMOD '02.

[10]  Xiaoyan Zhu,et al.  Movie review mining and summarization , 2006, CIKM '06.

[11]  Bo Pang,et al.  A Sentimental Education: Sentiment Analysis Using Subjectivity Summarization Based on Minimum Cuts , 2004, ACL.

[12]  Nicholas Kushmerick,et al.  Wrapper Induction for Information Extraction , 1997, IJCAI.

[13]  Yoav Freund,et al.  A decision-theoretic generalization of on-line learning and an application to boosting , 1997, EuroCOLT.

[14]  Vijay V. Raghavan,et al.  Fully automatic wrapper generation for search engines , 2005, WWW '05.

[15]  Sunita Sarawagi Automation in Information Extraction and Data Integration , 2002, VLDB.

[16]  Wee Sun Lee,et al.  Using link analysis to improve layout on mobile devices , 2004, WWW '04.

[17]  Chun-Nan Hsu,et al.  Generating Finite-State Transducers for Semi-Structured Data Extraction from the Web , 1998, Inf. Syst..

[18]  Wei-Ying Ma,et al.  Object-level Vertical Search , 2007, CIDR.

[19]  Bing Liu,et al.  Mining and summarizing customer reviews , 2004, KDD.

[20]  Wee Sun Lee,et al.  Understanding the function of web elements for mobile content delivery using random walk models , 2005, WWW '05.

[21]  Ruihua Song,et al.  Joint optimization of wrapper generation and template detection , 2007, KDD '07.

[22]  Bing Liu,et al.  Opinion observer: analyzing and comparing opinions on the Web , 2005, WWW '05.

[23]  Craig A. Knoblock,et al.  A hierarchical approach to wrapper induction , 1999, AGENTS '99.

[24]  Alberto H. F. Laender,et al.  Automatic web news extraction using tree edit distance , 2004, WWW '04.

[25]  Veljko M. Milutinovic,et al.  Recognition of common areas in a Web page using visual information: a possible application in a page classification , 2002, 2002 IEEE International Conference on Data Mining, 2002. Proceedings..

[26]  Edward A. Fox,et al.  Automatic document metadata extraction using support vector machines , 2003, 2003 Joint Conference on Digital Libraries, 2003. Proceedings..