Extracting Author Meta-Data from Web Using Visual Features

Enriching a digital library's author meta-data can enable valuable services and applications. This paper addresses the problem of extracting authors' information from their homepages, which it frames as a multiclass classification problem: a homepage is treated as a group of information pieces, each of which must be assigned to a field such as Name, Title, Affiliation, or Email. Each information piece can be viewed as a point in a feature space, and certain patterns can also be observed among the different fields on a page. To improve extraction accuracy, the paper argues that the visual features of information pieces on a homepage should be fully exploited. It also proposes an inter-fields probability model that captures the relations among different fields and can be combined with feature-space-based classification. Experimental results demonstrate that using visual features and applying the inter-fields probability model significantly improve extraction accuracy.
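The idea of combining a per-piece classifier with a model of relations among fields can be illustrated with a small sketch. The abstract does not specify the form of the inter-fields probability model, so everything below is an assumption made for illustration: the relation is modeled as a first-order transition probability between the fields of consecutive information pieces in page order, the per-piece probabilities stand in for the output of any feature-space classifier, and all numbers are hypothetical. Viterbi decoding is a swapped-in technique here, not the paper's stated method.

```python
import math

FIELDS = ["Name", "Title", "Affiliation", "Email"]
EPS = 1e-12  # guard against log(0)


def independent_labels(piece_probs):
    """Label each piece by its own highest classifier probability."""
    return [FIELDS[max(range(len(FIELDS)), key=lambda j: row[j])]
            for row in piece_probs]


def decode(piece_probs, transition, prior):
    """Viterbi decoding: pick the field sequence that maximizes the product
    of classifier probabilities and (assumed) field-to-field transitions."""
    n, k = len(piece_probs), len(FIELDS)
    score = [[-math.inf] * k for _ in range(n)]
    back = [[0] * k for _ in range(n)]
    for j in range(k):
        score[0][j] = (math.log(max(prior[j], EPS))
                       + math.log(max(piece_probs[0][j], EPS)))
    for i in range(1, n):
        for j in range(k):
            # Best previous field for reaching field j at piece i.
            prev = max(range(k), key=lambda p: score[i - 1][p]
                       + math.log(max(transition[p][j], EPS)))
            score[i][j] = (score[i - 1][prev]
                           + math.log(max(transition[prev][j], EPS))
                           + math.log(max(piece_probs[i][j], EPS)))
            back[i][j] = prev
    # Trace the best path backwards from the best final field.
    last = max(range(k), key=lambda j: score[n - 1][j])
    path = [last]
    for i in range(n - 1, 0, -1):
        last = back[i][last]
        path.append(last)
    path.reverse()
    return [FIELDS[j] for j in path]


# Hypothetical classifier output for four pieces (rows) over FIELDS (columns).
piece_probs = [
    [0.80, 0.10, 0.05, 0.05],
    [0.10, 0.60, 0.20, 0.10],
    [0.05, 0.15, 0.35, 0.45],  # ambiguous piece: classifier leans to Email
    [0.05, 0.05, 0.10, 0.80],
]
# Hypothetical probability that field (column) follows field (row).
transition = [
    [0.05, 0.60, 0.25, 0.10],
    [0.05, 0.05, 0.70, 0.20],
    [0.05, 0.05, 0.20, 0.70],
    [0.25, 0.25, 0.25, 0.25],
]
prior = [0.7, 0.1, 0.1, 0.1]  # hypothetical probability of the first field

print(independent_labels(piece_probs))
print(decode(piece_probs, transition, prior))
```

With these made-up numbers, labeling each piece independently assigns the third piece to Email, while decoding with the transition probabilities assigns it to Affiliation, because a Title followed by an Affiliation and then an Email is the more plausible field sequence.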

C. Lee Giles | Jia Li | Ding Zhou | Shuyi Zheng
