A knowledge-based approach to citation extraction

Integrating the bibliographical information of scholarly publications available on the Internet is an important task in academic research. Because this information comes from heterogeneous reference sources, integrating it depends on accurate extraction of reference metadata. In this paper, we propose a knowledge-based approach to literature mining and focus on reference metadata extraction methods for scholarly publications. We adopt an ontological knowledge representation framework called INFOMAP to extract the reference metadata automatically. The experimental results show that, by using INFOMAP, we can extract author, title, journal, volume, number (issue), year, and page information from different reference styles with a high degree of accuracy. On a bioinformatics dataset covering six reference styles, the overall average field accuracy of citation extraction is 97.87%.
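To make the task concrete, the following minimal sketch shows what extracting the seven fields from a single reference style might look like, and one plausible way to score field accuracy. It is not INFOMAP: a hand-written regular-expression template stands in for the ontological knowledge representation. The pattern `STYLE_A`, the functions `extract` and `field_accuracy`, the sample reference, and the accuracy definition (fraction of fields matching a gold annotation) are all assumptions made for illustration.

```python
import re

# Hypothetical template for ONE reference style (an assumption, not INFOMAP):
# "Authors. Title. Journal Volume(Issue): Pages (Year)."
STYLE_A = re.compile(
    r"^(?P<author>.+?)\.\s+"
    r"(?P<title>.+?)\.\s+"
    r"(?P<journal>.+?)\s+"
    r"(?P<volume>\d+)\((?P<number>\d+)\):\s*"
    r"(?P<page>\d+-\d+)\s+"
    r"\((?P<year>\d{4})\)\.?$"
)

# The seven fields named in the abstract.
FIELDS = ("author", "title", "journal", "volume", "number", "year", "page")


def extract(reference):
    """Return a dict of the seven fields, or an empty dict if the style does not match."""
    match = STYLE_A.match(reference.strip())
    return match.groupdict() if match else {}


def field_accuracy(predicted, gold):
    """Fraction of the seven fields whose extracted value equals the gold value.

    This definition of field accuracy is an assumption for illustration.
    """
    correct = sum(1 for f in FIELDS if predicted.get(f) == gold.get(f))
    return correct / len(FIELDS)


# Made-up reference string, used only to exercise the template.
sample = "Smith J, Doe A. A made-up title for testing. J Example Res 12(3): 45-67 (2004)."
gold = {
    "author": "Smith J, Doe A",
    "title": "A made-up title for testing",
    "journal": "J Example Res",
    "volume": "12",
    "number": "3",
    "year": "2004",
    "page": "45-67",
}
print(field_accuracy(extract(sample), gold))
```

A template of this kind handles exactly one style; covering several styles would require one pattern per style or a more general representation, which is the gap a knowledge-based framework is meant to address.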
