A Frame-Based Approach for Reference Metadata Extraction

In this paper, we propose a novel frame-based approach (FBA) and use reference metadata extraction as a case study to demonstrate its advantages. The main contributions of this research are threefold. First, the new frame matching algorithm, based on sequence alignment, compensates for the shortcomings of traditional rule-based approaches, in which rule matching lacks flexibility and generality. Second, approximate matching is adopted to capture reasonable abbreviations or errors in the input reference string, further increasing the coverage of the frames. Third, experiments conducted on extensive datasets show that the same knowledge framework performs equally well on various untrained domains. Compared with a widely used machine learning method, Conditional Random Fields (CRFs), the FBA reduces the average field error rate across all four independent test sets by about 70% in relative terms (2.24% vs. 7.54%).
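To make the idea of alignment-based matching with approximate token comparison more concrete, the following Python sketch aligns a frame (a sequence of expected tokens) against the tokens of an input string. It is an illustration only and not the FBA algorithm itself: the function names `token_similarity` and `align_score`, the prefix rule for abbreviations, the gap penalty, and the 0.7 similarity threshold are all assumptions chosen for this example.

```python
from difflib import SequenceMatcher


def token_similarity(a: str, b: str) -> float:
    """Return a similarity in [0, 1] between two tokens (illustrative heuristic)."""
    a = a.lower().rstrip(".")
    b = b.lower().rstrip(".")
    if a == b:
        return 1.0
    # Assumption: a prefix of at least 3 characters is a plausible abbreviation,
    # e.g. "Proc." for "Proceedings".
    if (len(a) >= 3 and b.startswith(a)) or (len(b) >= 3 and a.startswith(b)):
        return 0.9
    # Otherwise fall back to a character-level ratio, which tolerates small typos.
    return SequenceMatcher(None, a, b).ratio()


def align_score(frame, tokens, gap=-0.5, threshold=0.7):
    """Global (Needleman-Wunsch style) alignment score between frame and tokens."""
    n, m = len(frame), len(tokens)
    dp = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dp[i][0] = i * gap
    for j in range(1, m + 1):
        dp[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sim = token_similarity(frame[i - 1], tokens[j - 1])
            # Tokens below the threshold count as a mismatch (hypothetical penalty).
            sub = sim if sim >= threshold else -1.0
            dp[i][j] = max(
                dp[i - 1][j - 1] + sub,  # align frame token with input token
                dp[i - 1][j] + gap,      # frame token missing from the input
                dp[i][j - 1] + gap,      # extra input token not in the frame
            )
    return dp[n][m]


frame = ["proceedings", "of", "the", "conference"]
abbreviated = ["Proc.", "of", "the", "Conf."]
unrelated = ["Springer", "Berlin"]
print(align_score(frame, abbreviated) > align_score(frame, unrelated))
```

In this sketch the abbreviated input still aligns well with the frame, while an unrelated token sequence scores lower because of gap and mismatch penalties. A real system would align against typed slots (author, title, venue, year) rather than literal tokens.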
