A hybrid two-stage approach for discipline-independent canonical representation extraction from references

In education and research, references play a key role. However, extracting and parsing references is difficult. One concern is that there are many reference styles; given only the surface form of a reference, identifying which style was used is hard, especially in heterogeneous collections of theses and dissertations, which span many fields and disciplines and may mix styles even within a single publication. We address these problems by drawing on suitable knowledge found on the Web. In particular, we investigate a two-stage classifier approach involving multi-class classification with respect to reference styles, and we partially solve the problem of parsing the surface representations of references. We present empirical evidence for the effectiveness of our approach and describe plans for improving our methods.
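
To make the two-stage idea concrete, the following is a minimal sketch and not the method evaluated in this work. It assumes that the first stage assigns a reference string to a style and that the second stage applies a style-specific parser to produce a canonical record. The two style labels, the hand-written cues standing in for a trained multi-class classifier, the regular expressions, and the function names `classify_style` and `parse_reference` are all hypothetical and chosen only for illustration.

```python
import re

# Stage 1 cues (hypothetical): one surface hint per style. A real system
# would use a trained multi-class classifier instead of these patterns.
STYLE_CUES = {
    "apa": re.compile(r"\(\d{4}\)\."),   # year in parentheses, then a period
    "ieee": re.compile(r',"\s'),         # quoted title closed by a comma
}

# Stage 2 parsers (hypothetical): one field-extracting pattern per style.
STYLE_PARSERS = {
    "apa": re.compile(
        r"^(?P<authors>.+?)\s\((?P<year>\d{4})\)\.\s"
        r"(?P<title>.+?)\.\s(?P<venue>.+?)\.?$"
    ),
    "ieee": re.compile(
        r'^(?P<authors>.+?),\s"(?P<title>.+?),"\s'
        r"(?P<venue>.+?),\s(?P<year>\d{4})\.?$"
    ),
}


def classify_style(reference: str) -> str:
    """Stage 1: return the style whose cue fires, or "unknown" if none does."""
    scores = {
        style: int(bool(cue.search(reference)))
        for style, cue in STYLE_CUES.items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else "unknown"


def parse_reference(reference: str) -> dict:
    """Stage 2: parse with the parser for the predicted style.

    Falls back to returning the raw string when the style is unknown
    or the style-specific pattern does not match.
    """
    text = reference.strip()
    style = classify_style(text)
    parser = STYLE_PARSERS.get(style)
    match = parser.match(text) if parser else None
    if match is None:
        return {"style": style, "raw": text}
    record = match.groupdict()
    record["style"] = style
    return record


if __name__ == "__main__":
    examples = [
        "Cortes, C., & Vapnik, V. (1995). Support-vector networks. Machine Learning.",
        'C. Cortes and V. Vapnik, "Support-vector networks," Machine Learning, 1995.',
    ]
    for ref in examples:
        print(parse_reference(ref))
```

Both example strings describe the same work in different surface forms, and the sketch maps them to records with the same title, venue, and year fields. Separating style identification from parsing keeps each parser simple, at the cost of propagating any stage-one error into stage two.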
