A Structural SVM Approach for Reference Parsing

MEDLINE®, the flagship database of the U.S. National Library of Medicine, is a critical source of information for biomedical research and clinical medicine. Automated extraction of bibliographic data, such as article titles, author names, abstracts, and references, is essential to building this citation database affordably. References, which typically appear at the end of journal articles, provide valuable information for extracting Comment-On/Comment-In data (identifying commentary article pairs) and for assigning MeSH terms to an article. The regular structure of references lets us apply structural SVM, a recently developed structured learning algorithm, to reference parsing. In this study, we use two types of contextual features to compare structural SVM with conventional SVM. When only basic observation features are used for each token, structural SVM outperforms conventional SVM because it also exploits contextual label features. However, when contextual observation features from neighboring tokens are added, the performance of conventional SVM improves greatly, and with second-order contextual observation features it comes close to that of structural SVM. Both methods achieve above 98% token classification accuracy and above 95% overall chunk-level accuracy for reference parsing.
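To make the distinction between basic and contextual observation features concrete, the following is a minimal Python sketch, not the authors' implementation. The feature names, the tokenizer, and the `PAD` marker are all hypothetical choices made for illustration; the abstract does not list the actual feature set. The sketch assumes that an order-k contextual observation feature vector for a token is the union of the basic features of the tokens at offsets -k through +k.

```python
import re


def tokenize(reference):
    """Split a reference string into word and punctuation tokens (assumed scheme)."""
    return re.findall(r"\w+|[^\w\s]", reference)


def basic_features(token):
    """Hypothetical per-token observation features; not the paper's feature set."""
    return {
        "lower=" + token.lower(): 1,
        "is_digit": int(token.isdigit()),
        "is_capitalized": int(token[:1].isupper()),
        "is_punct": int(not any(ch.isalnum() for ch in token)),
        "looks_like_year": int(bool(re.fullmatch(r"(19|20)\d{2}", token))),
    }


def contextual_features(tokens, order):
    """Combine each token's basic features with those of its neighbors.

    order=0 keeps only the token's own features, order=1 adds the
    immediate neighbors, order=2 adds neighbors two positions away.
    Positions outside the reference are marked with a PAD feature.
    """
    base = [basic_features(t) for t in tokens]
    rows = []
    for i in range(len(tokens)):
        row = {}
        for offset in range(-order, order + 1):
            j = i + offset
            if 0 <= j < len(tokens):
                # Prefix each feature with its signed offset, e.g. "-1:is_punct".
                for name, value in base[j].items():
                    row[f"{offset:+d}:{name}"] = value
            else:
                row[f"{offset:+d}:PAD"] = 1
        rows.append(row)
    return rows


tokens = tokenize("Zou J, Le D. Locating references. 2009.")
first_order = contextual_features(tokens, order=1)
second_order = contextual_features(tokens, order=2)
```

Each row is a sparse feature dictionary that a per-token classifier such as a conventional SVM could consume after vectorization. A structural SVM would additionally model dependencies between the labels of adjacent tokens, which this sketch does not attempt to show.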
