A structural SVM approach for reference parsing

Background: Automated extraction of bibliographic data, such as article titles, author names, abstracts, and references, is essential to the affordable creation of large citation databases. References, which typically appear at the end of journal articles, can also provide valuable information for extracting other bibliographic data. Parsing individual references to extract author, title, journal, year, etc. is therefore sometimes a necessary preprocessing step in building citation-indexing systems. The regular structure of references allows us to treat reference parsing as a sequence learning problem and to study the structural Support Vector Machine (structural SVM), a recently developed structured learning algorithm, for parsing references.

Results: In this study, we implemented structural SVM and used two types of contextual features to compare structural SVM with conventional SVM. Both methods achieve above 98% token classification accuracy and above 95% overall chunk-level accuracy for reference parsing. We also compared SVM and structural SVM to Conditional Random Field (CRF). The experimental results show that structural SVM and CRF achieve similar accuracies at the token and chunk levels.

Conclusions: When only basic observation features are used for each token, structural SVM outperforms SVM because it utilizes the contextual label features. However, when the contextual observation features from neighboring tokens are combined, SVM performance improves greatly and, after adding the second-order contextual observation features, is close to that of structural SVM. The comparison of these two methods with CRF using the same set of binary features shows that both structural SVM and CRF perform better than SVM, indicating their stronger sequence learning ability in reference parsing.
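
To make the idea of contextual observation features concrete, the following is a minimal sketch of how binary features of neighboring tokens might be combined with a token's own features before classification. It is not the authors' implementation: the function names `basic_features` and `contextual_features`, the specific features, and the reading of "order" as the window radius (order 1 uses tokens at offsets -1 and +1, order 2 adds offsets -2 and +2) are all assumptions made for illustration.

```python
from typing import Dict, List


def basic_features(token: str) -> Dict[str, bool]:
    """Binary observation features for one token (illustrative choices only)."""
    return {
        "is_capitalized": token[:1].isupper(),
        "is_all_digits": token.isdigit(),
        "is_year_like": len(token) == 4 and token.isdigit()
                        and token[:2] in ("19", "20"),
        "ends_with_period": token.endswith("."),
        "has_comma": "," in token,
    }


def contextual_features(tokens: List[str], i: int, order: int) -> Dict[str, bool]:
    """Combine the features of token i with those of its neighbors.

    Assumption: `order` is the window radius, so order=0 gives only the
    token's own features and order=2 covers offsets -2..+2.
    """
    feats: Dict[str, bool] = {}
    for offset in range(-order, order + 1):
        j = i + offset
        if 0 <= j < len(tokens):
            # Prefix each feature name with its signed offset, e.g. "-1:has_comma".
            for name, value in basic_features(tokens[j]).items():
                feats[f"{offset:+d}:{name}"] = value
        else:
            # Mark positions that fall outside the reference string.
            feats[f"{offset:+d}:pad"] = True
    return feats


if __name__ == "__main__":
    tokens = ["Zhang", "X.", "2010", "ICMLA"]
    for name, value in sorted(contextual_features(tokens, 2, order=1).items()):
        print(name, value)
```

With features of this kind, a conventional SVM classifies each token independently but still sees some of its surroundings, which is one plausible reading of why its accuracy approaches that of structural SVM once neighboring-token features are added.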
