Locating and parsing bibliographic references in HTML medical articles

The set of references that typically appears toward the end of a journal article is sometimes, though not always, a field in bibliographic (citation) databases. Even when references do not constitute such a field, locating and parsing them is a useful preprocessing step in the automated extraction of other bibliographic data from articles, as well as in computer-assisted indexing of articles. Automating data extraction and indexing to minimize human labor is key to the affordable creation and maintenance of large bibliographic databases. Extracting the components of references, such as author names, article title, journal name, publication date and other entities, is therefore a valuable and sometimes necessary task. This paper describes a two-step process that uses statistical machine learning algorithms first to locate the references in HTML medical articles and then to parse them. Reference locating identifies the reference section in an article and then decomposes it into individual references. We formulate this step as a two-class classification problem based on text and geometric features. An evaluation conducted on 500 articles drawn from 100 medical journals achieves near-perfect precision and recall rates for locating references. Reference parsing identifies the components of each reference. For this second step, we implement and compare two algorithms. One relies on sequence statistics and trains a Conditional Random Field. The other focuses on local feature statistics: it trains a Support Vector Machine to classify each individual word, and then applies a search algorithm that systematically corrects low-confidence labels if the label sequence violates a set of predefined rules. The overall performance of these two reference-parsing algorithms is about the same: above 99% accuracy at the word level, and over 97% accuracy at the chunk level.
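
The second parsing approach (per-word classification followed by rule-driven correction of low-confidence labels) can be illustrated with a minimal Python sketch. It is not the authors' implementation: the label names, the single rule in is_valid (each field must form one contiguous chunk), the 0.8 confidence threshold, and the exhaustive search over low-confidence words are all assumptions made for illustration. The per-word scores stand in for the confidences a word classifier such as an SVM might output.

```python
from itertools import product
from typing import Dict, List

# One dict per word: label -> classifier confidence (hypothetical values).
Scores = Dict[str, float]


def is_valid(labels: List[str]) -> bool:
    """Assumed rule: every field occupies a single contiguous run of words."""
    seen = set()
    prev = None
    for lab in labels:
        if lab != prev:
            if lab in seen:  # the field reappears after being interrupted
                return False
            seen.add(lab)
            prev = lab
    return True


def correct(scores: List[Scores], threshold: float = 0.8) -> List[str]:
    """Relabel low-confidence words so the sequence satisfies is_valid."""
    # Start from the most confident label for each word.
    best = [max(s, key=s.get) for s in scores]
    if is_valid(best):
        return best

    # Only words whose top label falls below the threshold may change.
    low = [i for i, s in enumerate(scores) if s[best[i]] < threshold]

    candidates = []
    for combo in product(*(sorted(scores[i]) for i in low)):
        trial = list(best)
        for i, lab in zip(low, combo):
            trial[i] = lab
        if is_valid(trial):
            # Score a repair by the product of the confidences it uses.
            p = 1.0
            for i, lab in zip(low, combo):
                p *= scores[i][lab]
            candidates.append((p, trial))

    if not candidates:  # no rule-satisfying repair exists; keep the original
        return best
    return max(candidates, key=lambda c: c[0])[1]


# Six words of a reference; the fourth is weakly (and wrongly) labeled "author".
word_scores = [
    {"author": 0.95, "title": 0.03, "year": 0.02},
    {"author": 0.90, "title": 0.08, "year": 0.02},
    {"author": 0.05, "title": 0.90, "year": 0.05},
    {"author": 0.50, "title": 0.45, "year": 0.05},
    {"author": 0.05, "title": 0.90, "year": 0.05},
    {"author": 0.02, "title": 0.03, "year": 0.95},
]
print(correct(word_scores))
```

In this example the initial labeling breaks the contiguity rule because "author" reappears in the middle of the title, and the only repair that changes just the low-confidence word relabels it as "title". Exhaustive enumeration is exponential in the number of low-confidence words, so a practical system would likely need a more economical search.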
