Automatically Classifying Sentences in Full-Text Biomedical Articles into Introduction, Methods, Results and Discussion

Biomedical texts can typically be represented by four rhetorical categories: introduction, methods, results and discussion (IMRAD). Classifying sentences into these categories can benefit many other text-mining tasks. Although many studies have automatically classified sentences in MEDLINE abstracts into the IMRAD categories, few have explored the classification of sentences in full-text biomedical articles. We explored several approaches to automatically classifying sentences in full-text biomedical articles into the IMRAD categories. Our best system is a support vector machine classifier that achieved 81.30% accuracy, significantly higher than that of the baseline systems.
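To make the task concrete, the following is a minimal sketch of sentence-level IMRAD classification with a linear support vector machine. It is not the authors' system: the bag-of-words features, the one-vs-rest setup, the subgradient training loop, the hyperparameters and the toy sentences are all assumptions chosen for illustration, since the abstract does not specify the features or the implementation used.

```python
from collections import Counter, defaultdict

LABELS = ["introduction", "methods", "results", "discussion"]

# Toy training sentences, invented for illustration (not from the paper's corpus).
TRAIN = [
    ("previous studies suggest protein plays important role", "introduction"),
    ("little known about underlying mechanism", "introduction"),
    ("cells were cultured incubated overnight", "methods"),
    ("samples centrifuged then washed twice", "methods"),
    ("expression increased significantly after treatment", "results"),
    ("figure shows higher levels observed", "results"),
    ("our findings indicate possible explanation", "discussion"),
    ("further work should address limitations", "discussion"),
]


def featurize(sentence):
    """Bag-of-words counts over lowercased whitespace tokens (an assumed feature set)."""
    return Counter(sentence.lower().split())


def score(weights, feats):
    """Linear decision value w . x for a sparse feature vector."""
    return sum(weights.get(tok, 0.0) * cnt for tok, cnt in feats.items())


def train_binary(examples, epochs=20, lr=0.1, lam=0.01):
    """Linear SVM trained by subgradient descent on the L2-regularized hinge loss.

    examples: list of (feature Counter, y) pairs with y in {+1, -1}.
    """
    w = defaultdict(float)
    for _ in range(epochs):
        for feats, y in examples:
            violated = y * score(w, feats) < 1.0  # inside the margin or misclassified
            # Regularization step: shrink every weight seen so far.
            for tok in w:
                w[tok] *= 1.0 - lr * lam
            # Hinge-loss step: move toward the example only if the margin is violated.
            if violated:
                for tok, cnt in feats.items():
                    w[tok] += lr * y * cnt
    return w


def train_one_vs_rest(data):
    """Train one binary SVM per IMRAD label (that label vs. all others)."""
    featurized = [(featurize(s), label) for s, label in data]
    return {
        label: train_binary([(f, 1 if l == label else -1) for f, l in featurized])
        for label in LABELS
    }


def classify(models, sentence):
    """Return the label whose binary classifier gives the highest decision value."""
    feats = featurize(sentence)
    return max(LABELS, key=lambda label: score(models[label], feats))


MODELS = train_one_vs_rest(TRAIN)
```

With this toy data, `classify(MODELS, "cells were washed overnight")` returns `"methods"`. A realistic system would need a large annotated corpus, richer features and held-out evaluation; the 81.30% accuracy reported above refers to the authors' system, not to this sketch.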
