Automated method for extracting “citation sentences” from online biomedical articles using an SVM-based text summarization technique

Comment-on (CON), a MEDLINE citation field, indicates previously published articles commented on by the authors of a given article, who may express complimentary or contradictory opinions. Our approach to identifying the CON list for a given article is to first extract all “citation sentences” from the body text, then to recognize which of these sentences mention CON articles (“CON sentences”), and finally to analyze the corresponding bibliographic data in the reference section. As a preprocessing step for identifying the CON list, this paper presents a general method for extracting “citation sentences” from the body text of online biomedical articles using a support vector machine (SVM)-based text summarization technique. Input feature vectors for the SVM are created by combining four types of features: 1) word statistics representing how differently a word occurs in “citation sentences” compared to other sentences, and the presence in a sentence of 2) author names, 3) publication years, and 4) citation tags. A rule-based post-processing step is also introduced to further reduce false negative errors in detecting “citation sentences”. Experiments on a set of online biomedical articles show that an SVM with a radial basis function (RBF) kernel achieves good overall performance in terms of accuracy, precision, recall, and F-measure. Our experiments also show that errors in extracting “citation sentences” cause a minor degradation of performance in identifying CON sentences, but that this degradation can be reduced through the proposed rule-based post-processing.
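To make the four feature types concrete, the following is a minimal Python sketch of how a sentence might be mapped to a four-element feature vector. It is an illustration under stated assumptions, not the implementation used in this work: the regular expressions, the smoothed log-ratio used as the word statistic, and the names `word_scores` and `features` are all hypothetical choices made for this example.

```python
import math
import re
from collections import Counter

# Hypothetical patterns: the abstract does not specify the actual rules.
YEAR_RE = re.compile(r"\b(?:19|20)\d{2}\b")                       # e.g. 2005
TAG_RE = re.compile(r"\[\d+(?:\s*[,-]\s*\d+)*\]|\(\d+(?:\s*[,-]\s*\d+)*\)")  # e.g. [12], (3,4)
AUTHOR_RE = re.compile(r"\b[A-Z][a-z]+(?: et al\.?| and [A-Z][a-z]+)")        # e.g. Jones et al.


def tokenize(sentence):
    """Lowercase the sentence and keep alphabetic tokens only."""
    return re.findall(r"[a-z]+", sentence.lower())


def word_scores(citation_sents, other_sents):
    """Assumed word statistic: add-one smoothed log-ratio of a word's
    relative frequency in citation sentences versus other sentences."""
    cit = Counter(w for s in citation_sents for w in tokenize(s))
    oth = Counter(w for s in other_sents for w in tokenize(s))
    vocab = set(cit) | set(oth)
    n_cit = sum(cit.values()) + len(vocab)
    n_oth = sum(oth.values()) + len(vocab)
    return {
        w: math.log(((cit[w] + 1) / n_cit) / ((oth[w] + 1) / n_oth))
        for w in vocab
    }


def features(sentence, scores):
    """Return [word statistic, has author name, has year, has citation tag]."""
    words = tokenize(sentence)
    # Average score over the sentence; unseen words contribute 0.
    stat = sum(scores.get(w, 0.0) for w in words) / len(words) if words else 0.0
    return [
        stat,
        1.0 if AUTHOR_RE.search(sentence) else 0.0,
        1.0 if YEAR_RE.search(sentence) else 0.0,
        1.0 if TAG_RE.search(sentence) else 0.0,
    ]
```

Vectors produced this way could then be passed to any SVM implementation with an RBF kernel, for example scikit-learn's `SVC(kernel="rbf")`; that choice of library is likewise an assumption of this sketch.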