Section Identification to Improve Information Extraction from Chinese Medical Literature

The Chinese medical literature contains a large amount of knowledge. Reducing the effort medical scholars need to extract this knowledge requires literature analysis that identifies the key information in each paper. We argue that identifying the sections of a paper helps filter noise from the paper and increases the accuracy of extracting its experimental findings. In this research in progress, we treat paper section identification as a sentence classification task and apply Conditional Random Fields (CRFs) to it. Our model combines lexical and structural features to facilitate section identification. Experiments on a human-curated asthma dataset show that our approach achieves a 10%–20% performance improvement over Support Vector Machines (SVMs), and that using both bag-of-words features and domain lexicons benefits the task.
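To make the combination of lexical and structural features concrete, the sketch below builds one feature dictionary per sentence, in document order, as a sequence model such as a CRF would consume. It is only an illustration under stated assumptions: the lexicon entries, the feature names, and the positional features are hypothetical, since the abstract does not list the actual features, and sentences are assumed to be already word-segmented.

```python
from typing import Dict, List

# Hypothetical domain lexicon; the abstract does not give the real one.
DOMAIN_LEXICON = {"asthma", "placebo", "randomized", "patients"}


def sentence_features(tokens: List[str], index: int, total: int) -> Dict[str, float]:
    """Build lexical and structural features for one sentence.

    tokens: the already-segmented words of the sentence (assumption).
    index:  position of the sentence in the paper (0-based).
    total:  number of sentences in the paper.
    """
    feats: Dict[str, float] = {}

    # Lexical: bag-of-words indicators.
    for tok in tokens:
        feats[f"bow={tok.lower()}"] = 1.0

    # Lexical: how many tokens appear in the domain lexicon.
    hits = sum(1 for tok in tokens if tok.lower() in DOMAIN_LEXICON)
    feats["lexicon_hits"] = float(hits)

    # Structural (assumed for illustration): where the sentence sits in the paper.
    feats["rel_position"] = index / max(total - 1, 1)
    feats["is_first"] = 1.0 if index == 0 else 0.0
    feats["is_last"] = 1.0 if index == total - 1 else 0.0
    return feats


def paper_to_sequence(sentences: List[List[str]]) -> List[Dict[str, float]]:
    """Convert a paper (list of tokenized sentences) into a feature sequence."""
    total = len(sentences)
    return [sentence_features(s, i, total) for i, s in enumerate(sentences)]
```

Each paper becomes one sequence of feature dictionaries, so a linear-chain model can label every sentence with a section while taking the labels of neighbouring sentences into account, which is what distinguishes this setup from classifying each sentence independently with an SVM.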
