Unsupervised discovery of information structure in biomedical documents

MOTIVATION Information structure (IS) analysis is a text mining technique, which classifies text in biomedical articles into categories that capture different types of information, such as objectives, methods, results and conclusions of research. It is a highly useful technique that can support a range of Biomedical Text Mining tasks and can help readers of biomedical literature find information of interest faster, accelerating the highly time-consuming process of literature review. Several approaches to IS analysis have been presented in the past, with promising results in real-world biomedical tasks. However, all existing approaches, even weakly supervised ones, require several hundreds of hand-annotated training sentences specific to the domain in question. Because biomedicine is subject to considerable domain variation, such annotations are expensive to obtain. This makes the application of IS analysis across biomedical domains difficult. In this article, we investigate an unsupervised approach to IS analysis and evaluate the performance of several unsupervised methods on a large corpus of biomedical abstracts collected from PubMed. RESULTS Our best unsupervised algorithm (multilevel-weighted graph clustering algorithm) performs very well on the task, obtaining over 0.70 F scores for most IS categories when applied to well-known IS schemes. This level of performance is close to that of lightly supervised IS methods and has proven sufficient to aid a range of practical tasks. Thus, using an unsupervised approach, IS could be applied to support a wide range of tasks across sub-domains of biomedicine. We also demonstrate that unsupervised learning brings novel insights into IS of biomedical literature and discovers information categories that are not present in any of the existing IS schemes. AVAILABILITY AND IMPLEMENTATION The annotated corpus and software are available at http://www.cl.cam.ac.uk/∼dk427/bio14info.html.

[1]  Catherine Blake,et al.  Beyond genes, proteins, and abstracts: Identifying scientific claims from full-text biomedical articles , 2010, J. Biomed. Informatics.

[2]  Beatrice Santorini Part-of-speech tagging guidelines for the penn treebank project , 1990 .

[3]  Naoaki Okazaki,et al.  Identifying Sections in Scientific Abstracts using Conditional Random Fields , 2008, IJCNLP.

[4]  Luciana B Sollaci,et al.  The introduction, methods, results, and discussion (IMRAD) structure: a fifty-year survey. , 2004, Journal of the Medical Library Association : JMLA.

[5]  Jimmy J. Lin,et al.  Generative Content Models for Structural Analysis of Medical Abstracts , 2006, BioNLP@NAACL-HLT.

[6]  Simone Teufel,et al.  Corpora for the Conceptualisation and Zoning of Scientific Papers , 2010, LREC.

[7]  Geoffrey E. Hinton,et al.  A View of the Em Algorithm that Justifies Incremental, Sparse, and other Variants , 1998, Learning in Graphical Models.

[8]  Jean Carletta,et al.  An annotation scheme for discourse-level argumentation in research articles , 1999, EACL.

[9]  Marc Moens,et al.  Articles Summarizing Scientific Articles: Experiments with Relevance and Rhetorical Status , 2002, CL.

[10]  Patrick Ruch,et al.  Using Argumentation to Retrieve Articles with Similar Citations from MEDLINE , 2004, NLPBA/BioNLP.

[11]  Fabio Ciravegna,et al.  Unsupervised document zone identification using probabilistic graphical models , 2012, LREC.

[12]  Karen Sparck Jones A statistical interpretation of term specificity and its application in retrieval , 1972 .

[13]  Michael I. Jordan,et al.  Latent Dirichlet Allocation , 2001, J. Mach. Learn. Res..

[14]  Beatrice Santorini,et al.  Part-of-Speech Tagging Guidelines for the Penn Treebank Project (3rd Revision) , 1990 .

[15]  Sophia Ananiadou,et al.  Analysing Entity Type Variation across Biomedical Subdomains , 2012 .

[16]  K. Bretonnel Cohen,et al.  Current issues in biomedical text mining and natural language processing , 2009, J. Biomed. Informatics.

[17]  Nigel Collier,et al.  Zone analysis in biology articles as a basis for information extraction , 2006, Int. J. Medical Informatics.

[18]  Nigel Collier,et al.  A baseline feature set for learning rhetorical zones using full articles in the biomedical domain , 2005, SKDD.

[19]  Maria Liakata,et al.  Identifying the Information Structure of Scientific Abstracts: An Investigation of Three Different Schemes , 2010, BioNLP@ACL.

[20]  Shashank Agarwal,et al.  Automatically Classifying Sentences in Full-Text Biomedical Articles into Introduction, Methods, Results and Discussion , 2009, Summit on translational bioinformatics.

[21]  C. J. van Rijsbergen,et al.  FOUNDATION OF EVALUATION , 1974 .

[22]  Maria Liakata,et al.  A comparison and user-based evaluation of models of textual information structure in the context of cancer risk assessment , 2011, BMC Bioinformatics.

[23]  Dietrich Rebholz-Schuhmann,et al.  Using argumentation to extract key sentences from biomedical abstracts , 2007, Int. J. Medical Informatics.

[24]  Patrick Ruch,et al.  Using argumentation to retrieve articles with similar citations: An inquiry into improving related articles search in the MEDLINE digital library , 2006, Int. J. Medical Informatics.

[25]  Hans-Peter Kriegel,et al.  Clustering high-dimensional data: A survey on subspace clustering, pattern-based clustering, and correlation clustering , 2009, TKDD.

[26]  J. MacQueen Some methods for classification and analysis of multivariate observations , 1967 .

[27]  Julia Hirschberg,et al.  V-Measure: A Conditional Entropy-Based External Cluster Evaluation Measure , 2007, EMNLP.

[28]  Bonnie L. Webber,et al.  Discourse structure and language technology , 2011, Natural Language Engineering.

[29]  Karen Spärck Jones A statistical interpretation of term specificity and its application in retrieval , 2021, J. Documentation.

[30]  Anna Korhonen,et al.  Using Argumentative Zones for Extractive Summarization of Scientific Articles , 2012, COLING.

[31]  Simone Teufel,et al.  Towards Domain-Independent Argumentative Zoning: Evidence from Chemistry and Computational Linguistics , 2009, EMNLP.

[32]  Anna Korhonen,et al.  Improving Verb Clustering with Automatically Acquired Selectional Preferences , 2009, EMNLP.

[33]  Dina Demner-Fushman,et al.  Biomedical Text Mining: A Survey of Recent Progress , 2012, Mining Text Data.

[34]  Hagit Shatkay,et al.  New directions in biomedical text annotation: definitions, guidelines and corpus construction , 2006, BMC Bioinformatics.

[35]  Anna Korhonen,et al.  Weakly supervised learning of information structure of scientific abstracts - is it accurate enough to benefit real-world tasks in biomedicine? , 2011, Bioinform..

[36]  D. Rubin,et al.  Maximum likelihood from incomplete data via the EM - algorithm plus discussions on the paper , 1977 .

[37]  Christopher Potts,et al.  Recursive Deep Models for Semantic Compositionality Over a Sentiment Treebank , 2013, EMNLP.

[38]  Wendy Filsell,et al.  What the papers say: Text mining for genomics and systems biology , 2010, Human Genomics.

[39]  Julio Gonzalo,et al.  A comparison of extrinsic clustering evaluation metrics based on formal constraints , 2009, Information Retrieval.

[40]  Inderjit S. Dhillon,et al.  Kernel k-means: spectral clustering and normalized cuts , 2004, KDD.

[41]  Inderjit S. Dhillon,et al.  Concept Decompositions for Large Sparse Text Data Using Clustering , 2004, Machine Learning.

[42]  Inderjit S. Dhillon,et al.  Weighted Graph Cuts without Eigenvectors A Multilevel Approach , 2007, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[43]  Inderjit S. Dhillon,et al.  A fast kernel-based multilevel algorithm for graph clustering , 2005, KDD '05.

[44]  Andrew Y. Ng,et al.  Parsing Natural Scenes and Natural Language with Recursive Neural Networks , 2011, ICML.

[45]  Johan Bos,et al.  Linguistically Motivated Large-Scale NLP with C&C and Boxer , 2007, ACL.

[46]  Anna Korhonen,et al.  Exploring subdomain variation in biomedical language , 2010, BMC Bioinformatics.

[47]  Anna Korhonen,et al.  Active learning-based information structure analysis of full scientific articles and two applications for biomedical literature review , 2013, Bioinform..