Improving MeSH classification of biomedical articles using citation contexts

Medical Subject Headings (MeSH) are used to index the majority of databases generated by the National Library of Medicine. MeSH terms are designed to make information, such as scientific articles, more retrievable and accessible to users of systems such as PubMed. This paper proposes a novel method for automating the assignment of MeSH terms to biomedical publications that takes advantage of citation references to these publications. Our findings show that analysing the citation references that point to a document can provide a useful source of terms that are not present in the document itself. The use of these citation contexts, as they are known, can thus provide a richer document feature representation, which in turn can help improve text mining and information retrieval applications, in our case MeSH term classification. In this paper, we also explore new methods of selecting and utilising citation contexts. In particular, we assess the effect of weighting the importance of citation terms (found in the citation contexts) according to two aspects: (i) the section of the paper they appear in and (ii) their distance to the citation marker. We conduct intrinsic and extrinsic evaluations of citation term quality. For the intrinsic evaluation, we rely on the UMLS Metathesaurus conceptual database to explore the semantic characteristics of the mined citation terms. We also analyse the "informativeness" of these terms using a class-entropy measure. For the extrinsic evaluation, we run a series of automatic document classification experiments over MeSH terms. Our experimental evaluation shows that citation contexts contain terms that are related to the original document, and that integrating this knowledge yields better classification performance than two state-of-the-art MeSH classification systems: MeSHUP and MTI. Our experiments also demonstrate that taking the Section and Distance factors into account can lead to statistically significant improvements in citation feature quality, thus opening the way for better document feature representation in other biomedical text processing applications.
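To make the Section and Distance weighting and the class-entropy idea more concrete, the following is a minimal Python sketch. It is not the implementation used in the paper: the section weights, the inverse-distance decay, the `[CIT]` marker token and the function names are all hypothetical choices made for illustration, and the entropy function is one common reading of a class-entropy measure (the Shannon entropy of a term's distribution over classes).

```python
import math
from collections import defaultdict

# Hypothetical section weights (assumption): the abstract does not state
# which sections are favoured or by how much.
SECTION_WEIGHTS = {
    "introduction": 0.5,
    "methods": 1.0,
    "results": 1.0,
    "discussion": 0.75,
}


def weight_citation_terms(context_tokens, marker_index, section, decay=0.5):
    """Weight each term in one citation context.

    A token's weight is section_weight / (1 + decay * distance), where
    distance is the number of token positions between the token and the
    citation marker. This decay function is an illustrative assumption.
    """
    section_weight = SECTION_WEIGHTS.get(section, 1.0)
    weights = defaultdict(float)
    for i, token in enumerate(context_tokens):
        if i == marker_index:
            continue  # the citation marker itself is not a term
        distance = abs(i - marker_index)
        weights[token] += section_weight / (1.0 + decay * distance)
    return dict(weights)


def class_entropy(class_counts):
    """Shannon entropy (in bits) of a term's distribution over classes.

    class_counts maps a class label (for example a MeSH term) to the number
    of documents of that class in which the term occurs. Lower entropy means
    the term is concentrated in fewer classes.
    """
    total = sum(class_counts.values())
    entropy = 0.0
    for count in class_counts.values():
        if count > 0:
            p = count / total
            entropy -= p * math.log2(p)
    return entropy


# Toy citation context; "[CIT]" stands in for the citation marker.
tokens = ["insulin", "signalling", "[CIT]", "in", "diabetes"]
term_weights = weight_citation_terms(tokens, marker_index=2, section="methods")
```

In this sketch, terms adjacent to the marker ("signalling", "in") receive a higher weight than terms two positions away, and the same context found in an "introduction" section would have every weight halved. A term spread evenly over two classes has an entropy of 1 bit, while a term confined to a single class has an entropy of 0.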
