Generative Content Models for Structural Analysis of Medical Abstracts

The ability to accurately model the content structure of text is important for many natural language processing applications. This paper describes experiments with generative models for analyzing the discourse structure of medical abstracts, which generally follow the pattern of "introduction", "methods", "results", and "conclusions". We demonstrate that Hidden Markov Models are capable of accurately capturing the structure of such texts, and can achieve classification accuracy comparable to that of discriminative techniques. In addition, generative approaches provide advantages that may make them preferable to discriminative techniques such as Support Vector Machines under certain conditions. Our work makes two contributions: at the application level, we report good performance on an interesting task in an important domain; more generally, our results contribute to an ongoing discussion regarding the tradeoffs between generative and discriminative techniques.
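As a rough illustration of how a Hidden Markov Model can label the sentences of an abstract with section names, the sketch below runs Viterbi decoding over the four sections named above. It is a toy under stated assumptions, not the models or features used in the experiments: the cue-word lists, the unnormalized emission score, and the stay/advance transition probabilities are all hypothetical, and a real system would estimate these quantities from labeled abstracts.

```python
import math

STATES = ["introduction", "methods", "results", "conclusions"]

# Hypothetical cue words per section (assumption for illustration only).
CUES = {
    "introduction": {"background", "aim", "objective", "purpose"},
    "methods": {"randomized", "enrolled", "measured", "assigned"},
    "results": {"significantly", "increased", "decreased", "mean"},
    "conclusions": {"suggest", "conclude", "should", "may"},
}

STAY, ADVANCE = 0.6, 0.4  # assumed left-to-right transition probabilities


def log_emission(state, sentence):
    """Toy unnormalized emission score: cue-word hits favor a state."""
    words = set(sentence.lower().replace(".", "").split())
    hits = len(words & CUES[state])
    return math.log(0.1 + hits)


def log_transition(i, j):
    """Allow staying in a section or moving to the next one only."""
    if j == i:
        # The final section can only loop on itself (probability 1).
        return math.log(STAY) if i < len(STATES) - 1 else 0.0
    if j == i + 1:
        return math.log(ADVANCE)
    return -math.inf


def viterbi(sentences):
    """Return the highest-scoring section label for each sentence."""
    n = len(STATES)
    # Assume every abstract starts in the first section.
    scores = [log_emission(STATES[0], sentences[0])] + [-math.inf] * (n - 1)
    backpointers = []
    for sentence in sentences[1:]:
        new_scores, pointers = [], []
        for j in range(n):
            best_i = max(range(n), key=lambda i: scores[i] + log_transition(i, j))
            pointers.append(best_i)
            new_scores.append(
                scores[best_i]
                + log_transition(best_i, j)
                + log_emission(STATES[j], sentence)
            )
        scores = new_scores
        backpointers.append(pointers)
    # Trace back from the best final state.
    state = max(range(n), key=lambda j: scores[j])
    path = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        path.append(state)
    return [STATES[s] for s in reversed(path)]


abstract = [
    "The objective was to assess drug efficacy.",
    "Patients were randomized and enrolled in two groups.",
    "Blood pressure decreased significantly in the treatment group.",
    "These findings suggest the drug may help.",
]
labels = viterbi(abstract)
```

Because the transition structure only permits staying in a section or advancing to the next, the decoder exploits the fixed ordering of sections in addition to the per-sentence evidence, which is the kind of sequential information a sentence-by-sentence classifier does not use by default.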
