Dynamic summarization of bibliographic-based data

Background: Traditional information retrieval techniques typically return excessive output when directed at large bibliographic databases. Natural language processing applications strive to extract salient content from this excess of data. Semantic MEDLINE, a National Library of Medicine (NLM) natural language processing application, highlights relevant information in PubMed data. However, Semantic MEDLINE relies on manually coded schemas, which accommodate few information needs. Currently, there are only five such schemas, while many more would be needed to realistically accommodate all potential users. The aim of this project was to develop and evaluate a statistical algorithm that automatically identifies relevant bibliographic data; the new algorithm could be incorporated into a dynamic schema to accommodate various information needs in Semantic MEDLINE and eliminate the need for multiple schemas.

Methods: We developed a flexible algorithm named Combo that combines three statistical metrics, the Kullback-Leibler Divergence (KLD), Riloff's RlogF metric (RlogF), and a new metric called PredScal, to automatically identify salient data in bibliographic text. We downloaded citations from a PubMed search query addressing the genetic etiology of bladder cancer. The citations were processed with SemRep, an NLM rule-based application that produces semantic predications. SemRep output was processed by Combo, by the standard Semantic MEDLINE genetics schema, and independently by each of the individual KLD and RlogF metrics. We evaluated each summarization method using an existing reference standard within the task-based context of genetic database curation.

Results: Combo asserted 74 genetic entities implicated in bladder cancer development, whereas the traditional schema asserted 10 genetic entities; the KLD and RlogF metrics individually asserted 77 and 69 genetic entities, respectively. Combo achieved 61% recall and 81% precision, with an F-score of 0.69. The traditional schema achieved 23% recall and 100% precision, with an F-score of 0.37. The KLD metric achieved 61% recall and 70% precision, with an F-score of 0.65. The RlogF metric achieved 61% recall and 72% precision, with an F-score of 0.66.

Conclusions: Semantic MEDLINE summarization using the new Combo algorithm outperformed a conventional summarization schema in a genetic database curation task. It could potentially streamline information acquisition for other needs without requiring multiple hand-built saliency schemas.
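
As a rough illustration of the two established metrics named in Methods and of the F-score reported in Results, the following Python sketch uses their standard textbook definitions. The function names, the base-2 logarithms, and the example values are assumptions made for illustration only. The abstract does not define PredScal or say how Combo combines the three metrics, so neither is shown here.

```python
import math


def kld_term(p: float, q: float) -> float:
    """Contribution of one item to the Kullback-Leibler divergence D(P || Q), in bits.

    p is the item's probability in the topic-specific text and q its
    probability in a general reference text (both are assumed inputs).
    """
    if p == 0.0:
        return 0.0          # by convention, 0 * log(0 / q) is taken as 0
    if q == 0.0:
        return math.inf     # item never seen in the reference distribution
    return p * math.log2(p / q)


def rlogf(relevant_freq: int, total_freq: int) -> float:
    """Riloff's RlogF score: (F / N) * log2(F).

    F is how often the item occurs in relevant text and N is how often
    it occurs overall.
    """
    if relevant_freq == 0 or total_freq == 0:
        return 0.0
    return (relevant_freq / total_freq) * math.log2(relevant_freq)


def f_score(precision: float, recall: float) -> float:
    """Balanced F-score: harmonic mean of precision and recall."""
    if precision + recall == 0.0:
        return 0.0
    return 2.0 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Hypothetical inputs, chosen only to make the arithmetic easy to follow.
    print(kld_term(0.5, 0.25))
    print(rlogf(8, 16))
    # Precision and recall of the traditional schema, as reported in Results.
    print(round(f_score(1.00, 0.23), 2))
```

For example, `f_score(1.00, 0.23)` rounds to 0.37, which agrees with the F-score reported for the traditional schema.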
