Extractive text summarization system to aid data extraction from full text in systematic review development

OBJECTIVES Extracting data from publication reports is a standard step in systematic review (SR) development. However, data extraction still relies heavily on manual effort, which is slow, costly, and subject to human error. In this study, we developed a text summarization system aimed at improving productivity and reducing errors in the traditional data extraction process.

METHODS We developed a computer system that uses machine learning and natural language processing approaches to automatically generate summaries of full-text scientific publications. The summaries were evaluated at the sentence and fragment levels on how well they surfaced common clinical SR data elements such as sample size, group size, and PICO values. We compared the computer-generated summaries with human-written summaries (title and abstract) in terms of the presence of the information needed for data extraction, as presented in the study characteristics tables of Cochrane reviews.

RESULTS At the sentence level, the computer-generated summaries covered more of the information needed for systematic reviews than the human-written summaries did (recall 91.2% vs. 83.8%, p < 0.001). They also had a higher density of relevant sentences (precision 59% vs. 39%, p < 0.001). At the fragment level, the ensemble approach combining rule-based, concept mapping, and dictionary-based methods performed better than any individual method alone, achieving an F-measure of 84.7%.

CONCLUSION Computer-generated summaries are a potential alternative information source for data extraction in systematic review development. Machine learning and natural language processing are promising approaches for developing such an extractive summarization system.
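The sentence-level metrics reported above can be illustrated with a minimal sketch. It assumes each sentence in a publication has an integer identifier and that a reference set of relevant sentences is available; the function name `sentence_level_scores` and the toy data are hypothetical and are not taken from the study.

```python
def sentence_level_scores(summary_ids, relevant_ids):
    """Return (precision, recall, F-measure) for a set of selected sentences.

    summary_ids  -- identifiers of sentences included in the summary
    relevant_ids -- identifiers of sentences that contain needed data elements
    """
    summary, relevant = set(summary_ids), set(relevant_ids)
    hits = len(summary & relevant)  # relevant sentences the summary kept

    # Precision: share of summary sentences that are relevant (density).
    precision = hits / len(summary) if summary else 0.0
    # Recall: share of relevant sentences that the summary covers.
    recall = hits / len(relevant) if relevant else 0.0
    # F-measure: harmonic mean of precision and recall.
    total = precision + recall
    f_measure = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_measure


# Hypothetical toy example: a five-sentence summary, four relevant sentences.
summary_ids = [2, 5, 7, 11, 14]
relevant_ids = [2, 5, 7, 9]
precision, recall, f_measure = sentence_level_scores(summary_ids, relevant_ids)
print(f"precision={precision:.2f} recall={recall:.2f} F={f_measure:.2f}")
```

In this toy case three of the five summary sentences are relevant and three of the four relevant sentences are covered, so precision is 0.60 and recall is 0.75. A summary can raise recall simply by including more sentences, which is why the abstract reports both figures.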
