PDF text classification to leverage information extraction from publication reports

OBJECTIVES Data extraction from original study reports is a time-consuming, error-prone process in systematic review development. Information extraction (IE) systems have the potential to assist humans in the extraction task; however, the majority of IE systems were not designed to work on Portable Document Format (PDF) documents, an important and common extraction source for systematic reviews. In a PDF document, narrative content is often mixed with publication metadata or semi-structured text, which adds challenges for the underlying natural language processing algorithms. Our goal is to categorize PDF texts for strategic use by IE systems. METHODS We used an open-source tool to extract raw texts from a PDF document and developed a text classification algorithm that follows a multi-pass sieve framework to automatically classify PDF text snippets (for brevity, texts) into TITLE, ABSTRACT, BODYTEXT, SEMISTRUCTURE, and METADATA categories. To validate the algorithm, we developed a gold standard of PDF reports that were included in the development of previous systematic reviews by the Cochrane Collaboration. In a two-step procedure, we evaluated (1) the classification performance of the algorithm, compared with machine learning classifiers, and (2) the effects of the algorithm on an IE system that extracts clinical outcome mentions. RESULTS The multi-pass sieve algorithm achieved an accuracy of 92.6%, which was 9.7% (p<0.001) higher than the best-performing machine learning classifier, which used a logistic regression algorithm. F-measure improvements were observed in the classification of TITLE (+15.6%), ABSTRACT (+54.2%), BODYTEXT (+3.7%), SEMISTRUCTURE (+34%), and METADATA (+14.2%). In addition, using the algorithm to filter out semi-structured texts and publication metadata improved the performance of the outcome extraction system (F-measure +4.1%, p=0.002). It also reduced the number of sentences to be processed by 44.9% (p<0.001), which corresponds to a processing time reduction of 50% (p=0.005). CONCLUSIONS The rule-based multi-pass sieve framework can be used effectively to categorize texts extracted from PDF documents. Text classification is an important prerequisite step to leverage information extraction from PDF documents.
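As a rough illustration of the multi-pass sieve idea named in the methods, the sketch below applies an ordered list of rule-based passes to the text snippets, with each pass labeling only the snippets that earlier passes left unlabeled. The individual rules, their ordering, the regular expressions, and the function names (`metadata_sieve`, `classify`, and so on) are assumptions made for illustration; the abstract does not specify the authors' actual rules. Only the five category names come from the abstract.

```python
import re
from typing import Callable, List, Optional

# A sieve inspects snippet i (plus the labels assigned so far) and
# returns a category name, or None to leave the snippet for later passes.
Sieve = Callable[[int, List[str], List[Optional[str]]], Optional[str]]


def metadata_sieve(i, texts, labels):
    # Hypothetical rule: DOIs, copyright notices, and e-mail addresses.
    if re.search(r"\bdoi\s*:|©|copyright|@\w+\.", texts[i], re.IGNORECASE):
        return "METADATA"
    return None


def semistructure_sieve(i, texts, labels):
    # Hypothetical rule: table/figure captions, or mostly numeric rows.
    text = texts[i]
    if re.match(r"\s*(table|fig(ure)?\.?)\s+\d+", text, re.IGNORECASE):
        return "SEMISTRUCTURE"
    tokens = text.split()
    numeric = sum(1 for tok in tokens if re.fullmatch(r"[\d.,%()±<>=-]+", tok))
    if tokens and numeric / len(tokens) > 0.5:
        return "SEMISTRUCTURE"
    return None


def abstract_sieve(i, texts, labels):
    # Hypothetical rule: snippet opens with the word "Abstract".
    if re.match(r"\s*abstract\b", texts[i], re.IGNORECASE):
        return "ABSTRACT"
    return None


def title_sieve(i, texts, labels):
    # Hypothetical rule: the first short, unterminated snippet that
    # appears before any snippet already labeled ABSTRACT.
    if "TITLE" in labels or "ABSTRACT" in labels[:i]:
        return None
    text = texts[i].rstrip()
    if len(text.split()) <= 25 and not text.endswith("."):
        return "TITLE"
    return None


# Passes run in this order; earlier passes are assumed to be more precise.
SIEVES: List[Sieve] = [metadata_sieve, semistructure_sieve,
                       abstract_sieve, title_sieve]


def classify(texts: List[str], sieves: List[Sieve] = SIEVES) -> List[str]:
    labels: List[Optional[str]] = [None] * len(texts)
    for sieve in sieves:
        for i in range(len(texts)):
            if labels[i] is None:  # never overwrite an earlier pass
                labels[i] = sieve(i, texts, labels)
    # Final catch-all pass: anything still unlabeled is narrative text.
    return [label or "BODYTEXT" for label in labels]


snippets = [
    "J Clin Epidemiol 2015; doi:10.1000/example",
    "Text classification for PDF reports",
    "Abstract: Data extraction is a time-consuming process.",
    "Table 2 Baseline characteristics",
    "We evaluated the algorithm on reports included in previous reviews.",
]
print(classify(snippets))
# ['METADATA', 'TITLE', 'ABSTRACT', 'SEMISTRUCTURE', 'BODYTEXT']
```

Because each pass sees the labels produced by the passes before it, a later rule (here, the title rule) can rely on context that a single-pass classifier would not have. A downstream IE system could then keep only the snippets labeled TITLE, ABSTRACT, or BODYTEXT, which is the kind of filtering the results describe.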
