CREGEX: A Biomedical Text Classifier Based on Automatically Generated Regular Expressions

High-accuracy text classifiers are now used to organize large volumes of biomedical information and to support clinical decision-making. In medical informatics, regular expression-based classifiers have emerged as an alternative to traditional discriminative classification algorithms because they can model sequential patterns. This article presents CREGEX (Classifier Regular Expression), a biomedical text classifier built on an automatically generated feature space of regular expressions. We devised an algorithm that automatically constructs an informative and discriminative regular-expression feature space suitable for binary and multiclass discrimination problems. Regular expressions are generated from training texts using a coarse-to-fine text alignment method that balances coverage of the lexical variants of words (gender and grammatical number) against producing a feature space with a large number of noisy features. CREGEX performs feature selection by filtering keywords and computes a confidence metric to classify test texts. Three de-identified datasets in Spanish, containing information on smoking habits, obesity, and obesity types, were used to assess the performance of CREGEX. For comparison, Support Vector Machine (SVM) and Naïve Bayes (NB) supervised classifiers were also trained, using consecutive sequences of tokens (n-grams) as features. Results show that, on all the evaluation datasets, CREGEX not only outperformed both the SVM and NB classifiers in accuracy and F-measure (p-value < 0.05) but also required fewer training examples to reach the same performance. This superior performance is attributed to the ability of regular expressions to represent complex text patterns.
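As a rough illustration of how a regular expression might be derived by aligning two training texts, the following Python sketch keeps the tokens shared by both texts and replaces unaligned stretches with a bounded wildcard. It is not the CREGEX algorithm: the use of `difflib.SequenceMatcher` for alignment, the function names `align_to_regex` and `regex_features`, the `max_gap` parameter, and the example sentences are all assumptions made for illustration, and the sketch omits the coarse-to-fine handling of lexical variants, keyword filtering, and the confidence metric.

```python
import re
from difflib import SequenceMatcher
from typing import List, Optional


def align_to_regex(text_a: str, text_b: str, max_gap: int = 3) -> Optional[str]:
    """Hypothetical helper: align two texts token by token and return a
    regex that keeps shared tokens and allows up to `max_gap` arbitrary
    tokens wherever the texts differ. Returns None if nothing is shared."""
    a, b = text_a.lower().split(), text_b.lower().split()
    gap = rf"(?:\S+\s+){{0,{max_gap}}}"  # up to max_gap whitespace-delimited tokens
    items: List[Optional[str]] = []      # escaped tokens; None marks a gap
    matcher = SequenceMatcher(a=a, b=b, autojunk=False)
    for tag, i1, i2, _, _ in matcher.get_opcodes():
        if tag == "equal":
            items.extend(re.escape(tok) for tok in a[i1:i2])
        elif items and items[-1] is not None:
            items.append(None)           # leading gaps are skipped
    if items and items[-1] is None:
        items.pop()                      # trailing gaps are dropped
    if not items:
        return None
    pieces = []
    for idx, item in enumerate(items):
        if item is None:
            pieces.append(gap)
        else:
            pieces.append(item)
            if idx < len(items) - 1:
                pieces.append(r"\s+")
    # Lookarounds anchor the pattern on whitespace-delimited token boundaries.
    return r"(?<!\S)" + "".join(pieces) + r"(?!\S)"


def regex_features(text: str, patterns: List[str]) -> List[int]:
    """Binary feature vector: 1 if the pattern matches the text, else 0."""
    lowered = text.lower()
    return [1 if re.search(p, lowered) else 0 for p in patterns]


pattern = align_to_regex(
    "paciente fumador de 20 cigarrillos",
    "paciente fumador activo de 10 cigarrillos",
)
print(pattern)
print(regex_features("Paciente fumador ocasional de 5 cigarrillos al dia", [pattern]))
```

A feature vector produced this way could be passed to any standard classifier; the bounded wildcard is one simple way to limit how much unseen text a pattern may absorb, which is the kind of trade-off between generality and noise that the abstract describes.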
