Machine Learning for Biomedical Literature Triage

This paper presents a machine learning system that supports triage, the first task in the manual curation of biological literature. We compare the performance of classification models built by varying dataset sampling factors, feature sets, and the learning algorithm (Naive Bayes, Support Vector Machine, and Logistic Model Trees). The results show that the model best suited to the imbalanced datasets of the triage classification task combines domain-relevant features, under-sampling, and the Logistic Model Trees algorithm.
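
As an illustration of the under-sampling idea mentioned above, the following is a minimal sketch of random under-sampling for a two-class triage dataset. It is not the system described in the paper: the function name `undersample`, the parameter `majority_per_minority` (standing in for a sampling factor), the class names, and the toy data are all assumptions made for this example.

```python
import random
from collections import Counter


def undersample(docs, labels, majority_per_minority=1.0, seed=0):
    """Randomly drop majority-class items until the majority class holds at
    most `majority_per_minority` times as many items as the minority class.

    Hypothetical helper for illustration; not taken from the paper.
    """
    counts = Counter(labels)
    if len(counts) != 2:
        raise ValueError("expected exactly two classes")
    (majority, n_majority), (minority, n_minority) = counts.most_common(2)

    # Number of majority items to keep, capped at what is available.
    keep_majority = min(n_majority, int(round(n_minority * majority_per_minority)))

    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    majority_idx = [i for i, y in enumerate(labels) if y == majority]
    kept = set(rng.sample(majority_idx, keep_majority))

    # Every minority item is kept.
    kept.update(i for i, y in enumerate(labels) if y == minority)

    order = sorted(kept)  # preserve the original document order
    return [docs[i] for i in order], [labels[i] for i in order]


# Toy, made-up data: 2 relevant versus 8 irrelevant abstracts.
docs = [f"abstract {i}" for i in range(10)]
labels = ["relevant"] * 2 + ["irrelevant"] * 8

sampled_docs, sampled_labels = undersample(docs, labels)
print(Counter(sampled_labels))
```

With the default factor of 1.0 the two classes end up the same size; a larger factor keeps more of the majority class. The resampled documents would then be passed to whichever classifier is being evaluated.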
