A Software Framework for Building Biomedical Machine Learning Classifiers through Grid Computing Resources

This paper describes the BiomedTK software framework, created to perform massive explorations of machine learning classifier configurations for biomedical data analysis over distributed Grid computing resources. BiomedTK integrates ROC analysis throughout the complete classifier construction process and enables large parameter sweeps for training third-party classifiers such as artificial neural networks and support vector machines, harnessing the computing power provided by Grid infrastructures. In addition, it includes classifiers modified by the authors for ROC optimization, as well as functionality to build ensemble classifiers and to manipulate datasets (import/export, extract and transform data, etc.). BiomedTK was experimentally validated by training thousands of classifier configurations on representative biomedical UCI datasets, reaching in little time classification levels comparable to those reported in the existing literature. The comprehensive method presented here improves biomedical data analysis in both methodology and the potential reach of machine-learning-based experimentation.
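
The idea of a parameter sweep scored by ROC analysis can be illustrated with a minimal Python sketch. It is not BiomedTK code: the function names (`auc`, `sweep`, `toy_scorer`), the toy data, and the single parameter `w` are all assumptions made for illustration. Each configuration is evaluated independently of the others, which is the property that would let such a sweep be split into separate jobs on distributed resources.

```python
from itertools import product


def auc(scores, labels):
    """Area under the ROC curve, computed as the fraction of
    (positive, negative) pairs ranked correctly; ties count as half."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def sweep(grid, score_fn, labels):
    """Evaluate every configuration in the grid and return a list of
    (auc, config) tuples sorted from best to worst AUC."""
    names = sorted(grid)
    results = []
    for values in product(*(grid[name] for name in names)):
        config = dict(zip(names, values))
        results.append((auc(score_fn(config), labels), config))
    results.sort(key=lambda r: r[0], reverse=True)
    return results


# Hypothetical toy data: two features per sample, binary labels.
X = [(0.9, 0.2), (0.8, 0.7), (0.3, 0.9), (0.2, 0.1), (0.6, 0.4), (0.4, 0.6)]
y = [1, 1, 0, 0, 1, 0]


def toy_scorer(config):
    """Stand-in for training a classifier: a fixed linear score whose
    only parameter is the weight w given to the first feature."""
    w = config["w"]
    return [w * x1 + (1.0 - w) * x2 for x1, x2 in X]


results = sweep({"w": [0.0, 0.5, 1.0]}, toy_scorer, y)
best_auc, best_config = results[0]
print(best_auc, best_config)
```

In a real setting `toy_scorer` would be replaced by training and validating an actual classifier for each configuration, and the AUC would be estimated on held-out data rather than on the data used for fitting.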
