Comparative Analysis of Data Mining Tools and Classification Techniques using WEKA in Medical Bioinformatics

The availability of huge amounts of data has created a great need for data mining techniques that can generate useful knowledge. In the present study we provide detailed information about data mining techniques, with a focus on classification as an important supervised learning technique. We also discuss the WEKA software as a tool of choice for performing classification analysis on different kinds of available data. A detailed methodology is provided to make the software accessible to a wide range of users. The main features of WEKA are 49 data preprocessing tools, 76 classification/regression algorithms, 8 clustering algorithms, 3 algorithms for finding association rules, and 15 attribute/subset evaluators plus 10 search algorithms for feature selection. WEKA extracts useful information from data and makes it possible to identify a suitable algorithm for generating an accurate predictive model from it. Moreover, medical bioinformatics analyses have been performed to illustrate the use of WEKA in the diagnosis of leukemia.
