Classification Techniques and Data Mining Tools Used in Medical Bioinformatics

The growing volume of data, together with the increased availability of information, mandates the use of data mining techniques to extract useful information from datasets. This chapter describes data mining techniques, with special emphasis on classification as an important supervised learning technique. It discusses bioinformatics tools for medical applications, especially in medical microbiology, and presents the WEKA software as a tool of choice for performing classification analysis on different kinds of available data. Uses of WEKA data mining tools for biological applications such as genomic analysis and for medical applications such as diabetes are discussed. For infectious diseases, data mining offers novel tools that can help identify the pathogen and analyze its drug resistance pattern. For non-communicable diseases such as diabetes, it provides options for analyzing large volumes of data from many clinical studies.
