Genome data classification based on fuzzy matching

Genomic data mining and knowledge extraction are important problems in bioinformatics. Some prior work addresses unknown genome identification and is based on exact pattern matching of n-grams. In most real-world biological problems, however, exact matching may not give the desired results, and the number of possible n-grams grows exponentially with n. In this paper we propose a method for genome data classification based on approximate matching. The algorithm works by selecting random samples from the genome database. Tolerance is allowed by generating candidates of varied length to query from these sample sequences. The Levenshtein distance is then computed for each candidate to check for k-fuzzy equality, and the total number of fuzzy matches for each sequence is counted. The resulting match counts are then classified using the data mining techniques naive Bayes, support vector machine, back propagation, and nearest neighbor. Experimental results are provided for different tolerance levels and show that classification accuracy increases with tolerance. We also examine the effect of sample size and observe that classification accuracy increases with it. Genome data from two species, yeast and E. coli, are used to verify the proposed method.
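The matching step described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names (`levenshtein`, `k_fuzzy_equal`, `draw_samples`, `count_fuzzy_matches`, `fuzzy_match_features`) are hypothetical, and the sketch assumes that two strings are k-fuzzily equal when their Levenshtein distance is at most k, and that candidates range in length from m - k to m + k for a sample of length m. The abstract does not spell out either detail.

```python
import random


def levenshtein(a: str, b: str) -> int:
    """Edit distance between a and b (standard two-row dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def k_fuzzy_equal(a: str, b: str, k: int) -> bool:
    """Assumed definition: equal up to at most k edit operations."""
    return levenshtein(a, b) <= k


def draw_samples(genome: str, sample_len: int, n_samples: int,
                 rng: random.Random) -> list:
    """Pick n_samples random substrings of length sample_len from genome."""
    last_start = len(genome) - sample_len
    return [genome[s:s + sample_len]
            for s in (rng.randrange(last_start + 1) for _ in range(n_samples))]


def count_fuzzy_matches(sequence: str, sample: str, k: int) -> int:
    """Count start positions in sequence where some candidate of length
    len(sample) - k .. len(sample) + k is k-fuzzily equal to sample."""
    m = len(sample)
    count = 0
    for start in range(len(sequence)):
        for length in range(max(1, m - k), m + k + 1):
            candidate = sequence[start:start + length]
            if len(candidate) < length:
                break  # ran past the end of the sequence
            if k_fuzzy_equal(candidate, sample, k):
                count += 1
                break  # count each start position at most once
    return count


def fuzzy_match_features(sequence: str, samples: list, k: int) -> list:
    """One fuzzy-match count per sample: a feature vector for a classifier."""
    return [count_fuzzy_matches(sequence, s, k) for s in samples]
```

Under these assumptions, `fuzzy_match_features` yields one numeric vector per sequence, which could then be passed to any of the classifiers named in the abstract. The nested loops are quadratic per candidate and are meant only to make the idea concrete, not to scale to whole genomes.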
