Genome data classification based on fuzzy matching

Genomic data mining and knowledge extraction is an important problem in bioinformatics. Some existing work on unknown genome identification is based on exact pattern matching of n-grams. In most real-world biological problems, however, exact matching may not give the desired results, and the number of possible n-grams grows exponentially with n. In this paper we propose a method for genome data classification based on approximate matching. The algorithm selects random samples from the genome database. Tolerance is introduced by generating candidates of varying length to query from these sample sequences. The Levenshtein distance is then computed for each candidate to check whether it is k-fuzzily equal. The total number of fuzzy matches for each sequence is then calculated. The sequences are then classified on the basis of these counts using the data mining techniques naive Bayes, support vector machine, back propagation, and nearest neighbor. Experimental results are provided for different tolerance levels and show that accuracy increases with tolerance. We also show the effect of sampling size on classification accuracy: accuracy was observed to increase with sampling size. Genome data from two species, yeast and E. coli, are used to verify the proposed method.
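The fuzzy-match counting step described above can be sketched in Python as follows. This is a minimal illustration and not the authors' implementation: it assumes that two strings are k-fuzzily equal when their Levenshtein distance is at most k, that candidates are the substrings of a sequence whose length is within k of the sampled pattern's length, and that the feature vector passed to a classifier holds one match count per sampled pattern. The function names (`levenshtein`, `k_fuzzy_equal`, `candidates`, `fuzzy_match_count`, `sample_patterns`, `feature_vector`) are hypothetical.

```python
import random


def levenshtein(a: str, b: str) -> int:
    """Edit distance with unit-cost insertion, deletion and substitution."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def k_fuzzy_equal(a: str, b: str, k: int) -> bool:
    """Assumed definition: equal up to at most k edit operations."""
    return levenshtein(a, b) <= k


def candidates(sequence: str, length: int, k: int):
    """Yield substrings of `sequence` whose length is within k of `length`."""
    for size in range(max(1, length - k), length + k + 1):
        for start in range(len(sequence) - size + 1):
            yield sequence[start:start + size]


def fuzzy_match_count(sequence: str, pattern: str, k: int) -> int:
    """Count candidates in `sequence` that are k-fuzzily equal to `pattern`."""
    return sum(1 for c in candidates(sequence, len(pattern), k)
               if k_fuzzy_equal(c, pattern, k))


def sample_patterns(database, n_samples: int, length: int, seed: int = 0):
    """Draw random fixed-length patterns from the sequences in `database`.

    Assumes every sequence in `database` has at least `length` characters.
    """
    rng = random.Random(seed)
    patterns = []
    for _ in range(n_samples):
        seq = rng.choice(database)
        start = rng.randrange(len(seq) - length + 1)
        patterns.append(seq[start:start + length])
    return patterns


def feature_vector(sequence: str, patterns, k: int):
    """One fuzzy-match count per sampled pattern (assumed feature layout)."""
    return [fuzzy_match_count(sequence, p, k) for p in patterns]
```

Under these assumptions, the vectors returned by `feature_vector` would serve as input to any of the classifiers named in the abstract. Raising `k` can only add matches for a given pattern, and it also increases the number of candidates that must be checked, so this brute-force version is suitable only for short illustrative sequences.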
