Method of Fuzzy Matching Feature Extraction and Clustering Genome Data

Cluster analysis divides data into meaningful, useful groups. Sequence clustering supports the assessment of evolutionary relationships among genes and species, so clustering methods must carry out this identification accurately and quickly. This paper proposes a feature extraction method based on fuzzy matching and uses the extracted features to cluster genome data. Given a database of genome sequences, the proposed method generates candidates whose length equals that of the query, counts the candidates that approximately match the query within a given fault tolerance, and uses this total number of matches as the feature for clustering. The Fuzzy C-Means algorithm is used to cluster the genome data. Genome data from two species, yeast and E. coli, are used to verify the proposed method.
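
The pipeline described above can be illustrated with a minimal sketch. It is not the authors' implementation. The abstract does not say how mismatches are measured, so the sketch assumes that candidates are sliding windows over the sequence and that fault tolerance means a maximum Hamming distance. The function names (`count_fuzzy_matches`, `extract_features`, `fuzzy_c_means`), the fuzzifier default `m = 2.0`, and the fixed iteration count are all hypothetical choices made for illustration.

```python
from typing import List, Sequence, Tuple


def count_fuzzy_matches(sequence: str, query: str, max_mismatches: int) -> int:
    """Count windows of len(query) that differ from query in at most
    max_mismatches positions (Hamming distance; an assumption)."""
    m = len(query)
    count = 0
    for start in range(len(sequence) - m + 1):
        candidate = sequence[start:start + m]  # candidate of query length
        mismatches = sum(1 for a, b in zip(candidate, query) if a != b)
        if mismatches <= max_mismatches:
            count += 1
    return count


def extract_features(sequences: Sequence[str], queries: Sequence[str],
                     max_mismatches: int) -> List[List[float]]:
    """One feature vector per sequence: the match count for each query."""
    return [[float(count_fuzzy_matches(s, q, max_mismatches)) for q in queries]
            for s in sequences]


def fuzzy_c_means(points: Sequence[Sequence[float]],
                  initial_centers: Sequence[Sequence[float]],
                  m: float = 2.0,
                  iterations: int = 50) -> Tuple[List[List[float]], List[List[float]]]:
    """Plain Fuzzy C-Means with squared Euclidean distance.
    Returns (memberships, centers); memberships[i][j] is the degree to
    which point i belongs to cluster j, and each row sums to 1."""
    centers = [list(c) for c in initial_centers]
    dim = len(points[0])
    memberships: List[List[float]] = []
    for _ in range(iterations):
        # Membership update: u_ij is proportional to d_ij^(-2/(m-1)).
        memberships = []
        for p in points:
            d2 = [sum((x - y) ** 2 for x, y in zip(p, c)) for c in centers]
            if any(d == 0.0 for d in d2):
                # Point sits exactly on a center: share full membership
                # among the coinciding centers to avoid division by zero.
                hits = [1.0 if d == 0.0 else 0.0 for d in d2]
                total = sum(hits)
                memberships.append([h / total for h in hits])
            else:
                inv = [d ** (-1.0 / (m - 1.0)) for d in d2]
                total = sum(inv)
                memberships.append([v / total for v in inv])
        # Center update: mean of the points weighted by u_ij^m.
        for j in range(len(centers)):
            weights = [u[j] ** m for u in memberships]
            wsum = sum(weights)
            if wsum > 0.0:
                centers[j] = [sum(w * p[k] for w, p in zip(weights, points)) / wsum
                              for k in range(dim)]
    return memberships, centers
```

In this sketch, `extract_features(sequences, queries, k)` would supply the points passed to `fuzzy_c_means`, with one initial center per expected cluster (for example, two for the yeast and E. coli data). Edit distance could replace Hamming distance if insertions and deletions should also count as tolerated faults.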
