Clustering genome data based on approximate matching

Genome data mining and knowledge extraction are important problems in bioinformatics. Prior work has addressed genome identification based on exact matching of n-grams. In most real-world biological problems, however, an exact match may not be feasible, so approximate matching may be preferable. The problem with using n-grams is that the number of features 4
