Genome sequence clustering using hybrid method: Self-organizing map and frequent max substring techniques

This paper proposes a genome sequence clustering method that combines two techniques, the self-organizing map (SOM) and the frequent max substring technique, to improve the efficiency of information retrieval. The proposed method appears to be a promising alternative for clustering large numbers of genome sequences in large sequence databases. To illustrate it, an experiment on clustering genome sequences is presented. First, the frequent max substring technique is applied to enumerate interesting patterns, called frequent max substrings, from the genome sequences. Then, these frequent max substrings are used as terms, together with their frequencies, to form a sequence vector for each genome sequence. Finally, the SOM is applied to these sequence vectors to generate a cluster map. The resulting cluster map shows which genome sequences are similar to one another and which belong to different groups.
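The three steps described in the abstract can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: `frequent_substrings` is a simplified stand-in for the frequent max substring technique (it counts fixed-length-range substrings and drops any that a longer substring covers with the same count), and the SOM is a small hand-written grid. All function names, parameters (`min_len`, `max_len`, `min_count`, grid size, learning rate), and the toy sequences are hypothetical.

```python
import math
import random
from collections import Counter


def frequent_substrings(sequences, min_len=2, max_len=4, min_count=2):
    """Simplified stand-in for frequent max substring extraction (assumption).

    Counts every substring of length min_len..max_len over all sequences,
    keeps those seen at least min_count times, then drops any substring that
    is contained in a longer frequent substring with the same count.
    """
    counts = Counter()
    for seq in sequences:
        for n in range(min_len, max_len + 1):
            for i in range(len(seq) - n + 1):
                counts[seq[i:i + n]] += 1
    frequent = {s: c for s, c in counts.items() if c >= min_count}
    maximal = [
        s for s, c in frequent.items()
        if not any(s != t and s in t and frequent[t] == c for t in frequent)
    ]
    return sorted(maximal)


def sequence_vector(seq, terms):
    """Term-frequency vector: overlapping occurrence count of each term."""
    return [
        sum(1 for i in range(len(seq) - len(t) + 1) if seq.startswith(t, i))
        for t in terms
    ]


def best_matching_unit(weights, vec):
    """Grid node whose weight vector is closest (squared Euclidean) to vec."""
    return min(
        weights,
        key=lambda node: sum((w - x) ** 2 for w, x in zip(weights[node], vec)),
    )


def train_som(vectors, rows=2, cols=2, epochs=50, lr0=0.5, radius0=1.0, seed=0):
    """Train a tiny rectangular SOM with a Gaussian neighbourhood."""
    rng = random.Random(seed)
    dim = len(vectors[0])
    weights = {
        (r, c): [rng.random() for _ in range(dim)]
        for r in range(rows) for c in range(cols)
    }
    for epoch in range(epochs):
        frac = 1.0 - epoch / epochs          # linear decay of rate and radius
        lr = lr0 * frac
        radius = max(radius0 * frac, 1e-3)
        for vec in vectors:
            bmu = best_matching_unit(weights, vec)
            for node, w in weights.items():
                grid_d2 = (node[0] - bmu[0]) ** 2 + (node[1] - bmu[1]) ** 2
                h = math.exp(-grid_d2 / (2 * radius ** 2))
                for k in range(dim):
                    w[k] += lr * h * (vec[k] - w[k])
    return weights


# Toy sequences (hypothetical data, for illustration only).
sequences = ["ACGTACGTAC", "ACGTACGTAC", "TTGGCCTTGG", "TTGGCCTTGA"]
terms = frequent_substrings(sequences)
vectors = [sequence_vector(s, terms) for s in sequences]
som = train_som(vectors)
cluster_map = {s: best_matching_unit(som, v) for s, v in zip(sequences, vectors)}
```

Here `cluster_map` assigns each sequence to a grid node; sequences with identical term-frequency vectors always land on the same node, and nearby nodes hold similar vectors. A real application would likely normalise the vectors and use a larger grid, but those choices are outside what the abstract specifies.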
