The method of N-grams in large-scale clustering of DNA texts

This paper addresses techniques for clustering texts by comparing their vocabularies of N-grams. In contrast to the regular N-gram approach, which counts only exact occurrences, the proposed method counts imperfect occurrences of each N-gram in a text, that is, occurrences that match up to a given number of mismatches. We demonstrate that this approach substantially improves the resolving power of the N-gram method for DNA texts. Additionally, we discuss a scheme for the joint use of different types of clustering techniques to verify partition quality.
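The idea of counting imperfect occurrences can be illustrated with a minimal sketch. It assumes that a "mismatch" means a substituted character (Hamming distance) and that texts are compared by the Euclidean distance between normalized count vectors; the abstract specifies neither choice, so both are assumptions. The names `imperfect_counts` and `spectrum_distance` are hypothetical and do not come from the paper.

```python
from itertools import product
from math import sqrt

ALPHABET = "ACGT"  # DNA alphabet; an assumption for this sketch


def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def imperfect_counts(text, n, max_mismatches):
    """For every N-gram over ALPHABET, count the length-n windows of `text`
    that differ from it in at most `max_mismatches` positions.
    With max_mismatches=0 this reduces to the regular exact N-gram count."""
    windows = [text[i:i + n] for i in range(len(text) - n + 1)]
    counts = {}
    for gram in map("".join, product(ALPHABET, repeat=n)):
        counts[gram] = sum(hamming(gram, w) <= max_mismatches for w in windows)
    return counts


def spectrum_distance(a, b, n=3, max_mismatches=1):
    """Euclidean distance between the normalized imperfect-count vectors
    of two texts (an illustrative choice, not the paper's stated measure)."""
    ca = imperfect_counts(a, n, max_mismatches)
    cb = imperfect_counts(b, n, max_mismatches)
    ta = sum(ca.values()) or 1  # avoid division by zero for very short texts
    tb = sum(cb.values()) or 1
    return sqrt(sum((ca[g] / ta - cb[g] / tb) ** 2 for g in ca))
```

This brute-force version compares every N-gram with every window, so it is only practical for small N. A pairwise matrix of such distances could then be passed to any standard clustering routine.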
