Paper information: The method of N-grams in large-scale clustering of DNA texts

This paper addresses techniques for clustering texts by comparing their N-gram vocabularies. In contrast to the regular N-gram approach, the proposed method counts imperfect occurrences of each N-gram in a text, that is, occurrences that match up to a given number of mismatches. We demonstrate that this approach substantially improves the resolving power of the N-gram method for DNA texts. Additionally, we discuss a scheme for using different types of clustering techniques together to verify the quality of a partition.
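
To make the idea of imperfect occurrences concrete, the following is a minimal sketch, not the authors' implementation. It assumes that a mismatch means a substituted character (Hamming distance), that the alphabet is `ACGT`, and that the vocabulary is a plain mapping from N-gram to count. The function names `exact_vocabulary` and `imperfect_vocabulary` and the parameter `max_mismatches` are hypothetical and chosen only for illustration.

```python
from collections import Counter
from itertools import product


def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def exact_vocabulary(text, n):
    """Count exact occurrences of every N-gram window in the text."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def imperfect_vocabulary(text, n, max_mismatches, alphabet="ACGT"):
    """Count, for each possible N-gram over the alphabet, the windows of the
    text that match it with at most `max_mismatches` substitutions.

    Assumption: mismatches are substitutions only (no insertions/deletions).
    Brute force over all len(alphabet)**n N-grams, so only suitable for small n.
    """
    exact = exact_vocabulary(text, n)
    vocabulary = {}
    for gram in map("".join, product(alphabet, repeat=n)):
        count = sum(c for window, c in exact.items()
                    if hamming(gram, window) <= max_mismatches)
        if count:
            vocabulary[gram] = count
    return vocabulary


if __name__ == "__main__":
    text = "ACGT"
    print(exact_vocabulary(text, 2))
    print(imperfect_vocabulary(text, 2, max_mismatches=1))
```

With `max_mismatches=0` the sketch reduces to the regular exact-match vocabulary. Allowing mismatches lets N-grams that never occur exactly still receive a nonzero count, which gives a denser vocabulary to compare between texts.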
