Effective clustering of microRNA sequences by N-grams and feature weighting

MicroRNAs (miRNAs) are a class of small RNAs that act as important post-transcriptional regulators, working with the Argonaute family of proteins to regulate target mRNAs in animals, plants, and other organisms. Since the first miRNAs were identified in the early 1990s, tens of thousands of miRNAs have been identified experimentally or computationally. Currently, miRNA research focuses on the functions of single miRNAs, which usually result in gene silencing and repression. As the number of known miRNAs has grown rapidly, biologists have manually organized them into biologically meaningful families to facilitate further study. Because members of the same family tend to share similar biochemical functions, a high-quality family organization can shed light on the functions of uncharacterized miRNAs. However, manually grouping large numbers of miRNAs is both time-consuming and expensive. In this paper, we employ a clustering method with N-grams and feature weighting to automatically group miRNAs into separate clusters (families). Our method is evaluated on datasets constructed from the online miRNA database miRBase. Experimental results show that the clustering method can successfully distinguish most miRNA families, and that it outperforms the traditional K-means clustering algorithm and the average-link clustering approach.
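To make the N-gram representation concrete, the following is a minimal Python sketch, not the implementation evaluated in the paper. It assumes that each sequence is mapped to a vector of normalised N-gram frequencies over the alphabet A, C, G, U, and that feature weighting enters through a weighted squared Euclidean distance. The function names, the toy sequences, and the uniform weights are all hypothetical; the abstract does not specify how the weights are learned.

```python
from collections import Counter
from itertools import product

ALPHABET = "ACGU"  # RNA alphabet; any 'T' in the input is mapped to 'U'


def ngram_vocabulary(n):
    """All possible n-grams over the RNA alphabet, in a fixed order."""
    return ["".join(p) for p in product(ALPHABET, repeat=n)]


def ngram_vector(seq, n):
    """Normalised n-gram frequency vector of one sequence.

    Windows containing characters outside ACGU are ignored.
    """
    seq = seq.upper().replace("T", "U")
    counts = Counter(seq[i:i + n] for i in range(len(seq) - n + 1))
    vocab = ngram_vocabulary(n)
    total = sum(counts[g] for g in vocab)
    if total == 0:
        return [0.0] * len(vocab)
    return [counts[g] / total for g in vocab]


def weighted_sq_distance(x, y, weights):
    """Squared Euclidean distance with one weight per feature (assumed form)."""
    return sum(w * (a - b) ** 2 for w, a, b in zip(weights, x, y))


if __name__ == "__main__":
    # Toy sequences for illustration only; they are not real miRNAs.
    a = ngram_vector("ACGUACGUAC", 2)
    b = ngram_vector("ACGUACGUAG", 2)
    c = ngram_vector("UUUUGGGGCC", 2)
    uniform = [1.0] * len(a)  # placeholder weights, not learned
    print(weighted_sq_distance(a, b, uniform))
    print(weighted_sq_distance(a, c, uniform))
```

With uniform weights this reduces to the ordinary squared Euclidean distance used by K-means; a feature-weighting method would instead assign larger weights to the N-grams that best discriminate a given cluster.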
