Time Series Data Mining Algorithms for Identifying Short RNA in Arabidopsis thaliana

Short RNAs (sRNAs) are a class of molecules known to play a key role in gene regulation. They are typically sequences of 21 to 25 nucleotides. The identification, clustering and classification of sRNAs has recently become the focus of much research activity. The basic problem involves detecting regions of interest on the chromosome where the pattern of candidate matches is in some way unusual. Currently, there are no published algorithms for detecting regions of interest, and the unpublished methods that we are aware of involve bespoke rule-based systems designed for a specific organism. Work in this very new field has understandably focused on the outcomes rather than on the methods used to obtain them. In this paper we propose two generic approaches that place the specific biological problem in the wider context of time series data mining. Both are based on treating the occurrences on a chromosome, or "hit count" data, as a time series, then running a sliding window along the chromosome and measuring how unusual each window is. This formulation lets us treat the search for unusual areas of candidate RNA activity as a form of time series anomaly detection. The first approach is model based: we specify a null hypothesis distribution for a region not being an sRNA, then estimate p-values along the chromosome. The second approach is instance based: we identify some typical shapes from known sRNAs, then use dynamic time warping and a Fourier transform based distance to measure how closely a candidate series matches them. We demonstrate that these methods can find known sRNAs on Arabidopsis thaliana chromosomes and illustrate the benefits of the additional information these algorithms provide.
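As a rough illustration of the two ideas, the sketch below slides a fixed-width window over a hit count series and computes an upper-tail p-value for each window, then compares a candidate series to a template shape with dynamic time warping. It is a minimal sketch under stated assumptions, not the method of the paper: the abstract does not name the null distribution, so a Poisson null with a constant per-position rate is assumed here, and the function names, the `rate` parameter and the toy data are all hypothetical.

```python
import math


def poisson_sf(k, lam):
    """Upper-tail probability P(X >= k) for X ~ Poisson(lam)."""
    if k <= 0:
        return 1.0
    term = math.exp(-lam)  # P(X = 0)
    cdf = term
    for i in range(1, k):  # accumulate P(X <= k - 1)
        term *= lam / i
        cdf += term
    return max(0.0, 1.0 - cdf)


def window_pvalues(hits, width, rate):
    """P-value of the total hit count in each sliding window.

    ASSUMPTION: under the null, hits per position are Poisson with a
    constant mean `rate`, so a window total is Poisson(rate * width).
    """
    if width < 1 or width > len(hits):
        raise ValueError("width must be between 1 and len(hits)")
    lam = rate * width
    total = sum(hits[:width])
    pvals = [poisson_sf(total, lam)]
    for start in range(1, len(hits) - width + 1):
        # Update the running total: add the entering position, drop the leaving one.
        total += hits[start + width - 1] - hits[start - 1]
        pvals.append(poisson_sf(total, lam))
    return pvals


def dtw_distance(a, b):
    """Unconstrained dynamic time warping distance with squared point costs."""
    inf = float("inf")
    prev = [0.0] + [inf] * len(b)
    for x in a:
        cur = [inf]
        for j, y in enumerate(b, start=1):
            cost = (x - y) ** 2
            # Extend the cheapest of the three admissible predecessor cells.
            cur.append(cost + min(prev[j], prev[j - 1], cur[j - 1]))
        prev = cur
    return math.sqrt(prev[-1])


# Toy hit count series (hypothetical data, not taken from the paper).
hits = [0, 0, 1, 0, 0, 5, 6, 4, 0, 0]
pvals = window_pvalues(hits, width=3, rate=1.0)
most_unusual = min(range(len(pvals)), key=pvals.__getitem__)
```

Here `most_unusual` is the start index of the window with the smallest p-value, which for the toy series is the window covering the burst of 5, 6 and 4 hits. In the instance-based setting, `dtw_distance` would be evaluated between each candidate window and a template derived from known sRNAs, with small distances marking close matches.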
