An index structure for pattern similarity searching in DNA microarray data

DNA microarray technology is about to bring an explosion of gene expression data that may dwarf even the human sequencing projects. Researchers want to identify genes whose expression levels rise and fall coherently under a set of experimental perturbations, that is, genes whose expression levels fluctuate in a similar shape when conditions change. In this paper, we show that pattern-correlation queries against large-scale microarray databases can be supported by the weighted-sequence model, an index structure designed for sequence matching. A weighted-sequence is a two-dimensional structure in which each element of the sequence is associated with a weight. We transform the DNA microarray data, as well as the pattern-based queries, into weighted-sequences and use subsequence matching algorithms to retrieve from the database all genes that match the query pattern. Using both synthetic and real-world data sets, we demonstrate that our method is effective and efficient.
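
To make the notion of "fluctuation of a similar shape" concrete, the following is a minimal Python sketch of one possible reading: a gene matches a query pattern if its expression profile equals the query up to a constant offset, within a tolerance. This is a brute-force scan written only for illustration. It is not the weighted-sequence index described in the abstract, and the function names, the tolerance parameter `eps`, and the sample values are all assumptions that do not come from the paper.

```python
from typing import Dict, List, Sequence


def matches_shifted_pattern(profile: Sequence[float],
                            query: Sequence[float],
                            eps: float) -> bool:
    """Return True if `profile` follows `query` up to a constant shift.

    Assumption (not from the paper): two profiles have a similar shape
    when the per-condition offsets profile[i] - query[i] differ from one
    another by at most `eps`.
    """
    if len(profile) != len(query) or len(profile) == 0:
        raise ValueError("profile and query must be non-empty and equally long")
    offsets = [p - q for p, q in zip(profile, query)]
    return max(offsets) - min(offsets) <= eps


def scan(database: Dict[str, Sequence[float]],
         query: Sequence[float],
         eps: float) -> List[str]:
    """Linear scan over all genes; an index would avoid visiting every gene."""
    return [name for name, profile in database.items()
            if matches_shifted_pattern(profile, query, eps)]


# Hypothetical expression levels under three conditions.
database = {
    "g1": [10, 30, 20],     # same shape as the query, shifted by 10
    "g2": [110, 130, 120],  # same shape as the query, shifted by 110
    "g3": [10, 20, 30],     # different shape
}
query = [0, 20, 10]
matching_genes = scan(database, query, eps=1)
```

Here `g1` and `g2` rise and fall together with the query even though their absolute levels differ, while `g3` does not. A scan like this touches every gene in the database, which is the cost an index structure is meant to avoid.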
