Near-Neighbor Search in Pattern Distance Spaces

In this paper, we study the near-neighbor problem under pattern similarity, a new type of similarity that conventional distance metrics such as the Lp norm cannot model effectively. The problem is nonetheless important to many applications. For example, in DNA microarray analysis, the expression levels of two closely related genes may rise and fall under different external conditions or at different times. Although the magnitudes of their expression levels may not be close, the patterns they exhibit over time or across conditions can be very similar. We measure the distance between two objects by pattern similarity, i.e., whether the two objects exhibit a synchronous pattern of rise and fall under different conditions. We then present an efficient algorithm for near-neighbor search based on pattern similarity, and we evaluate its effectiveness on several real and synthetic data sets.
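To make the contrast with Lp norms concrete, here is a minimal sketch of one possible pattern distance. The abstract does not give the paper's formal definition, so the function below is an assumption: it scores two objects by the spread of their per-condition differences, which is zero exactly when one profile is a constant shift of the other. The names `pattern_distance`, `lp_distance`, and the gene profiles are hypothetical and chosen only for illustration.

```python
from typing import Sequence


def pattern_distance(x: Sequence[float], y: Sequence[float]) -> float:
    """Assumed shift-invariant distance: spread of the per-condition differences.

    Equals the maximum over condition pairs (i, j) of
    |(x[i] - y[i]) - (x[j] - y[j])|, so it is 0 when y is x plus a constant.
    This is an illustrative choice, not necessarily the paper's definition.
    """
    if len(x) != len(y) or len(x) == 0:
        raise ValueError("profiles must be non-empty and of equal length")
    diffs = [a - b for a, b in zip(x, y)]
    return max(diffs) - min(diffs)


def lp_distance(x: Sequence[float], y: Sequence[float], p: float = 2.0) -> float:
    """Conventional Lp norm of the difference vector, for comparison."""
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)


# Hypothetical expression levels under four conditions.
gene_a = [10.0, 40.0, 20.0, 50.0]
gene_b = [110.0, 140.0, 120.0, 150.0]  # same rise and fall as gene_a, shifted by 100
gene_c = [40.0, 10.0, 50.0, 20.0]      # similar magnitude, opposite rise and fall

if __name__ == "__main__":
    for name, other in (("gene_b", gene_b), ("gene_c", gene_c)):
        print(name,
              "L2 =", lp_distance(gene_a, other),
              "pattern =", pattern_distance(gene_a, other))
```

Under these assumptions, gene_b is far from gene_a in L2 distance but has pattern distance 0, while gene_c is closer in L2 distance yet has a large pattern distance. A brute-force scan with such a function would answer near-neighbor queries in time linear in the data size; the efficient algorithm the paper refers to is not reproduced here.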
