Paper information - Learning Approximate Sequential Patterns for Classification

In this paper, we present an automated approach to discover patterns that can distinguish between sequences belonging to different labeled groups. Our method searches for approximately conserved motifs that occur with varying statistical properties in positive and negative training examples. We propose a two-step process to discover such patterns. Using locality sensitive hashing (LSH), we first estimate the frequency of all subsequences and their approximate matches within a given Hamming radius in labeled examples. The discriminative ability of each pattern is then assessed from the estimated frequencies by concordance and rank sum testing. The use of LSH to identify approximate matches for each candidate pattern helps reduce the runtime of our method. Space requirements are reduced by decomposing the search problem into an iterative method that uses a single LSH table in memory. We propose two further optimizations to the search for discriminative patterns. Clustering with redundancy based on a 2-approximate solution of the k-center problem decreases the number of overlapping approximate groups while providing exhaustive coverage of the search space. Sequential statistical methods allow the search process to use data from only as many training examples as are needed to assess significance. We evaluated our algorithm on data sets from different applications to discover sequential patterns for classification. On nucleotide sequences from the Drosophila genome compared with random background sequences, our method was able to discover approximate binding sites that were preserved upstream of genes. We observed a similar result in experiments on ChIP-on-chip data. For cardiovascular data from patients admitted with acute coronary syndromes, our pattern discovery approach identified approximately conserved sequences of morphology variations that were predictive of future death in a test population. 
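As a rough illustration of the two-step process described above, the following Python sketch estimates how often a candidate pattern approximately occurs in each sequence and then scores the pattern with a rank-sum statistic. It is a minimal sketch under stated assumptions, not the authors' implementation: the hash family is bit sampling (keying each window on a few randomly chosen positions, a standard LSH family for Hamming distance), the tables are built and discarded one at a time so that only a single table is in memory, and the names `approx_match_counts`, `rank_sum_u`, `n_tables`, and `n_sampled` are hypothetical.

```python
import random
from collections import defaultdict


def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def approx_match_counts(pattern, sequences, radius, n_tables=10, n_sampled=3, seed=0):
    """Per-sequence count of windows within `radius` of `pattern`.

    Only windows that collide with the pattern in some LSH table are verified,
    so the result is an estimate (a lower bound on the exact count).
    Assumes n_sampled <= len(pattern).
    """
    w = len(pattern)
    rng = random.Random(seed)
    # Every length-w window, tagged with the index of its source sequence.
    wins = [(si, seq[i:i + w])
            for si, seq in enumerate(sequences)
            for i in range(len(seq) - w + 1)]
    matched = set()
    for _ in range(n_tables):
        # One table at a time: sample positions, hash all windows, then discard.
        positions = rng.sample(range(w), n_sampled)
        table = defaultdict(list)
        for wi, (si, s) in enumerate(wins):
            table[tuple(s[p] for p in positions)].append(wi)
        key = tuple(pattern[p] for p in positions)
        for wi in table.get(key, []):
            # Verify each candidate with the true Hamming distance.
            if hamming(wins[wi][1], pattern) <= radius:
                matched.add(wi)
    counts = [0] * len(sequences)
    for wi in matched:
        counts[wins[wi][0]] += 1
    return counts


def rank_sum_u(pos, neg):
    """Mann-Whitney U: pairs (p, n) with p > n, ties counted as one half."""
    return sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)


def concordance(pos, neg):
    """U scaled to [0, 1]; 0.5 means the pattern does not separate the groups."""
    return rank_sum_u(pos, neg) / (len(pos) * len(neg))
```

For example, `concordance(approx_match_counts(p, positives, 1), approx_match_counts(p, negatives, 1))` gives a score near 1 when pattern `p` occurs more often in the positive examples. Windows that match exactly always collide with the pattern, while near matches are found only with some probability per table, which is why several tables are used.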
Our data showed that the use of LSH, clustering, and sequential statistics improved the running time of the search algorithm by an order of magnitude without any noticeable effect on accuracy. These results suggest that our methods may allow for an unsupervised approach to efficiently learn interesting dissimilarities between positive and negative examples that may have a functional role.
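The clustering step mentioned in the abstract can be sketched in Python as well. The abstract does not say which 2-approximate k-center algorithm is used, so this example assumes farthest-first traversal, a well-known greedy 2-approximation, with Hamming distance between equal-length patterns. The reading of "redundancy" as assigning each pattern to every center within the covering radius is also an assumption, and the function names are hypothetical.

```python
def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def farthest_first_centers(items, k):
    """Greedy k-center: repeatedly add the item farthest from the chosen centers.

    Returns (center indices, covering radius). The radius is at most twice
    the optimal k-center radius.
    """
    centers = [0]  # arbitrary first center
    dist = [hamming(s, items[0]) for s in items]
    while len(centers) < min(k, len(items)):
        nxt = max(range(len(items)), key=dist.__getitem__)
        if dist[nxt] == 0:
            break  # every item already coincides with a center
        centers.append(nxt)
        # Distance from each item to its nearest center so far.
        dist = [min(d, hamming(s, items[nxt])) for d, s in zip(dist, items)]
    return centers, max(dist)


def redundant_clusters(items, centers, radius):
    """Assign each item to every center within `radius` (clusters may overlap)."""
    return {c: [i for i, s in enumerate(items) if hamming(s, items[c]) <= radius]
            for c in centers}
```

Using the returned covering radius in `redundant_clusters` guarantees that every pattern falls in at least one cluster, so the search space stays fully covered while fewer approximate groups need to be scored.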
