A tree-based approach for motif discovery and sequence classification

MOTIVATION: Pattern discovery algorithms are widely used for the analysis of DNA and protein sequences. Most algorithms have been designed to find overrepresented motifs in sparse datasets of long sequences, and ignore most positional information. We introduce an algorithm optimized to exploit spatial information in sparse-but-populous datasets.

RESULTS: Our algorithm, Tree-based Weighted-Position Pattern Discovery and Classification (T-WPPDC), supports both unsupervised pattern discovery and supervised sequence classification. It computes the Kullback-Leibler distance between foreground and background sequences at each position, and uses this spatial information to discover positionally enriched patterns. T-WPPDC then uses a scoring function to discriminate between biological classes. We validated T-WPPDC on an important biological problem: prediction of single nucleotide polymorphisms (SNPs) from flanking sequence. We evaluated 672 separate experiments on 120 datasets derived from multiple species. T-WPPDC outperformed other pattern discovery methods and was comparable to the supervised machine learning algorithms. The algorithm is computationally efficient and largely insensitive to dataset size. It allows arbitrary parameterization and is embarrassingly parallelizable.

CONCLUSIONS: T-WPPDC is a minimally parameterized algorithm for both pattern discovery and sequence classification that directly incorporates positional information. We use it to confirm the predictability of SNPs from flanking sequence, and show that positional information is key to this biological problem.

AVAILABILITY: The algorithm, code and data are available at: http://www.cs.utoronto.ca/~juris/data/TWPPDC
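The positional step described in the results, a Kullback-Leibler distance between foreground and background sequences at each position, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the DNA alphabet, the pseudocount smoothing and the assumption that all sequences are aligned and of equal length are illustrative choices, and the tree construction and scoring function are not covered because the abstract does not specify them.

```python
import math
from collections import Counter

ALPHABET = "ACGT"  # assumption: DNA sequences; other symbols (e.g. N) are ignored


def position_frequencies(seqs, pseudocount=1.0):
    """Return one {base: frequency} dict per column of equal-length sequences.

    The pseudocount (an assumption, not taken from the paper) keeps every
    frequency non-zero so the logarithm below is always defined.
    """
    length = len(seqs[0])
    if any(len(s) != length for s in seqs):
        raise ValueError("all sequences must have the same length")
    freqs = []
    for i in range(length):
        counts = Counter(s[i] for s in seqs)
        total = sum(counts[b] for b in ALPHABET) + pseudocount * len(ALPHABET)
        freqs.append({b: (counts[b] + pseudocount) / total for b in ALPHABET})
    return freqs


def positional_kl(foreground, background, pseudocount=1.0):
    """Kullback-Leibler divergence (in bits) of foreground from background, per position."""
    fg = position_frequencies(foreground, pseudocount)
    bg = position_frequencies(background, pseudocount)
    if len(fg) != len(bg):
        raise ValueError("foreground and background must have the same length")
    return [
        sum(p[b] * math.log2(p[b] / q[b]) for b in ALPHABET)
        for p, q in zip(fg, bg)
    ]


if __name__ == "__main__":
    # Toy data: the first two positions are conserved in the foreground only.
    foreground = ["ACGT", "ACGA", "ACGC", "ACGG"]
    background = ["AAGT", "CCGA", "GGGC", "TTGG"]
    for pos, score in enumerate(positional_kl(foreground, background)):
        print(pos, round(score, 3))
```

Positions whose base composition differs most between the two sets receive the largest values, so a ranking of these scores is one plausible way to weight positions before searching for patterns. Each position is scored independently of the others, which is consistent with the abstract's remark that the method is embarrassingly parallelizable.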
