Identifying cis-regulatory modules by an alignment-free statistic D2S

Identifying cis-regulatory modules (CRMs) is an important challenge in molecular biology, and computational methods remain the main way to find them. These methods, however, generally suffer from a high false positive rate, which parameter optimization can help reduce. To overcome this deficiency of traditional CRM identification methods, an alignment-free statistic, D2S, is proposed for predicting the sites of CRMs. Two other statistics, D2 and D2star, are evaluated for comparison. The results show that D2S is the most accurate of the three statistics across the tested values of the k-tuple length k and the Markov order M. Tuning k and M according to the AUC shows that D2S performs very well at k = 7 and M = 1. The D2S statistic can therefore be used to predict the sites of CRMs and to increase the sensitivity and specificity of CRM prediction software.
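The abstract does not state the formulas behind the three statistics, so the following Python sketch relies on the definitions commonly given in the alignment-free comparison literature: D2 sums the products of k-tuple counts, while D2S and D2star use counts centered by their expectation under an order-M Markov background. The function names, the pseudocount, and the choice to estimate the background from each sequence itself are assumptions made for illustration, not the authors' implementation, and the sketch assumes sequences over A, C, G, T only.

```python
from collections import Counter
from itertools import product
from math import sqrt

ALPHABET = "ACGT"  # assumption: input contains only these four letters


def kmer_counts(seq, k):
    """Count every overlapping k-tuple in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def make_word_prob(seq, order, pseudo=1.0):
    """Return a function giving a word's probability under an order-M Markov
    model estimated from seq (hypothetical helper; pseudocount is an assumption)."""
    c = kmer_counts(seq, order + 1)
    total = sum(c.values()) + pseudo * 4 ** (order + 1)

    def ctx_mass(ctx):
        # Smoothed number of times the context ctx is followed by any letter.
        return sum(c[ctx + b] for b in ALPHABET) + 4 * pseudo

    def prob(word):
        p = ctx_mass(word[:order]) / total  # probability of the first M letters
        for i in range(order, len(word)):
            ctx = word[i - order:i]
            p *= (c[ctx + word[i]] + pseudo) / ctx_mass(ctx)  # transition
        return p

    return prob


def d2_statistics(x, y, k, order):
    """Compute D2, D2S and D2star between sequences x and y."""
    if k <= order:
        raise ValueError("k must exceed the Markov order M")
    cx, cy = kmer_counts(x, k), kmer_counts(y, k)
    px, py = make_word_prob(x, order), make_word_prob(y, order)
    nx, ny = len(x) - k + 1, len(y) - k + 1
    d2 = d2s = d2star = 0.0
    for w in map("".join, product(ALPHABET, repeat=k)):
        ex, ey = nx * px(w), ny * py(w)    # expected counts under the background
        xt, yt = cx[w] - ex, cy[w] - ey    # centered counts
        d2 += cx[w] * cy[w]
        denom = sqrt(xt * xt + yt * yt)
        if denom > 0:
            d2s += xt * yt / denom
        d2star += xt * yt / sqrt(ex * ey)
    return {"D2": d2, "D2S": d2s, "D2star": d2star}


if __name__ == "__main__":
    # Small k keeps the demo fast; the abstract reports k = 7, M = 1 as best.
    print(d2_statistics("ACGTACGTTGCA", "ACGTTGCATGCA", k=3, order=1))
```

In this sketch a larger score indicates greater similarity between the two sequences. How the scores are thresholded or combined to call CRM sites is not described in the abstract and is left out here.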
