Signal detection in genome sequences using complexity based features

In this work, we evaluate complexity methods and measures for finding interesting signals in the whole genomes of three prokaryotic organisms. In addition to previously proposed complexity measures, we introduce new measures for representing Open Reading Frames (ORFs). We apply different classification algorithms to determine which complexity measure yields the best predictive performance in discriminating genes from pseudo-genes among ORFs. We also investigate whether the positions and lengths of windows within ORFs have a significant impact on distinguishing genes from pseudo-genes.
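As a rough illustration of what a window-based complexity feature can look like, the sketch below computes the Shannon entropy of the k-mer distribution in successive windows of an ORF. This is an assumed stand-in measure chosen for brevity, not necessarily one of the measures evaluated in this work; the function names, the window length, the step size, and the value of k are all hypothetical.

```python
import math
from collections import Counter


def kmer_entropy(seq, k=3):
    """Shannon entropy (in bits) of the k-mer distribution in seq.

    Illustrative complexity measure only; an assumption, not the paper's.
    """
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    if not kmers:
        return 0.0
    counts = Counter(kmers)
    total = len(kmers)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def window_features(orf, window=30, step=30, k=3):
    """Return (start position, entropy) for each full window along the ORF."""
    return [
        (start, kmer_entropy(orf[start:start + window], k))
        for start in range(0, len(orf) - window + 1, step)
    ]
```

Feature vectors of this kind, one value per window position, could then be passed to any standard classifier to separate genes from pseudo-genes; varying `window` and `step` is one way to probe the effect of window length and position.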
