Signal processing methods for genomic sequence analysis

Signal processing is the art of representing, transforming, analyzing, and manipulating signals. It deals with a wide range of signals, from speech and audio to images and video. Signal processing techniques have proved useful in diverse applications. Traditional applications include signal enhancement, denoising, speech recognition, audio and image compression, radar signal processing, and digital communications, to name just a few. In recent years, signal processing techniques have also been applied to the analysis of biological data with considerable success. For example, they have been used for predicting protein-coding genes, analyzing ECG signals and MRI data, enhancing and normalizing DNA microarray images, and modeling gene regulatory networks. In this thesis, we consider the application of signal processing methods to the analysis of biological sequences, especially DNA and RNA molecules. We demonstrate how conventional signal processing techniques, such as digital filters and filter banks, can contribute to this end, and also show how we can extend traditional models, such as hidden Markov models (HMMs), to better serve this purpose.

The first part of the thesis focuses on signal processing methods for analyzing RNA sequences. The primary purposes of this part are to develop a statistical model suitable for representing RNA sequence profiles and to propose an effective framework for finding new homologues (i.e., similar RNAs that are biologically related) of known RNAs. Many functional RNAs have secondary structures that are well conserved among different species. The RNA secondary structure gives rise to long-range correlations between distant bases, which cannot be represented using traditional HMMs. To overcome this problem, we propose a new statistical model called the context-sensitive HMM (csHMM).
The csHMM is an extension of the traditional HMM in which certain states have variable emission and transition probabilities that depend on the context. This context-sensitive property significantly increases the descriptive power of the model, making csHMMs capable of representing long-range correlations between distant symbols. Based on the proposed model, we present efficient algorithms for finding the optimal state sequence and for computing the probability of an observed symbol string. We also present a training algorithm for optimizing the parameters of a csHMM. We give several examples that illustrate how csHMMs can be used for modeling and recognizing various RNA secondary structures.

Based on the concept of the csHMM, we introduce profile-csHMMs, which are specifically constructed csHMMs that have linear repetitive structures (i.e., state-transition diagrams). Profile-csHMMs are especially useful for building probabilistic representations of RNA sequence families, including those with pseudoknots. We also propose a dynamic programming algorithm, called the sequential component adjoining (SCA) algorithm, that systematically finds the optimal state sequence of an observed symbol string based on a profile-csHMM. To demonstrate the effectiveness of profile-csHMMs, we build a structural alignment tool for RNA sequences and show that the profile-csHMM approach can yield highly accurate predictions at a relatively low computational cost. Finally, we describe how the profile-csHMM can be used for finding homologous RNAs, and we propose a practical scheme for making the search significantly faster without affecting the prediction accuracy.

In the second part of the thesis, we focus on the application of digital filters and filter banks in DNA sequence analysis. Firstly, we demonstrate how digital filters can be used for predicting protein-coding genes.
Many coding regions in DNA molecules are known to display a period-3 behavior, which can be effectively detected using digital filters. We propose efficient schemes for designing such filters. Experimental results show that the digital filtering approach can clearly identify the coding regions at a very low computational cost. Secondly, we propose a method based on a bank of IIR lowpass filters for predicting CpG islands, which are specific regions in DNA molecules that are abundant in the dinucleotide CpG. The filter bank processes the sequence of log-likelihood ratios obtained from two Markov chains, which model the base transition probabilities inside and outside the CpG islands, respectively. The locations of the CpG islands are predicted by analyzing the output signals of the filter bank. We show that the filter bank approach can yield reliable prediction results without sacrificing the resolution of the predicted start/end positions of the CpG islands.
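
The CpG island scheme described above can be illustrated with a minimal Python sketch. It is not the thesis's actual design: the transition tables `INSIDE` and `OUTSIDE` are toy values chosen only for illustration, the one-pole filter form y[n] = a·y[n-1] + (1-a)·x[n] and the pole values in `filter_bank` are assumptions, and all function names are hypothetical. The sketch only shows the general idea of turning base transitions into log-likelihood ratios and smoothing them with several IIR lowpass filters of different bandwidths.

```python
import math

# Toy first-order Markov chains P(next base | current base).
# These numbers are illustrative assumptions, not values from the thesis.
INSIDE = {b: {"A": 0.15, "C": 0.35, "G": 0.35, "T": 0.15} for b in "ACGT"}
OUTSIDE = {b: {"A": 0.30, "C": 0.20, "G": 0.20, "T": 0.30} for b in "ACGT"}
# Outside the islands, a C followed by a G is made rare (CpG depletion).
OUTSIDE["C"] = {"A": 0.35, "C": 0.25, "G": 0.05, "T": 0.35}


def log_likelihood_ratios(seq):
    """One log-likelihood ratio per base transition (len(seq) - 1 values)."""
    return [
        math.log(INSIDE[a][b] / OUTSIDE[a][b])
        for a, b in zip(seq, seq[1:])
    ]


def one_pole_lowpass(x, alpha):
    """Assumed filter form: y[n] = alpha * y[n-1] + (1 - alpha) * x[n]."""
    y, prev = [], 0.0
    for value in x:
        prev = alpha * prev + (1.0 - alpha) * value
        y.append(prev)
    return y


def filter_bank(x, alphas=(0.90, 0.95, 0.98)):
    """Run the same input through several lowpass filters.

    A pole closer to 1 gives a narrower passband: smoother output,
    but slower response at island boundaries.
    """
    return {alpha: one_pole_lowpass(x, alpha) for alpha in alphas}


def island_mask(y, threshold=0.0):
    """Mark positions whose smoothed log-likelihood ratio exceeds a threshold."""
    return [value > threshold for value in y]
```

For example, calling `filter_bank(log_likelihood_ratios("AT" * 50 + "CG" * 50 + "AT" * 50))` yields one smoothed signal per pole value, and `island_mask` applied to any of them marks the positions where the island model is favored. Comparing the outputs across the bank shows the trade-off the abstract alludes to: heavier smoothing gives steadier decisions, while lighter smoothing localizes the start and end positions more sharply.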
