Discovery of Unconventional Patterns for Sequence Analysis: Theory and Algorithms

The pattern discovery task (or, equivalently, motif inference) is the knowledge discovery process that, given a dataset and some constraints either on the combinatorial structure of the patterns or on their occurrence lists, returns all the patterns satisfying those constraints. In this thesis, we consider the problem of discovering patterns in sequential data, such as texts, biological sequences, and access logs. Several classes of patterns have been proposed in the literature. Examples include rigid patterns like p = c◦tc, where the don’t care symbol ◦ matches any single character of the input alphabet Σ, and gapped patterns like p = ctt-{2,3}-tc, where the gap -{2,3}- stands for a sequence of either 2 or 3 don’t cares (we refer to [154] for a thorough discussion of these classes of patterns). The adjective “unconventional” in the title of this thesis refers to the unusual combinatorial structure of the patterns we investigate. While the classic literature of this field focuses on string patterns (possibly with wildcards), our line of research explores three different kinds of patterns: mask patterns, where each pattern represents a set of string patterns with wildcards; permutation patterns, where each pattern is a multiset of characters and the order of the contained symbols does not matter; and transposons, which, roughly speaking, represent the non-conserved regions of a global alignment.
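
To make the pattern classes above concrete, the following is a minimal sketch of how occurrences of a rigid pattern, of a gapped pattern (a prefix, a gap of 2 or 3 don’t cares, and a suffix), and of a permutation pattern could be located by brute force. It is an illustration only, not the algorithms developed in this thesis: the function names, the representation of a gapped pattern as a (prefix, gap lengths, suffix) triple, and the fixed-length window used for permutation patterns are all assumptions made for this example.

```python
from collections import Counter

DONT_CARE = "◦"  # assumed encoding of the don't care symbol


def rigid_occurrences(pattern, text):
    """Start positions where a rigid pattern matches the text.

    A DONT_CARE in the pattern matches any single character.
    """
    m = len(pattern)
    return [
        i
        for i in range(len(text) - m + 1)
        if all(p == DONT_CARE or p == c for p, c in zip(pattern, text[i:i + m]))
    ]


def gapped_occurrences(prefix, gap_lengths, suffix, text):
    """Start positions where prefix, a gap of an allowed length, and suffix
    occur one after the other (hypothetical triple representation)."""
    hits = []
    for i in range(len(text)):
        if not text.startswith(prefix, i):
            continue
        for g in gap_lengths:
            j = i + len(prefix) + g  # where the suffix would have to start
            if j + len(suffix) <= len(text) and text.startswith(suffix, j):
                hits.append(i)
                break  # report each start position once
    return hits


def permutation_occurrences(multiset, text):
    """Start positions of windows whose characters form the given multiset,
    regardless of their order."""
    target = Counter(multiset)
    m = len(multiset)
    return [i for i in range(len(text) - m + 1) if Counter(text[i:i + m]) == target]
```

For instance, `rigid_occurrences("c◦tc", "cgtcatc")` reports positions 0 and 3, since both `cgtc` and `catc` agree with the pattern on every solid character. These scans take time proportional to the text length times the pattern length and only check a given pattern; discovering all patterns that satisfy a set of constraints is the harder task the thesis addresses.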

[1]  H. Mannila,et al.  Discovering all most specific sentences , 2003, TODS.

[2]  B. Mcclintock,et al.  Controlling elements and the gene. , 1956, Cold Spring Harbor symposia on quantitative biology.

[3]  P. Mieczkowski,et al.  Recombination between retrotransposons as a source of chromosome rearrangements in the yeast Saccharomyces cerevisiae. , 2006, DNA repair.

[4]  W. J. Kent,et al.  Conservation, regulation, synteny, and introns in a large-scale C. briggsae-C. elegans genomic alignment. , 2000, Genome research.

[5]  Roberto Grossi,et al.  Compressed Suffix Arrays and Suffix Trees with Applications to Text Indexing and String Matching , 2005, SIAM J. Comput..

[6]  Susumu Ohno Evolution by Gene Duplication , 1970, Springer Berlin Heidelberg.

[7]  Maxime Crochemore,et al.  Algorithms on strings , 2007 .

[8]  J. V. Moran,et al.  Initial sequencing and analysis of the human genome. , 2001, Nature.

[9]  Gonzalo Navarro,et al.  Compressed full-text indexes , 2007, CSUR.

[10]  Alfred V. Aho,et al.  Efficient string matching , 1975, Commun. ACM.

[11]  Gianni Franceschini Proximity Mergesort: optimal in-place sorting in the cache-oblivious model , 2004, SODA '04.

[12]  Jeremy Buhler,et al.  Designing patterns for profile HMM search , 2007, Bioinform..

[13]  Roberto Grossi,et al.  Inferring Mobile Elements in S. Cerevisiae Strains , 2011, BIOINFORMATICS.

[14]  Mathieu Raffinot,et al.  An algorithmic view of gene teams , 2004, Theor. Comput. Sci..

[15]  Arthur Chun-Chieh Shih,et al.  GS-Aligner: a novel tool for aligning genomic sequences using bit-level operations. , 2003, Molecular biology and evolution.

[16]  Giorgio Satta,et al.  Efficient text fingerprinting via Parikh mapping , 2003, J. Discrete Algorithms.

[17]  Jens Stoye,et al.  Quadratic Time Algorithms for Finding Common Intervals in Two and More Sequences , 2004, CPM.

[18]  D. Eisenberg,et al.  Detecting protein function and protein-protein interactions from genome sequences. , 1999, Science.

[19]  Kellogg S. Booth PQ-tree algorithms. , 1975 .

[20]  Robert Giegerich,et al.  mkESA: enhanced suffix array construction tool , 2009, Bioinform..

[21]  Wen-Lian Hsu PC-Trees vs. PQ-Trees , 2001, COCOON.

[22]  Peter Weiner,et al.  Linear Pattern Matching Algorithms , 1973, SWAT.

[23]  Gregory Kucherov,et al.  A unifying framework for seed sensitivity and its application to subset seeds , 2006, J. Bioinform. Comput. Biol..

[24]  Lucian Ilie,et al.  Fast Computation of Good Multiple Spaced Seeds , 2007, WABI.

[25]  F. Blattner,et al.  Mauve: multiple alignment of conserved genomic sequence with rearrangements. , 2004, Genome research.

[26]  Kellogg S. Booth,et al.  Testing for the Consecutive Ones Property, Interval Graphs, and Graph Planarity Using PQ-Tree Algorithms , 1976, J. Comput. Syst. Sci..

[27]  David Eppstein,et al.  Sparse dynamic programming I: linear cost functions , 1992, JACM.

[28]  D. Finnegan,et al.  Eukaryotic transposable elements and genome evolution. , 1989, Trends in genetics : TIG.

[29]  Leslie G. Valiant,et al.  The Complexity of Computing the Permanent , 1979, Theor. Comput. Sci..

[30]  Xin He,et al.  Identifying Conserved Gene Clusters in the Presence of Homology Families , 2005, J. Comput. Biol..

[31]  Maxime Crochemore,et al.  Bases of motifs for generating repeated patterns with wild cards , 2005, IEEE/ACM Transactions on Computational Biology and Bioinformatics.

[32]  D. R. Fulkerson,et al.  Incidence matrices and interval graphs , 1965 .

[33]  David Sankoff,et al.  Genome rearrangement with gene families , 1999, Bioinform..

[34]  Nicholas L. Bray,et al.  AVID: A global alignment program. , 2003, Genome research.

[35]  Esko Ukkonen,et al.  On the complexity of finding gapped motifs , 2008, J. Discrete Algorithms.

[36]  John S Mattick,et al.  Increasing biological complexity is positively correlated with the relative genome-wide expansion of non-protein-coding DNA sequences , 2003, Genome Biology.

[37]  João Meidanis,et al.  On the Consecutive Ones Property , 1998, Discret. Appl. Math..

[38]  Bin Ma,et al.  PatternHunter: faster and more sensitive homology search , 2002, Bioinform..

[39]  Hiroki Arimura,et al.  Linear-Time Longest-Common-Prefix Computation in Suffix Arrays and Its Applications , 2001, CPM.

[40]  Jitender S. Deogun,et al.  EMAGEN: An Efficient Approach to Multiple Whole Genome Alignment , 2004, APBC.

[41]  G. Blin,et al.  The breakpoint distance for signed sequences , 2005 .

[42]  S. Salzberg,et al.  Fast algorithms for large-scale genome alignment and comparison. , 2002, Nucleic acids research.

[43]  João Meidanis,et al.  Determining DNA Sequence Similarity Using Maximum Independent Set Algorithms for Interval Graphs , 1992, SWAT.

[44]  Salim Haddadi,et al.  Consecutive block minimization is 1.5-approximable , 2008, Inf. Process. Lett..

[45]  Alok Aggarwal,et al.  The input/output complexity of sorting and related problems , 1988, CACM.

[46]  Javier Tamames,et al.  Evolution of gene order conservation in prokaryotes , 2001, Genome Biology.

[47]  Burkhard Morgenstern,et al.  DIALIGN2: Improvement of the segment to segment approach to multiple sequence alignment , 1999, German Conference on Bioinformatics.

[48]  N. Fedoroff,et al.  Transposable Elements As a Molecular Evolutionary Force , 1999, Annals of the New York Academy of Sciences.

[49]  E. Eichler,et al.  A preliminary comparative analysis of primate segmental duplications shows elevated substitution rates and a great-ape expansion of intrachromosomal duplications. , 2006, Genome research.

[50]  F. Crick,et al.  Selfish DNA: the ultimate parasite , 1980, Nature.

[51]  A. Arkin,et al.  The Life-Cycle of Operons , 2006, PLoS genetics.

[52]  Jean Vuillemin,et al.  A data structure for manipulating priority queues , 1978, CACM.

[53]  A. Furano,et al.  Fruit flies and humans respond differently to retrotransposons. , 2002, Current opinion in genetics & development.

[54]  Burkhard Morgenstern,et al.  A space-efficient algorithm for aligning large genomic sequences , 2000, Bioinform..

[55]  Robert C. Edgar,et al.  MUSCLE: multiple sequence alignment with high accuracy and high throughput. , 2004, Nucleic acids research.

[56]  I. Longden,et al.  EMBOSS: the European Molecular Biology Open Software Suite. , 2000, Trends in genetics : TIG.

[57]  Arnold L. Rosenberg,et al.  Rapid identification of repeated patterns in strings, trees and arrays , 1972, STOC.

[58]  Jean-Nicolas Volff,et al.  Transposable elements as drivers of genomic and biological diversity in vertebrates , 2008, Chromosome Research.

[59]  Enno Ohlebusch,et al.  Efficient multiple genome alignment , 2002, ISMB.

[60]  Cedric Chauve,et al.  Genes Order and Phylogenetic Reconstruction: Application to γ-Proteobacteria , 2005 .

[61]  Bart Goethals,et al.  Survey on Frequent Pattern Mining , 2003 .

[62]  Eugene W. Myers,et al.  Optimal alignments in linear space , 1988, Comput. Appl. Biosci..

[63]  Enno Ohlebusch,et al.  An Applications-focused Review of Comparative Genomics Tools: Capabilities, Limitations and Future Challenges , 2003, Briefings Bioinform..

[64]  Nadia El-Mabrouk,et al.  Seed-Based Exclusion Method for Non-coding RNA Gene Search , 2007, COCOON.

[65]  George Havas,et al.  Perfect Hashing , 1997, Theor. Comput. Sci..

[66]  Hanlee P. Ji,et al.  Next-generation DNA sequencing , 2008, Nature Biotechnology.

[67]  Leonid Khachiyan,et al.  On the Complexity of Dualization of Monotone Disjunctive Normal Forms , 1996, J. Algorithms.

[68]  Keith R. Oliver,et al.  Transposable elements: powerful facilitators of evolution , 2009, BioEssays : news and reviews in molecular, cellular and developmental biology.

[69]  R. Gregory The evolution of the genome , 2005 .

[70]  O. Gotoh An improved algorithm for matching biological sequences. , 1982, Journal of molecular biology.

[71]  Thomas Blumenthal,et al.  Operons in eukaryotes. , 2004, Briefings in functional genomics & proteomics.

[72]  Erik L. L. Sonnhammer,et al.  Kalign – an accurate and fast multiple sequence alignment algorithm , 2005, BMC Bioinformatics.

[73]  Kiem-Phong Vo,et al.  Heaviest Increasing/Common Subsequence Problems , 1992, CPM.

[74]  J. Risler,et al.  Identification of genomic features using microsyntenies of domains: domain teams. , 2005, Genome research.

[75]  D. Lipman,et al.  Rapid similarity searches of nucleic acid and protein data banks. , 1983, Proceedings of the National Academy of Sciences of the United States of America.

[76]  Sanjeev Arora,et al.  Computational Complexity: A Modern Approach , 2009 .

[77]  Dannie Durand,et al.  The Incompatible Desiderata of Gene Cluster Properties , 2005, Comparative Genomics.

[78]  Jeremy Buhler,et al.  Designing multiple simultaneous seeds for DNA similarity search , 2004, J. Comput. Biol..

[79]  B. Mcclintock The origin and behavior of mutable loci in maize , 1950, Proceedings of the National Academy of Sciences.

[80]  Michael Ian Shamos,et al.  Computational geometry: an introduction , 1985 .

[81]  Kris Popendorf,et al.  Murasaki: A Fast, Parallelizable Algorithm to Find Anchors from Multiple Genomes , 2010, PloS one.

[82]  Jeffrey Scott Vitter,et al.  Algorithms and Data Structures for External Memory , 2008, Found. Trends Theor. Comput. Sci..

[83]  Lawrence T. Kou,et al.  Polynomial Complete Consecutive Information Retrieval Problems , 1977, SIAM J. Comput..

[84]  Cedric Chauve,et al.  Formal Models of Gene Clusters , 2007 .

[85]  Wojciech Rytter,et al.  Usefulness of the Karp-Miller-Rosenberg Algorithm in Parallel Computations on Strings and Arrays , 1991, Theor. Comput. Sci..

[86]  Wei-Kuan Shih,et al.  A New Planarity Test , 1999, Theor. Comput. Sci..

[87]  Gregory Kucherov,et al.  Subset Seed Automaton , 2007, CIAA.

[88]  Enno Ohlebusch,et al.  The Enhanced Suffix Array and Its Applications to Genome Analysis , 2002, WABI.

[89]  Georg Gottlob,et al.  Computational aspects of monotone dualization: A brief survey , 2008, Discret. Appl. Math..

[90]  Wojciech Rytter,et al.  Jewels of stringology , 2002 .

[91]  Juha Kärkkäinen,et al.  Better Filtering with Gapped q-Grams , 2001, Fundam. Informaticae.

[92]  I King Jordan,et al.  Transposable elements and the evolution of eukaryotic complexity. , 2002, Current issues in molecular biology.

[93]  Jean-Paul Delahaye,et al.  Transformation distances: a family of dissimilarity measures based on movements of segments , 1999, Bioinform..

[94]  R. Ravi,et al.  Nonoverlapping Local Alignments (weighted Independent Sets of Axis-parallel Rectangles) , 1996, Discret. Appl. Math..

[95]  E. Eichler,et al.  Structural Dynamics of Eukaryotic Chromosome Evolution , 2003, Science.

[96]  D. Haussler,et al.  Human-mouse alignments with BLASTZ. , 2003, Genome research.

[97]  H. Kazazian,et al.  Mobile elements and disease. , 1998, Current opinion in genetics & development.

[98]  Tomasz Imielinski,et al.  Mining association rules between sets of items in large databases , 1993, SIGMOD Conference.

[99]  Shengzhong Feng,et al.  A fast and flexible approach to oligonucleotide probe design for genomes and gene families , 2007, Bioinform..

[100]  E. Myers,et al.  Basic local alignment search tool. , 1990, Journal of molecular biology.

[101]  Dimitrios Gunopulos,et al.  Data mining, hypergraph transversals, and machine learning (extended abstract) , 1997, PODS '97.

[102]  Eugene W. Myers,et al.  Suffix arrays: a new method for on-line string searches , 1993, SODA '90.

[103]  Gad M. Landau,et al.  A Combinatorial Approach to Automatic Discovery of Cluster-Patterns , 2003, WABI.

[104]  Michael Brudno,et al.  Fast and sensitive alignment of large genomic sequences , 2002, Proceedings. IEEE Computer Society Bioinformatics Conference.

[105]  Mei Li,et al.  MultiPipMaker and supporting tools: alignments and analysis of multiple genomic DNA sequences , 2003, Nucleic Acids Res..

[106]  Jens Stoye,et al.  Character sets of strings , 2007, J. Discrete Algorithms.

[107]  Robert P. Davey,et al.  Population genomics of domestic and wild yeasts , 2008, Nature.

[108]  A. Evsikov,et al.  Retrotransposons regulate host genes in mouse oocytes and preimplantation embryos. , 2004, Developmental cell.

[109]  Daniel S. Hirschberg,et al.  A linear space algorithm for computing maximal common subsequences , 1975, Commun. ACM.

[110]  C. Yanofsky,et al.  The complete nucleotide sequence of the tryptophan operon of Escherichia coli. , 1981, Nucleic acids research.

[111]  Leo Goodstadt,et al.  Phylogenetic Reconstruction of Orthology, Paralogy, and Conserved Synteny for Dog and Human , 2006, PLoS Comput. Biol..

[112]  D. Voytas,et al.  Transposable elements and genome organization: a comprehensive survey of retrotransposons revealed by the complete Saccharomyces cerevisiae genome sequence. , 1998, Genome research.

[113]  Bin Ma,et al.  On spaced seeds for similarity search , 2004, Discret. Appl. Math..

[114]  Chuong B. Do,et al.  doi: 10.1101/gr.926603 , 2003 .

[115]  S. Salzberg,et al.  Alignment of whole genomes. , 1999, Nucleic acids research.

[116]  Roberto Grossi,et al.  Counting the Orderings for Multisets in Consecutive Ones Property and PQ-Trees , 2011, Developments in Language Theory.

[117]  Roberto Grossi,et al.  Mining Biological Sequences with Masks , 2009, 2009 20th International Workshop on Database and Expert Systems Application.

[118]  Jarek Gryz,et al.  Algorithms and analyses for maximal vector computation , 2007, The VLDB Journal.

[119]  Reuven Bar-Yehuda,et al.  Scheduling split intervals , 2002, SODA '02.

[120]  Esko Ukkonen Structural Analysis of Gapped Motifs of a String , 2007, MFCS.

[121]  J. Bennetzen,et al.  A unified classification system for eukaryotic transposable elements , 2007, Nature Reviews Genetics.

[122]  Laxmi Parida Pattern Discovery in Bioinformatics: Theory & Algorithms , 2007 .

[123]  Hiroki Arimura,et al.  A Polynomial Space and Polynomial Delay Algorithm for Enumeration of Maximal Motifs in a Sequence , 2005, ISAAC.

[124]  Ronald L. Rivest,et al.  Introduction to Algorithms, Second Edition , 2001 .

[125]  Aleksandar Milosavljevic,et al.  Pash: efficient genome-scale sequence anchoring by Positional Hashing. , 2004, Genome research.

[126]  Sakti P. Ghosh File organization , 1972, Commun. ACM.

[127]  D. Higgins,et al.  T-Coffee: A novel method for fast and accurate multiple sequence alignment. , 2000, Journal of molecular biology.

[128]  K. Katoh,et al.  MAFFT version 5: improvement in accuracy of multiple sequence alignment , 2005, Nucleic acids research.

[129]  Eleazar Eskin,et al.  From profiles to patterns and back again: a branch and bound algorithm for finding near optimal motif profiles , 2004, RECOMB.

[130]  Vladimir Gurvich,et al.  An efficient implementation of a quasi-polynomial algorithm for generating hypergraph transversals and its application in joint generation , 2006, Discret. Appl. Math..

[131]  E. Mardis The impact of next-generation sequencing technology on genetics. , 2008, Trends in genetics : TIG.

[132]  D. Hickey Selfish DNA: a sexually-transmitted nuclear parasite. , 1982, Genetics.

[133]  R. Gibbs,et al.  PipMaker--a web server for aligning two genomic DNA sequences. , 2000, Genome research.

[134]  David Sankoff,et al.  Rearrangements and chromosomal evolution. , 2003, Current opinion in genetics & development.

[135]  Gad M. Landau,et al.  Gene Proximity Analysis across Whole Genomes via PQ Trees , 2005, J. Comput. Biol..

[136]  M. MacDonald,et al.  Relationship between trinucleotide repeat expansion and phenotypic variation in Huntington's disease , 1993, Nature Genetics.

[137]  D. Sankoff,et al.  Gene Order Breakpoint Evidence in Animal Mitochondrial Phylogeny , 1999, Journal of Molecular Evolution.

[138]  G. Kucherov,et al.  Multiseed lossless filtration , 2009, IEEE/ACM Transactions on Computational Biology and Bioinformatics.

[139]  Enno Ohlebusch,et al.  Replacing suffix trees with enhanced suffix arrays , 2004, J. Discrete Algorithms.

[140]  Charles E. Leiserson,et al.  Cache-Oblivious Algorithms , 2003, CIAC.

[141]  Donald E. Knuth,et al.  The art of computer programming: sorting and searching (volume 3) , 1973 .

[142]  S Karlin,et al.  An efficient algorithm for identifying matches with errors in multiple long molecular sequences. , 1991, Journal of molecular biology.

[143]  K. Engel Sperner Theory , 1996 .

[144]  Casey M. Bergman,et al.  Discovering and detecting transposable elements in genome sequences , 2007, Briefings Bioinform..

[145]  D. Haussler,et al.  Ultraconserved Elements in the Human Genome , 2004, Science.

[146]  C. Kuratowski Sur le problème des courbes gauches en Topologie , 1930 .

[147]  Jonathan D. Cohen,et al.  Recursive hashing functions for n-grams , 1997, TOIS.

[148]  Gilles Didier,et al.  Common Intervals of Two Sequences , 2003, WABI.

[149]  J. Mullikin,et al.  SSAHA: a fast search method for large DNA databases. , 2001, Genome research.

[150]  Elias C. Stavropoulos,et al.  An Efficient Algorithm for the Transversal Hypergraph Generation , 2022, Journal of Graph Algorithms and Applications.

[151]  Jens Stoye,et al.  Finding All Common Intervals of k Permutations , 2001, CPM.

[152]  W. J. Kent,et al.  BLAT--the BLAST-like alignment tool. , 2002, Genome research.

[153]  D. Altshuler,et al.  A map of human genome variation from population-scale sequencing , 2010, Nature.

[154]  Gonzalo Navarro,et al.  Storage and Retrieval of Highly Repetitive Sequence Collections , 2010, J. Comput. Biol..

[155]  Yuan Gao,et al.  Pattern discovery on character sets and real-valued data: linear bound on irredundant motifs and an efficient polynomial time algorithm , 2000, SODA '00.

[156]  Daniel H. Huson,et al.  Segment Match Refinement and Applications , 2002, WABI.

[157]  Takeaki Uno,et al.  Fast Algorithms to Enumerate All Common Intervals of Two Permutations , 1997, Algorithmica.

[158]  Roberto Grossi,et al.  Masking patterns in sequences: A new class of motif discovery with don't cares , 2009, Theor. Comput. Sci..

[159]  Laxmi Parida Statistical Significance of Large Gene Clusters , 2007, J. Comput. Biol..

[160]  Mathieu Raffinot,et al.  The Algorithmic of Gene Teams , 2002, WABI.