Novel Computational Methods for Large Scale Genome Comparison

The current wealth of available genomic data provides an unprecedented opportunity to compare and contrast the evolutionary histories of closely and distantly related organisms. This dissertation develops novel algorithms and software for efficient global and local comparison of multiple genomes and applies these methods to a biologically relevant case study. The research is organized into three successive phases: (1) multiple genome alignment of closely related species, (2) local multiple alignment of interspersed repeats, and (3) a comparative genomics case study of Neisseria. In Phase 1, we develop an efficient algorithm and data structure for maximal unique match search in multiple genome sequences. We implement these contributions in M-GCAT, an interactive multiple genome comparison and alignment tool that efficiently constructs multiple genome comparison frameworks for closely related species. In Phase 2, we present a novel computational method for local multiple alignment of interspersed repeats. The method performs gapped extension of chained seed matches, joining global multiple alignment with a homology test based on a hidden Markov model (HMM). In Phase 3, using the results of the previous two phases, we perform a case study of neisserial genomes, tracking the propagation of repeat sequence elements in an attempt to understand why the important pathogens of the neisserial group engage in sexual exchange of DNA by natural transformation. Taken together, the contributions of this dissertation focus on comparing and contrasting the evolutionary histories of related organisms via multiple alignment of genomes.
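To make the notion of a maximal unique match (MUM) from Phase 1 concrete, the following is a minimal brute-force sketch for two sequences. It is an illustration of the definition only, not the algorithm or data structure used in M-GCAT; the function names `count_occurrences` and `find_mums` and the `min_len` parameter are hypothetical. A match is reported when it cannot be extended to the left or right and its substring occurs exactly once in each sequence.

```python
def count_occurrences(text, pattern):
    """Count occurrences of pattern in text, allowing overlaps."""
    count = 0
    start = text.find(pattern)
    while start != -1:
        count += 1
        start = text.find(pattern, start + 1)
    return count


def find_mums(a, b, min_len=3):
    """Return (pos_in_a, pos_in_b, length) for each maximal unique match.

    Brute force, for illustration only: quadratic or worse in the
    sequence lengths, so unsuitable for whole genomes.
    """
    mums = []
    for i in range(len(a)):
        for j in range(len(b)):
            if a[i] != b[j]:
                continue
            # Skip starts that could be extended to the left (not left-maximal).
            if i > 0 and j > 0 and a[i - 1] == b[j - 1]:
                continue
            # Extend to the right as far as the sequences agree.
            k = 0
            while i + k < len(a) and j + k < len(b) and a[i + k] == b[j + k]:
                k += 1
            if k < min_len:
                continue
            match = a[i:i + k]
            # Keep the match only if it is unique in both sequences.
            if count_occurrences(a, match) == 1 and count_occurrences(b, match) == 1:
                mums.append((i, j, k))
    return mums


if __name__ == "__main__":
    print(find_mums("GATTACA", "CATTAGA"))  # [(1, 1, 4)]
```

In this toy input the only MUM of length at least 3 is `ATTA`, found at position 1 in both sequences. Practical tools avoid the pairwise scan by using indexed structures over the sequences, and the multiple-genome case additionally requires the match to be unique in every genome.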
