Statistical learning and data mining in biological databases

This thesis explores (i) the feasibility of using communication theory models to understand the protein synthesis process from gene to protein, (ii) the identification of genetic error control mechanisms using error-correcting coding theory, and (iii) the detection of disease-related genetic errors using statistical learning methods on biological databases, namely EST (Expressed Sequence Tag) and SNP (Single Nucleotide Polymorphism) databases. Several statistical methods are proposed and tested on various biological data. These include CUSUM (Cumulative Sum) detection of abrupt changes in a stochastic process, SVD (Singular Value Decomposition) for dimensionality reduction, and HMM-SVM (Hidden Markov Model-Support Vector Machine). We propose a new disease diagnosis system based on Gene Variation Analysis. The system consists of pre-processing, similarity search and clustering by EST analysis, and disease analysis by SNP classification. Pre-processing reduces the overall noise (vector contamination, low-complexity regions, repeats) in EST data to improve the efficacy of subsequent analysis. EST clustering and assembly with the CAP3 sequence assembly program collects overlapping ESTs from the same transcript to reduce redundancy. The assembled ESTs, called consensus EST sequences, are merged based on clone-identification data to obtain the best putative gene representation. Detailed test results on several biological databases are used to draw key conclusions about the proposed mathematical analyses.
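
As an illustration of the CUSUM detection mentioned above, the following is a minimal sketch of a one-sided (upward) CUSUM test for an abrupt increase in the mean of a sequence. The abstract does not specify which CUSUM variant the thesis uses, so the function name `cusum_detect` and the parameters `target_mean`, `slack` and `threshold` are assumptions chosen for illustration only.

```python
def cusum_detect(samples, target_mean, slack, threshold):
    """Return the index of the first alarm, or None if no change is flagged.

    One-sided CUSUM: accumulate deviations above (target_mean + slack),
    reset the sum to zero whenever it goes negative, and raise an alarm
    once the sum exceeds the threshold.
    """
    cumulative = 0.0
    for index, value in enumerate(samples):
        cumulative = max(0.0, cumulative + (value - target_mean - slack))
        if cumulative > threshold:
            return index
    return None


# Hypothetical data: the mean jumps from 0.0 to 2.0 at index 10.
signal = [0.0] * 10 + [2.0] * 10
alarm_index = cusum_detect(signal, target_mean=0.0, slack=0.5, threshold=4.0)
```

In this sketch the slack term keeps small fluctuations from accumulating, and the threshold trades detection delay against false alarms. Both values would need to be tuned to the data at hand.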