Paper information - CSAM: Using Clustering-Hashing-Signal Anchoring Method to Explore Human Novel Genes

The expression of genes in mammalian cells can be constitutive, transient, or inducible. Transcripts of transient and inducible genes are difficult to discover using the EST approach. Transiently expressed genes, however, are crucial to embryo development and the pathogenesis of disease because they determine the outcome of disease. We aimed to identify novel exons in transiently expressed genes using a new bioinformatics approach, which we believe will facilitate verification of novel transcripts in developing embryos or pathogen-induced cells. First, the proposed method uses a general gene predictor that must be able to produce all possible optimal or suboptimal candidate exons in the human genome. After signal processing is applied, an anchoring procedure rapidly transforms and groups the candidate sequences into many clusters of numeric hashing signals. Meanwhile, an entropy-based theorem is used to remove most of the erroneous matches, namely repeat matches. Finally, the method reports the exons identified by cross-species alignment with other genomic or EST sequences. Our results indicated that 3,223 filtered target exons were potential novel exons. The theoretical threshold for filtering repeat matches, determined using the computational method, had 95.3% sensitivity and 81.8% specificity. The inferential threshold was close to the experimental threshold, which is a practical expected value that takes both sensitivity and specificity into account. Our results therefore demonstrate the feasibility of the method. Combining the anchoring method, which embeds an entropy-based filter, with an inherently unreliable gene predictor can yield a small set of potentially novel exons, because the combination avoids many drawbacks of some traditional gene predictors.
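The abstract does not spell out the entropy-based theorem, so the following is only a generic illustration of how an entropy score can flag repeat-like, low-complexity sequences. It is a minimal sketch, not the authors' filter: the function names, the use of Shannon entropy over k-mer frequencies, and the threshold value of 1.5 bits are all assumptions made for this example.

```python
import math
from collections import Counter


def shannon_entropy(seq: str, k: int = 1) -> float:
    """Shannon entropy (in bits) of the k-mer frequency distribution of seq."""
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    if not kmers:
        return 0.0
    counts = Counter(kmers)
    total = len(kmers)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def filter_low_entropy(candidates, threshold=1.5, k=1):
    """Keep candidates whose entropy reaches the threshold.

    The threshold is a hypothetical value chosen for illustration;
    it is not the theoretical or experimental threshold from the paper.
    """
    return [s for s in candidates if shannon_entropy(s.upper(), k) >= threshold]


if __name__ == "__main__":
    exons = ["ACGTACGT", "AAAAAAAT", "atatatat"]
    # Only the first sequence uses all four bases evenly enough to pass.
    print(filter_low_entropy(exons))
```

In this sketch a homopolymer run or a simple dinucleotide repeat scores low and is discarded, while a sequence that uses all four bases evenly scores close to the 2-bit maximum for single nucleotides.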

Yueh-Min Huang | Chun-Min Hung | Ming-Shi Chang
