Meta-analytical biomarker search of EST expression data reveals three differentially expressed candidates

Background: For more than a decade, researchers have sought differentially expressed genes (DEGs) by generating and mining cDNA expressed sequence tags (ESTs). Although public databases make it possible to mine DEGs comprehensively across ESTs from multiple tissue types, existing studies have usually employed statistics suited to only two categories. Multi-class tests have been developed to find tissue-specific genes, but the subsequent search for cancer genes then applies a separate two-category test only to the ESTs of the tissue of interest, which restricts the amount of data used. Simple pooling of cancer and normal genes from multiple tissue types, on the other hand, runs the risk of Simpson's paradox. Here we present a different approach that searches for multi-cancer DEG candidates by analyzing all pertinent ESTs in all categories, then narrows down the cancer biomarker candidates through integrative analysis with microarray data, selection of secretory and membrane protein genes, and network analysis. Finally, the differential expression patterns of three selected cancer biomarker candidates were confirmed by real-time qPCR analysis.

Results: Seven hundred and twenty-three primary DEG candidates (p-value < 0.05 and lower bound of the confidence interval of the odds ratio ≥ 1.65) were selected from a curated EST database using the Cochran-Mantel-Haenszel (CMH) statistic. GeneGO analysis indicated that this set was neoplasm-enriched. Cross-examination with microarray data narrowed the list further to 235 genes, of which 96 had membrane or secretory annotations. After examining the candidates in a protein interaction network, public tissue expression databases, and the literature, we selected three genes for further evaluation by real-time qPCR in eight major normal and cancer tissues. The higher-than-normal expression of COL3A1, DLG3, and RNF43 in some of the cancer tissues agrees with our in silico predictions.

Conclusions: Searching the digitized transcriptome with the CMH statistic enabled us to identify multi-cancer differentially expressed gene candidates. Our methodology demonstrates the simultaneous analysis of EST data for cancer biomarkers across multiple tissue types. With the revived interest in digitizing transcriptomes by NGS, cancer biomarkers could be detected more precisely from the ESTs. The three candidates identified in this study, COL3A1, DLG3, and RNF43, are valuable targets for further evaluation with a larger sample of normal and cancer tissue or serum samples.
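The contrast between naive pooling and a stratified CMH analysis can be sketched in a few lines of Python. This is a minimal illustration, not the authors' pipeline: the EST counts are invented toy numbers, the function names `cmh_summary`, `pooled_odds_ratio`, and `is_candidate` are hypothetical, and the choices of the Robins-Breslow-Greenland variance for the confidence interval and of no continuity correction are assumptions, since the abstract does not state which variants were used. Only the selection thresholds (p-value < 0.05, lower confidence bound ≥ 1.65) come from the text.

```python
import math


def cmh_summary(strata, z=1.96):
    """Mantel-Haenszel odds ratio, CI, and CMH test over 2x2 tables.

    Each stratum (one tissue type) is a tuple (a, b, c, d):
      a = ESTs of the gene in cancer libraries
      b = all other ESTs in cancer libraries
      c = ESTs of the gene in normal libraries
      d = all other ESTs in normal libraries
    Assumes sum(b * c / n) > 0 and a non-degenerate variance.
    """
    sum_r = sum_s = 0.0
    sum_pr = sum_ps_qr = sum_qs = 0.0
    diff = var = 0.0
    for a, b, c, d in strata:
        n = a + b + c + d
        r, s = a * d / n, b * c / n
        p, q = (a + d) / n, (b + c) / n
        sum_r += r
        sum_s += s
        sum_pr += p * r
        sum_ps_qr += p * s + q * r
        sum_qs += q * s
        # Observed minus expected count of cell a, and its hypergeometric variance.
        diff += a - (a + b) * (a + c) / n
        var += (a + b) * (c + d) * (a + c) * (b + d) / (n * n * (n - 1))

    odds_ratio = sum_r / sum_s
    # Robins-Breslow-Greenland standard error of log(OR_MH) (an assumed choice).
    se = math.sqrt(
        sum_pr / (2 * sum_r ** 2)
        + sum_ps_qr / (2 * sum_r * sum_s)
        + sum_qs / (2 * sum_s ** 2)
    )
    log_or = math.log(odds_ratio)
    chi2 = diff ** 2 / var  # CMH statistic, 1 degree of freedom, no correction
    p_value = math.erfc(math.sqrt(chi2 / 2))  # upper tail of chi-square(1)
    return {
        "odds_ratio": odds_ratio,
        "ci_lower": math.exp(log_or - z * se),
        "ci_upper": math.exp(log_or + z * se),
        "chi2": chi2,
        "p_value": p_value,
    }


def pooled_odds_ratio(strata):
    """Naive odds ratio after collapsing all tissues into one 2x2 table."""
    a, b, c, d = (sum(col) for col in zip(*strata))
    return (a * d) / (b * c)


def is_candidate(summary, alpha=0.05, min_lower=1.65):
    """Selection rule quoted in the abstract."""
    return summary["p_value"] < alpha and summary["ci_lower"] >= min_lower


# Invented counts for one gene in two tissues (illustration only).
toy_strata = [
    (20, 80, 150, 850),  # tissue A: small cancer library, gene highly expressed
    (20, 980, 1, 99),    # tissue B: large cancer library, gene weakly expressed
]

summary = cmh_summary(toy_strata)
print("pooled OR:", round(pooled_odds_ratio(toy_strata), 3))
print("Mantel-Haenszel OR:", round(summary["odds_ratio"], 3))
print("passes selection rule:", is_candidate(summary))
```

In these toy counts the gene is over-represented in cancer within each tissue, yet the collapsed table gives an odds ratio below 1 because library sizes and baseline expression differ between tissues. The stratified estimate keeps each tissue's comparison separate, which is the motivation the Background gives for avoiding simple pooling.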
