EBT: a statistic test identifying moderate size of significant features with balanced power and precision for genome‐wide rate comparisons

Motivation: In genome-wide rate comparison studies, it is challenging to identify an appropriate number of significant features objectively: traditional statistical comparisons without multiple-testing correction can generate a large number of false positives, while multiple-testing correction greatly reduces statistical power.

Results: In this study, we propose a new exact test based on translating the rate comparison into two binomial distributions. On both modeled and real datasets, the exact binomial test (EBT) showed an advantage in balancing statistical precision and power, yielding an appropriately sized set of significant features for further study. Both correlation analysis and bootstrapping tests demonstrated that EBT is as robust as the typical rate-comparison methods, e.g. the χ² test, Fisher's exact test and the binomial test. A performance comparison among machine learning models trained on features identified by different statistical tests further demonstrated the advantage of EBT. The new test was also applied to analyze the genome-wide difference in somatic gene mutation rates between lung adenocarcinoma (LUAD) and lung squamous cell carcinoma (LUSC), the two main lung cancer subtypes, and a list of new markers was identified that could be lineage-specifically associated with the carcinogenesis of LUAD and LUSC, respectively. Interestingly, three cilia genes were found to have selectively high mutation rates in LUSC, possibly implying the importance of cilia dysfunction in its carcinogenesis.

Availability and implementation: An R package implementing EBT can be downloaded freely from http://www.szu‐bioinf.org/EBT.

Contact: wangyj@szu.edu.cn

Supplementary information: Supplementary data are available at Bioinformatics online.
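The trade-off described in the Motivation can be illustrated with one of the typical rate-comparison methods named above. The following is a minimal sketch, not the EBT procedure itself (the abstract does not give its formula): it runs a two-sided Fisher's exact test per feature and then applies the Benjamini-Hochberg adjustment. The function names, gene labels and counts are hypothetical and chosen only for illustration.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of seeing x in the top-left cell.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Sum over all tables that are no more probable than the observed one
    # (small relative tolerance guards against floating-point ties).
    total = sum(p for p in (prob(x) for x in range(lo, hi + 1))
                if p <= p_obs * (1 + 1e-9))
    return min(1.0, total)


def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


if __name__ == "__main__":
    # Hypothetical counts: (mutated in group A, group A size,
    #                       mutated in group B, group B size)
    counts = {
        "GENE_1": (30, 200, 10, 180),
        "GENE_2": (12, 200, 11, 180),
        "GENE_3": (5, 200, 20, 180),
    }
    names = list(counts)
    raw = [fisher_exact_two_sided(ma, na - ma, mb, nb - mb)
           for ma, na, mb, nb in (counts[g] for g in names)]
    adj = benjamini_hochberg(raw)
    for gene, p, q in zip(names, raw, adj):
        print(f"{gene}\traw p = {p:.4g}\tBH-adjusted p = {q:.4g}")
```

Comparing how many features pass a threshold on the raw versus the adjusted p-values shows the gap between the two extremes that, according to the abstract, EBT aims to balance.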
