fastJT: An R package for robust and efficient feature selection for machine learning and genome-wide association studies

Background: Parametric feature selection methods for machine learning and association studies based on genetic data are not robust with respect to outliers or influential observations. While rank-based, distribution-free statistics offer a robust alternative to parametric methods, their practical utility can be limited, as they demand significant computational resources when analyzing high-dimensional data. For genetic studies that seek to identify variants, the hypothesis is constrained, since it is typically assumed that the effect of the genotype on the phenotype is monotone (e.g., an additive genetic effect). Similarly, predictors for machine learning applications may have natural ordering constraints. Cross-validation for feature selection in these high-dimensional contexts necessitates highly efficient computational algorithms for the robust evaluation of many features.

Results: We have developed an R extension package, fastJT, for conducting genome-wide association studies and feature selection for machine learning using the Jonckheere-Terpstra statistic for constrained hypotheses. The kernel of the package features an efficient algorithm for calculating the statistics, replacing the pairwise comparison and counting processes with a data sorting and searching procedure, reducing computational complexity from O(n^2) to O(n log(n)). The computational efficiency is demonstrated through extensive benchmarking, and example applications to real data are presented.

Conclusions: fastJT is an open-source R extension package, applying the Jonckheere-Terpstra statistic for robust feature selection for machine learning and association studies. The package implements an efficient algorithm which leverages internal information among the samples to avoid unnecessary computations, and incorporates shared-memory parallel programming to further boost performance on multi-core machines.
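The replacement of pairwise counting with sorting and searching can be illustrated with a minimal Python sketch. This is not the fastJT implementation or its API: the function names `jt_naive` and `jt_sorted` and the toy data are hypothetical, and counting ties as 1/2 is an assumed (though common) convention. For ordered groups, the Jonckheere-Terpstra statistic sums, over every pair of groups, the number of observation pairs in which the value from the later group exceeds the value from the earlier group. The naive version compares every such pair; the sorted version sorts each group once and finds each count by binary search, which for a small fixed number of groups (for example three genotype classes) costs O(n log(n)).

```python
from bisect import bisect_left, bisect_right
from itertools import combinations


def jt_naive(groups):
    """Reference version: compare every cross-group pair directly, O(n^2)."""
    total = 0.0
    # combinations() yields (earlier group, later group) in the given order.
    for lo, hi in combinations(groups, 2):
        for x in lo:
            for y in hi:
                if x < y:
                    total += 1.0
                elif x == y:
                    total += 0.5  # assumed tie convention: half a count
    return total


def jt_sorted(groups):
    """Same statistic via sorting and binary search instead of pairwise loops."""
    sorted_groups = [sorted(g) for g in groups]  # one sort per group
    total = 0.0
    for i, j in combinations(range(len(groups)), 2):
        hi = sorted_groups[j]
        for x in groups[i]:
            left = bisect_left(hi, x)    # number of values in hi below x
            right = bisect_right(hi, x)  # number of values in hi at or below x
            total += (len(hi) - right) + 0.5 * (right - left)
    return total


# Hypothetical phenotype values grouped by an ordered factor (e.g. genotype 0, 1, 2).
example_groups = [
    [1.2, 0.7, 1.9],
    [1.5, 2.4, 1.9],
    [2.8, 2.4, 3.1],
]
```

On `example_groups` both functions return 25.0 (7.5 from the first pair of groups, 9 from the second, and 8.5 from the third), so the naive version can serve as a correctness check for the faster one on small inputs.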
