Prediction using step-wise L1, L2 regularization and feature selection for small data sets with large number of features

Background
Machine learning methods are nowadays used for many biological prediction problems involving drugs, ligands, or polypeptide segments of a protein. To build a prediction model, a so-called training data set of molecules with measured target properties is needed. For many such problems the size of the training data set is limited, since measurements must be performed in a wet lab. Furthermore, the problems considered are often complex, so it is not clear which molecular descriptors (features) are suitable to establish a strong correlation with the target property. In many applications all available descriptors are used. This can lead to difficult machine learning problems when thousands of descriptors are considered but only a few (e.g., fewer than a hundred) molecules are available for training.

Results
The CoEPrA contest provides four data sets that are typical of biological regression problems: few molecules in the training data set and thousands of descriptors. We applied the same two-step training procedure to all four regression tasks. In the first step, we used optimized L1 regularization to select the most relevant features, reducing the initial set of more than 6,000 features to about 50. In the second step, we used only the features selected in the first step and applied a milder L2 regularization, which generally further improved prediction performance. Our linear model employed a soft loss function that minimizes the influence of outliers.

Conclusions
The proposed two-step method performed well on all four CoEPrA regression tasks. It may therefore be useful for many other biological prediction problems where only a small number of molecules, described by thousands of descriptors, are available for training.
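The following is a minimal sketch of such a two-step L1/L2 procedure, not the authors' exact implementation. It uses synthetic data and stand-ins from scikit-learn: LassoCV approximates the "optimized L1 regularization" for feature selection, and HuberRegressor (Huber loss with an L2 penalty) approximates the milder L2-regularized fit with an outlier-tolerant soft loss. The data shapes, regularization strengths, and loss choice are illustrative assumptions.

```python
"""Sketch of a two-step L1/L2 regularization pipeline (assumptions noted).

Step 1: L1 (lasso) regression with a cross-validated penalty selects a
small subset of the descriptors.
Step 2: a milder L2-regularized robust regression is refit on that subset;
Huber loss stands in for the paper's outlier-tolerant soft loss.
"""
import numpy as np
from sklearn.linear_model import HuberRegressor, LassoCV
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for a CoEPrA-like task: few molecules, many descriptors.
rng = np.random.default_rng(0)
n_samples, n_features = 80, 6000
X = rng.standard_normal((n_samples, n_features))
true_w = np.zeros(n_features)
true_w[:20] = rng.standard_normal(20)          # only 20 informative descriptors
y = X @ true_w + 0.1 * rng.standard_normal(n_samples)

X = StandardScaler().fit_transform(X)          # put descriptors on a common scale

# Step 1: L1 regularization, strength chosen by cross-validation;
# the nonzero coefficients define the selected feature subset.
lasso = LassoCV(cv=5).fit(X, y)
selected = np.flatnonzero(lasso.coef_)
print(f"selected {selected.size} of {n_features} features")

# Step 2: milder L2 regularization on the selected features only,
# with a Huber-type soft loss to limit the influence of outliers.
model = HuberRegressor(alpha=1e-3, max_iter=1000).fit(X[:, selected], y)
print("training R^2:", model.score(X[:, selected], y))
```

Refitting on the reduced feature set is the point of the second step: with only tens of surviving descriptors, a weak L2 penalty suffices to stabilize the coefficients without the heavy shrinkage the L1 stage needed to cope with thousands of candidates.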
