An efficient distributed protein disorder prediction with pasted samples

Abstract In this paper, we compare the prediction performance of a machine learning classifier constructed at once in memory with that of an ensemble of models constructed with the pasting procedure, applied to protein disorder prediction. The pasting procedure takes small samples ("bites") of the training data as input, constructs a classification predictor on each bite, and pastes the predictors together. This method has not previously been tested on protein structure data. With a sufficiently large sample size, we observed increased performance for the pasting procedure compared with a single model constructed at once in memory, for all window sizes. We attribute this increased performance to the robustness of the statistical query learning model. This procedure provides a means both to improve classification performance on the protein disorder prediction task and to construct models too large to be held at once in memory.
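The pasting procedure described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes that bites are drawn uniformly without replacement and that the predictors are "pasted together" by unweighted majority vote; the abstract specifies neither. The base learner (a nearest-centroid classifier), the function names, and the toy data are all hypothetical and chosen only to keep the example self-contained.

```python
import random
from collections import Counter


def train_centroid(rows):
    """Hypothetical base learner: store the mean feature vector of each class."""
    sums, counts = {}, Counter()
    for features, label in rows:
        acc = sums.setdefault(label, [0.0] * len(features))
        for i, value in enumerate(features):
            acc[i] += value
        counts[label] += 1
    return {label: [s / counts[label] for s in acc] for label, acc in sums.items()}


def predict_centroid(model, features):
    """Return the label whose centroid is closest in squared Euclidean distance."""
    def sq_dist(centroid):
        return sum((a - b) ** 2 for a, b in zip(features, centroid))
    return min(model, key=lambda label: sq_dist(model[label]))


def paste(rows, n_bites, bite_size, seed=0):
    """Train one model per bite; each bite is sampled without replacement (assumed)."""
    rng = random.Random(seed)
    return [train_centroid(rng.sample(rows, bite_size)) for _ in range(n_bites)]


def predict_pasted(models, features):
    """Combine the per-bite predictors by unweighted majority vote (assumed)."""
    votes = Counter(predict_centroid(m, features) for m in models)
    return votes.most_common(1)[0][0]


# Toy data standing in for windowed residue features:
# label 0 = ordered, label 1 = disordered (illustrative only).
ordered = [([0.1 * (i % 5), 0.1 * (i // 5 % 5)], 0) for i in range(100)]
disordered = [([3.0 + 0.1 * (i % 5), 3.0 + 0.1 * (i // 5 % 5)], 1) for i in range(100)]
rows = ordered + disordered

models = paste(rows, n_bites=5, bite_size=40)
print(predict_pasted(models, [0.2, 0.2]))
print(predict_pasted(models, [3.2, 3.2]))
```

Because each model sees only `bite_size` rows, only one bite has to be resident in memory while its model is trained, and the bites can be processed independently on separate workers. That is the property the abstract relies on for distributed training.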
