An effective, practical and low computational cost framework for the integration of heterogeneous data to predict functional associations between proteins by means of Artificial Neural Networks

Uncovering new functional relationships between proteins is one of the major goals of biological studies. For this task, integrating evidence from heterogeneous data sources by means of machine learning has proven to be an effective way of building a complete genome-wide functional network and of inferring new functional associations more accurately. This work presents a new framework for use with Artificial Neural Networks (ANNs) to predict functional relationships between proteins by integrating evidence from heterogeneous data sources. The methodology is motivated by a problem that arises when ANNs are applied to this kind of task: the computational cost of the ANN optimization process, which stems from the nature of the data (a large number of instances and high dimensionality). The method selects smaller, representative (non-random) subsets of the data set used for ANN optimization, which reduces the amount of training data and, consequently, the computational cost. Moreover, because the subsets are not only smaller but also representative of the original data set, the method (i) removes the need to repeat the optimization process several times with different random subsets of data, a practice commonly used to obtain a reliable and fair evaluation of an ANN's prediction accuracy, and (ii) benefits the learning procedure by reducing overfitting, thereby improving prediction ability.
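
The abstract does not specify how the representative subsets are built, so the following is only a minimal sketch of the general idea, under assumptions of our own. The function name `representative_subset`, the class-stratified split, and the choice of ordering rows by distance to the class centroid are all hypothetical stand-ins, not the authors' procedure. The sketch shows one way a subset can be both smaller and deterministic, so that repeated runs yield the same training data without random resampling.

```python
from collections import defaultdict
from math import dist


def representative_subset(features, labels, fraction):
    """Return sorted row indices of a deterministic, class-stratified subset.

    Illustrative assumption, not the paper's algorithm: within each class,
    rows are ordered by distance to the class centroid and every k-th row
    is kept, so the subset spans the class from its core to its outliers
    without relying on a random draw.
    """
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")

    # Group row indices by class label to preserve the class proportions.
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    chosen = []
    for label in sorted(by_class):
        idxs = by_class[label]
        dim = len(features[idxs[0]])
        # Centroid of the class in feature space.
        centroid = [sum(features[i][d] for i in idxs) / len(idxs)
                    for d in range(dim)]
        # Deterministic order: distance to centroid, ties broken by index.
        ordered = sorted(idxs, key=lambda i: (dist(features[i], centroid), i))
        # Keep evenly spaced rows along that order (at least one per class).
        keep = max(1, round(len(ordered) * fraction))
        step = len(ordered) / keep
        chosen.extend(ordered[int(j * step)] for j in range(keep))
    return sorted(chosen)
```

Calling the function twice on the same data returns the same indices, which is the property that would make repeated random resampling unnecessary. Whether a subset chosen this way is representative enough for a given data set would still need to be checked empirically.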
