Metamorphic Testing for Quality Assurance of Protein Function Prediction Tools

Proteins are the workhorses of life, and gaining insight into their functions is of paramount importance for applications such as drug design. However, experimental validation of protein functions is highly resource-intensive. Automated protein function prediction (AFP) using machine learning has therefore gained significant interest in recent years. Many AFP tools are based on supervised learning models trained on existing gold-standard functional annotations, which are known to be incomplete. The main challenge in systematically testing AFP software is the lack of a test oracle, the mechanism that determines whether a test case passes or fails; because the gold-standard data are incomplete, the exact expected outcomes are not well defined for the AFP task. AFP tools thus face the oracle problem. In this work, we use metamorphic testing (MT) to test nine state-of-the-art AFP tools by defining a set of metamorphic relations (MRs) that apply input transformations to protein sequences. Our results show that several AFP tools fail all the test cases, raising concerns about the quality of their predictions.
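To make the idea of a metamorphic relation concrete, the following is a minimal Python sketch and not the procedure used in this work. It assumes a hypothetical predictor interface (a function from a mapping of protein IDs to sequences onto a mapping of protein IDs to sets of predicted terms) and one illustrative MR: reordering the input proteins should leave each protein's predictions unchanged. The names `toy_predictor`, `order_sensitive_predictor`, `check_order_invariance`, and the term labels are all invented for illustration.

```python
from typing import Callable, Dict, List, Set

# Hypothetical interface: {protein_id: sequence} -> {protein_id: predicted terms}.
Predictor = Callable[[Dict[str, str]], Dict[str, Set[str]]]


def toy_predictor(proteins: Dict[str, str]) -> Dict[str, Set[str]]:
    """Stand-in for an AFP tool: assigns placeholder terms from simple motifs."""
    motif_to_term = {"KK": "TERM_A", "CC": "TERM_B"}  # placeholder labels, not real annotations
    return {
        pid: {term for motif, term in motif_to_term.items() if motif in seq}
        for pid, seq in proteins.items()
    }


def order_sensitive_predictor(proteins: Dict[str, str]) -> Dict[str, Set[str]]:
    """Same toy tool with a deliberately injected fault: the first input gets no terms."""
    output = toy_predictor(proteins)
    first_id = next(iter(proteins))
    output[first_id] = set()
    return output


def check_order_invariance(predict: Predictor, proteins: Dict[str, str]) -> List[str]:
    """Illustrative MR: reversing the input order must not change any protein's terms.

    Returns the IDs of proteins whose predictions differ between the source
    input and the follow-up (reordered) input. No expected output is needed.
    """
    source_output = predict(proteins)
    follow_up_input = dict(reversed(list(proteins.items())))
    follow_up_output = predict(follow_up_input)
    return [pid for pid in proteins if source_output[pid] != follow_up_output[pid]]


if __name__ == "__main__":
    proteins = {"P1": "MKKA", "P2": "MCCA"}
    print(check_order_invariance(toy_predictor, proteins))
    print(check_order_invariance(order_sensitive_predictor, proteins))
```

The check never consults a gold-standard annotation: it only compares the tool's outputs on a source input and a transformed follow-up input, which is how MT sidesteps the oracle problem.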
