Artificial intelligence in genomic sequence, protein structure function prediction and DNA microarrays: a survey

Bioinformatics conceptualises biology in terms of molecules (in the sense of physical chemistry) and applies 'informatics techniques' (derived from disciplines such as applied mathematics, computer science and statistics) to understand and organise the information associated with these molecules on a large scale. In short, bioinformatics is a management information system for molecular biology and has many practical applications. Many artificial intelligence (AI) methods have been employed in the field of bioinformatics. In this paper, we introduce the application of AI methods in three main fields: genomic sequences, protein structure and function prediction, and DNA microarrays. The AI methods surveyed include artificial neural networks, support vector machines (SVMs), ensemble learning, hidden Markov models, and other conventional methods such as rough sets, decision trees and K-nearest neighbour.
