Knowledge discovery in medical and biological datasets using a hybrid Bayes classifier/evolutionary algorithm

A key element of bioinformatics research is the extraction of meaningful information from large experimental data sets. Various approaches, including statistical and graph-theoretical methods, data mining, and computational pattern recognition, have been applied to this task with varying degrees of success. We present a hybrid algorithm that uses a novel classifier based on the Bayes discriminant function, together with feature selection and extraction, to isolate salient features from large medical and other biological data sets. We have previously shown that a genetic algorithm coupled with a k-nearest-neighbors classifier performs well in extracting information about protein-water binding from X-ray crystallographic protein structure data. Here we demonstrate that the hybrid EC-Bayes classifier can identify the most statistically relevant features of this data set and weight them appropriately to aid in the prediction of solvation sites.
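The general idea of pairing an evolutionary search with a Bayes discriminant can be illustrated with a minimal sketch. It is not the classifier described in this paper: the abstract does not specify the density estimates, the weighting scheme, or the genetic operators, so everything below is an assumption made for illustration. The sketch assumes Gaussian class-conditional densities with independent features, restricts the feature weights to a binary mask (plain feature selection), and uses a simple elitist genetic algorithm. All function names, parameters, and the synthetic data are hypothetical.

```python
import math
import random

def fit_gaussians(X, y):
    """Estimate the class prior and per-feature mean/variance for each class."""
    model = {}
    for c in set(y):
        rows = [x for x, label in zip(X, y) if label == c]
        n = len(rows)
        means = [sum(col) / n for col in zip(*rows)]
        variances = [sum((v - m) ** 2 for v in col) / n + 1e-6
                     for col, m in zip(zip(*rows), means)]
        model[c] = (n / len(X), means, variances)
    return model

def discriminant(x, params, mask):
    """Bayes discriminant g_c(x) = log P(c) + sum of log p(x_j | c) over kept features."""
    prior, means, variances = params
    g = math.log(prior)
    for xj, m, v, keep in zip(x, means, variances, mask):
        if keep:
            g += -0.5 * math.log(2 * math.pi * v) - (xj - m) ** 2 / (2 * v)
    return g

def accuracy(model, X, y, mask):
    """Fraction of samples assigned to the class with the largest discriminant."""
    hits = sum(max(model, key=lambda c: discriminant(x, model[c], mask)) == label
               for x, label in zip(X, y))
    return hits / len(X)

def evolve_mask(X_train, y_train, X_val, y_val,
                pop_size=20, generations=30, penalty=0.01, seed=0):
    """Elitist GA over binary feature masks (assumed operators, not the paper's)."""
    rng = random.Random(seed)
    n_features = len(X_train[0])
    model = fit_gaussians(X_train, y_train)

    def fitness(mask):
        # Validation accuracy minus a small cost per selected feature.
        return accuracy(model, X_val, y_val, mask) - penalty * sum(mask)

    population = [[rng.randint(0, 1) for _ in range(n_features)]
                  for _ in range(pop_size)]
    for _ in range(generations):
        population.sort(key=fitness, reverse=True)
        parents = population[: pop_size // 2]      # top half survives unchanged
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            cut = rng.randrange(1, n_features)     # one-point crossover
            child = a[:cut] + b[cut:]
            child = [bit ^ (rng.random() < 0.05) for bit in child]  # bit-flip mutation
            children.append(child)
        population = parents + children
    return max(population, key=fitness)

def make_toy_data(n, n_noise, rng):
    """Synthetic two-class data: feature 0 is informative, the rest are noise."""
    X, y = [], []
    for i in range(n):
        label = i % 2
        informative = rng.gauss(6.0 * label, 1.0)
        X.append([informative] + [rng.gauss(0.0, 1.0) for _ in range(n_noise)])
        y.append(label)
    return X, y

rng = random.Random(1)
X_train, y_train = make_toy_data(200, 5, rng)
X_val, y_val = make_toy_data(200, 5, rng)
best = evolve_mask(X_train, y_train, X_val, y_val)
print("selected feature mask:", best)
```

Replacing the binary mask with real-valued weights that scale each feature's log-likelihood term would turn this selection sketch into a form of feature weighting, which is closer in spirit to the weighting mentioned above, though still only an approximation of it.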
