A weighted nearest neighbor algorithm for learning with symbolic features

Nearest neighbor algorithms for learning from examples have historically worked best in domains where all features have numeric values. In such domains, examples can be treated as points and distance metrics can use standard definitions. Symbolic domains require a more sophisticated treatment of the feature space. We introduce a nearest neighbor algorithm for learning in domains with symbolic features. Our algorithm calculates distance tables that allow it to produce real-valued distances between instances, and attaches weights to the instances to further modify the structure of the feature space. We show that this technique produces excellent classification accuracy on three problems that have been studied by machine learning researchers: predicting protein secondary structure, identifying DNA promoter sequences, and pronouncing English text. Direct experimental comparisons with other learning algorithms show that our nearest neighbor algorithm is comparable or superior in all three domains. In addition, our algorithm has advantages in training speed, simplicity, and perspicuity. We conclude that the experimental evidence favors the use and continued development of nearest neighbor algorithms for domains such as the ones studied here.
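The abstract does not give the formulas behind the distance tables or the instance weights, so the following is only a minimal sketch under stated assumptions. It assumes a value-difference style table in which two symbolic values are close when they co-occur with the classes in similar proportions, and it assumes that instance weights scale the summed per-feature distances multiplicatively. The function names, the exponents `k` and `r`, and the fallback distance for unseen values are all hypothetical choices made for illustration.

```python
from collections import Counter, defaultdict


def build_distance_tables(examples, labels, k=1):
    """Build one table per feature mapping (value1, value2) -> real-valued distance.

    Assumption: the distance between two symbolic values is the sum over
    classes of |P(class | value1) - P(class | value2)| ** k.
    """
    classes = sorted(set(labels))
    tables = []
    for f in range(len(examples[0])):
        counts = defaultdict(Counter)  # value -> class -> count
        for x, y in zip(examples, labels):
            counts[x[f]][y] += 1
        table = {}
        for v1, c1 in counts.items():
            n1 = sum(c1.values())
            for v2, c2 in counts.items():
                n2 = sum(c2.values())
                table[(v1, v2)] = sum(
                    abs(c1[c] / n1 - c2[c] / n2) ** k for c in classes
                )
        tables.append(table)
    return tables


def distance(x, y, tables, wx=1.0, wy=1.0, r=2):
    """Weighted distance between two instances (assumed form, not the paper's)."""
    total = 0.0
    for f, table in enumerate(tables):
        # Hypothetical fallback for values never seen in training:
        # 0 if identical, otherwise the largest possible table value for k=1.
        fallback = 0.0 if x[f] == y[f] else 2.0
        total += table.get((x[f], y[f]), fallback) ** r
    return wx * wy * total


def classify(query, examples, labels, tables, weights=None):
    """Return the label of the stored instance nearest to the query."""
    if weights is None:
        weights = [1.0] * len(examples)  # unweighted nearest neighbor
    best = min(
        range(len(examples)),
        key=lambda i: distance(query, examples[i], tables, wy=weights[i]),
    )
    return labels[best]


if __name__ == "__main__":
    train = [("a", "x"), ("a", "y"), ("b", "x"), ("b", "y")]
    train_labels = ["pos", "pos", "neg", "neg"]
    tables = build_distance_tables(train, train_labels)
    print(tables[0][("a", "b")], tables[1][("x", "y")])
    print(classify(("a", "y"), train, train_labels, tables))
```

In this toy data the first feature fully determines the class, so its values end up far apart in the table, while the second feature is uninformative and its values end up at distance zero. Raising the weight of a stored instance makes it look farther from every query, which is one plausible way weights could reshape the feature space.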
