LEARNING AND REASONING AS INFORMATION COMPRESSION BY MULTIPLE ALIGNMENT, UNIFICATION AND SEARCH

1 INTRODUCTION

This article presents the tentative idea that 'multiple alignment', in a sense close to the use of that term in bio-informatics, together with the full or partial merging or 'unification'¹ of patterns and a process of 'search', provides a framework within which learning and reasoning may be integrated. This thinking is part of a programme of research which aims to develop the 'SP' conjecture ('computing as compression'), that all kinds of computing and formal reasoning may usefully be understood as information compression by pattern matching, unification and search (PMUS), and to develop a 'new generation' computing system based on the theory [19, 20, 21, 24].

Learning and reasoning are both large subjects. In the space of one short article it is not possible to do more than present a few examples to suggest how they may be seen in terms of multiple alignment, unification and search. Relevant issues will be discussed more fully elsewhere.

Research on 'inductive logic programming' (ILP) is also concerned with learning and reasoning, but its focus differs from that of the SP programme. In ILP (see, for example, [12]) the emphasis is on (supervised) learning within a framework of logic, whereas the SP programme seeks to integrate (unsupervised) learning and reasoning (and other aspects of computing) within a more general framework.

In this article, the term pattern is used to mean any sequence of atomic symbols, including a subsequence within a larger sequence where the symbols of the subsequence are not necessarily contiguous within the larger sequence.

¹ The term unification is used in this article to mean a simple merging of multiple instances of any pattern to make one. This idea is related to, but simpler than, the concept of 'unification' as it is used in logic.
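The two definitions used in this article, a pattern as a possibly non-contiguous subsequence and unification as the merging of multiple instances of a pattern into one, can be illustrated with a minimal Python sketch. The function names match_positions and unify are hypothetical and are not taken from the SP programme. The sketch covers only the full merging of identical instances, not the partial merging which the article also mentions.

```python
from typing import List, Optional, Sequence, Tuple


def match_positions(pattern: Sequence[str], sequence: Sequence[str]) -> Optional[List[int]]:
    """Return the leftmost positions at which the symbols of `pattern`
    occur, in order, within `sequence`, or None if there is no such match.
    The matched symbols need not be contiguous in `sequence`."""
    symbols = list(sequence)
    positions: List[int] = []
    start = 0
    for symbol in pattern:
        try:
            index = symbols.index(symbol, start)
        except ValueError:
            return None
        positions.append(index)
        start = index + 1
    return positions


def unify(instances: Sequence[Sequence[str]]) -> Tuple[str, ...]:
    """Merge multiple identical instances of a pattern into a single copy.
    Assumption: only full merging of identical instances is handled."""
    first = tuple(instances[0])
    if any(tuple(instance) != first for instance in instances[1:]):
        raise ValueError("this sketch merges only identical instances")
    return first


# 'ACT' is a pattern within 'GAXCYT' even though its symbols are not adjacent.
print(match_positions("ACT", "GAXCYT"))
# Three stored copies of the same pattern are reduced to one.
print(unify(["ACT", "ACT", "ACT"]))
```

Storing one copy in place of several is the sense in which unification of this kind compresses information.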
2 MULTIPLE ALIGNMENT PROBLEMS

The term 'multiple alignment' is normally associated with the computational analysis of DNA sequences or sequences of amino acid residues as part of the process of elucidating the structure, functions or evolution of the corresponding molecules. The general idea is to examine two or more sequences to find one or more alignments of matching bases or amino acid residues which are, in some sense, optimal. An example is presented in Fig. 1.

Fig. 1. An alignment amongst five DNA sequences (adapted from Fig. 6 in [13], with permission from Oxford University Press).

Intuitively, an 'optimal' or 'good' alignment amongst two or more sequences …
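The general idea of finding an alignment of matching symbols can be sketched for the simplest case of two sequences. The sketch below is in Python and makes two assumptions which are not drawn from this article: 'optimal' is taken to mean the largest number of matched symbols, and the search is done with the standard longest-common-subsequence dynamic programme. The name align_pair is hypothetical, and alignment of more than two sequences is not covered.

```python
from typing import List, Tuple


def align_pair(a: str, b: str) -> List[Tuple[int, int]]:
    """Return index pairs (i, j) with a[i] == b[j], increasing in both
    i and j, that together form a longest common subsequence."""
    n, m = len(a), len(b)
    # table[i][j] holds the length of the best alignment of a[i:] with b[j:].
    table = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n - 1, -1, -1):
        for j in range(m - 1, -1, -1):
            if a[i] == b[j]:
                table[i][j] = table[i + 1][j + 1] + 1
            else:
                table[i][j] = max(table[i + 1][j], table[i][j + 1])

    # Walk forward through the table to recover one best alignment.
    pairs: List[Tuple[int, int]] = []
    i = j = 0
    while i < n and j < m:
        if a[i] == b[j]:
            pairs.append((i, j))
            i += 1
            j += 1
        elif table[i + 1][j] >= table[i][j + 1]:
            i += 1
        else:
            j += 1
    return pairs


# Each pair links a symbol in the first sequence to its match in the second.
print(align_pair("ACGT", "AGT"))
```

Other measures of what makes an alignment 'good' would change the scoring in the table but not the overall shape of the search.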

[1] J. Wolff. Towards a theory of cognition and computing, 1991.

[2] E. Mark Gold, et al. Language Identification in the Limit, 1967, Inf. Control.

[3] W. H. Day, et al. Critical comparison of consensus methods for molecular sequences, 1992, Nucleic acids research.

[4] L. Allison, et al. Minimum message length encoding and the comparison of macromolecules, 1990, Bulletin of Mathematical Biology.

[5] W. R. Taylor, et al. Pattern matching methods in protein sequence comparison and structure prediction, 1988, Protein engineering.

[6] A. K. Wong, et al. A survey of multiple sequence comparison methods, 1992, Bulletin of Mathematical Biology.

[7] J. Gerard Wolff, et al. A scaleable technique for best-match retrieval of sequential information using metrics-guided search, 1994, J. Inf. Sci.

[8] J. Wolff, et al. Language Acquisition and the Discovery of Phrase Structure, 1980, Language and speech.

[9] Mikhail A. Roytberg. A search for common patterns in many sequences, 1992, Comput. Appl. Biosci.

[10] J. Gerard Wolff, et al. COMPUTING AS COMPRESSION BY MULTIPLE ALIGNMENT, UNIFICATION AND SEARCH (2), 1995.

[11] D. K. Y. Chiu, et al. A survey of multiple sequence comparison methods, 1992.

[12] J. Wolff. Learning Syntax and Meanings Through Optimization and Distributional Analysis, 1988.

[13] J. Gerard Wolff. Computing and Information Compression: A Reply, 1994, AI Commun.

[14] Jörg-Uwe Kietz, et al. An Efficient Subsumption Algorithm for Inductive Logic Programming, 1994, ICML.

[15] G. Barton. Protein multiple sequence alignment and flexible pattern matching, 1990, Methods in enzymology.