Taking Turing Seriously (but Not Literally)

Recent results from present-day instantiations of the Turing test, most notably the annual Loebner Prize competition, have fueled the perception that the test has either already been passed or soon will be. With this perception comes the implication that computers are on the verge of achieving human-level cognition. We argue that this perspective on the Turing test is flawed because it lacks an essential component: that of unbounded creativity on the part of the questioner, which in turn induces a cooperative responsibility on the part of the respondent. We discuss how the decades-spanning program of activity at Indiana University’s Center for Research on Concepts and Cognition represents one (perhaps unique) approach to taking Turing seriously in this respect.

1 | Introduction: The Turing Test in Letter and Spirit

If Turing were alive today, what would he think about the Turing test? Would he still consider his “imitation game” an effective means of gauging intelligence, given what we now know about the Eliza effect, chatbots, and the increasing vacuity of interpersonal communication in the age of texting and instant messaging? Alas, one can only speculate, but we suspect he would find the prevailing interpretation of his classic “Computing Machinery and Intelligence” (1950) to be disappointingly literal-minded. Current instantiations of his eponymous test, most notably the annual Loebner Prize competition, adhere closely to the outward form—or letter—of the imitation game he proposed. However, it is questionable how faithful such competitions are to the underlying purpose—or spirit—of the game. That purpose, lest we forget, is to assess whether a given program demonstrates intelligence [1]—and, if so, to what extent. But this purpose gets obscured when the emphasis turns from modeling intelligence to simply “beating the test”. The question, then, is whether we might better capture the spirit of Turing’s test through other, less literal-minded means.
Our answer is not only that we can, but that we must, at least if the Turing test is to remain a useful standard-bearer for assessing progress in AI. The alternative is to risk trivializing the Turing test by equating “intelligence” with the ability to mimic the sort of context-neutral conversation that has increasingly come to pass for “communication.” This perspective favors methodologies such as statistical machine learning that, we claim, are better suited to modeling human banality than human intelligence. Indeed, it is our belief that such methodologies will ultimately fail even at this more humble goal. Our essential claim could be summarized as follows: “Unless we can work out how to build a genius, we won’t even be able to build an idiot.” The motivating perspective for this statement is that cognitive mechanisms such as metacognition and domain-assisted perception are vital for generating the utterances of both the insightful and the ignorant, but the nature of these mechanisms is thrown into sharp relief by the subtle and economical insights afforded to “genius.”

2 | Intelligence, Trickery, and the Loebner Prize

Does the ability to deceive others presuppose intelligence, or merely a knack for deception? In proposing his imitation game, Turing wagered that the two were inseparable. However, as Shieber (1994) observes, “[I]t has been known since Weizenbaum’s surprising experiences with ELIZA that a test based on fooling people is confoundingly simple to pass” (p. 72; cf. Weizenbaum 1976). The gist of Weizenbaum’s realization is that our interactions with computer programs often tell us less about the inner workings of the programs themselves than they do about our tendency to project meaning and intention onto artifacts, even when we should know better.

A Parallel Case: Art Forgery

For another perspective on the distinction between genuine accomplishment and mere trickery, let us consider the parallel case of art forgery.
Is it possible to draw the distinction between a genuine artist, on the one hand, and a mere faker, on the other? It is tempting to reply that in order to be a good faker—one good enough to “fool the experts”—one must necessarily be a good artist to begin with. But this sort of argument is too simplistic, as it equates artistic quality with technical skill and prowess while devaluing the role of originality, artistic vision, and other qualities that we typically associate with genuine artistry (cf. Lessing 1965; Dutton 1979). In particular, the ability of a skilled art forger to create a series of works in the style of, say, Matisse does not imply insight into the underlying artistic or expressive vision of Matisse. In other words, “There is all the difference in the world between a painting that genuinely reveals qualities of mind to us and one which blindly apes their outward show” (Kieran 2005, p. 21). Russell’s famous quote (above) about specification equating to theft helps us relate an AI methodology to the artistry–forgery distinction. Russell’s statement can be paraphrased as follows: merely saying that there exists a function (e.g., sqrt()) with some property (e.g., sqrt(x)*sqrt(x) = x for all x >= 0) does not tell you very much about how to generate the actual sqrt() function. Similarly, the ability to reproduce a small number of values of x that meet this specification does not imply insight into the underlying mechanisms of which the existence of these specific values is essentially a side effect. A key issue here is the small number of values: since contemporary versions of the Turing test are generally highly time-constrained, it is all the more imperative that the test involve a deep probe into the possible behaviors of the respondent.

[1] Note that we are content to restrict our concern to a test for humanocentric intelligence; see French (1990) for a discussion of this issue.
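The sqrt() point can be made concrete with a short sketch. The following Python fragment (our own illustration; the function names and the particular memorized values are invented for the example) contrasts a lookup table that satisfies the specification sqrt(x)*sqrt(x) = x only on a handful of memorized points with Newton's method, a mechanism that generates conforming values for any nonnegative x.

```python
# A "forger's" sqrt: it memorizes a handful of (x, sqrt(x)) pairs and
# satisfies the specification sqrt(x)*sqrt(x) == x on exactly those
# points, and nowhere else.
MEMORIZED = {0.0: 0.0, 1.0: 1.0, 4.0: 2.0, 9.0: 3.0}

def sqrt_lookup(x):
    return MEMORIZED[x]  # raises KeyError for any unmemorized input

# A mechanism: Newton's iteration, which generates values meeting the
# specification for any x >= 0, because it embodies the underlying
# relationship rather than a finite set of its side effects.
def sqrt_newton(x, tol=1e-12):
    if x == 0.0:
        return 0.0
    guess = x
    while abs(guess * guess - x) > tol * max(x, 1.0):
        guess = (guess + x / guess) / 2.0
    return guess
```

A time-constrained probe that only happens to hit memorized inputs cannot tell the two apart; a deeper probe immediately can.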
Many of the Loebner Prize entrants have adopted the methodologies of corpus linguistics and machine learning (ML), so we will re-frame the issue of thematic variability in these terms. We might abstractly consider the statistical ML approach to the Turing test as being concerned with the induction of a generative grammar. The ability to induce an algorithm that reproduces some themed collection of original works does not in itself imply that any underlying sensibilities that motivated those works can be effectively approximated by that algorithm. One way of measuring the “work capacity” of an algorithm is to employ the Kolmogorov complexity measure (Solomonoff 1964), which is essentially the size of the shortest possible functionally identical algorithm. In the induction case, algorithms with the lowest Kolmogorov complexity will tend to be those that exhibit very little variability—in the limiting case, generating only instances from the original collection. (This would be analogous to a forger who could only produce exact copies of another artist’s works, rather than works “in the style of” said artist.)

In contrast, programs from the Fluid Analogies family of architectures possess domain-specific relational and generative models. For example, the Letter Spirit architecture (Rehling 2001) is specifically concerned with exploring the thematic variability of a given font style. Given Letter Spirit’s sophisticated representation of the “basis elements” and “recombination mechanisms” of form, it might reasonably be expected to have high Kolmogorov complexity. The thematic variations generated by Letter Spirit are therefore not easily approximated by domain-agnostic data-mining approaches. The artistry–forgery distinction is useful in so far as it offers another perspective on the issue of depth versus shallowness—an issue that is crucial in any analysis of the Turing test.
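The contrast between a minimal-complexity inducer and an architecture built on explicit basis elements and recombination mechanisms can be caricatured in a few lines of Python. This is a toy of our own devising, not a sketch of Letter Spirit; the corpus, stems, and suffixes are invented for the illustration.

```python
import itertools

# Toy "forger": its entire generative repertoire is the training corpus.
# It is maximally compressible and exhibits zero variability.
CORPUS = ["abc", "abd", "abe"]

def replay_generator():
    yield from CORPUS  # nothing beyond exact copies of the originals

# Toy recombining generator (hypothetical): explicit basis elements plus
# a recombination rule, so it can produce themed variations that never
# appeared in the corpus, at the cost of a larger description.
STEMS = ["ab", "xy"]
SUFFIXES = ["c", "d", "e"]

def recombining_generator():
    for stem, suffix in itertools.product(STEMS, SUFFIXES):
        yield stem + suffix
```

The replay generator reproduces the collection perfectly but can never surprise us; the recombining generator covers the corpus and also emits novel items such as "xyd" that are "in the style of" the originals.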
Likewise, just as the skilled art forger is adept at using trickery to simulate authenticity—for example, by artificially “aging” a painting through various techniques such as baking or varnishing (Werness 1983)—similar forms of trickery often find their way into the Loebner Prize competition: timely pop-culture references, intentional “typos” and misspellings, and so on (cf. Shieber 1994; Christian 2011). Yet these surface-level tricks have as much to do with the modeling of intelligence as coating the surface of a painting with antique varnish has to do with bona fide artistry. This essentially adversarial approach is a driving force in the divergence of current instantiations of the Turing test from the spirit of the test as it was originally conceived. It is our contention that a test that better meets the original intent should be driven by the joint aims of creativity and collaboration.

3 | Taking Turing Seriously: An Alternative Approach

In order to emphasize the role of “unbounded creativity” in the evaluation of intelligence, we describe a Feigenbaum test—essentially a “micro-Turing-test” (Feigenbaum 2003)—restricted to the microdomain of analogies between letter-strings. For example, “If abc changes to abd, how would you change pxqxrx in ‘the same way’?” (or simply abc → abd; pxqxrx → ???, to use a bit of convenient shorthand). Problems in this domain have been the subject of extensive study (Hofstadter et al. 1995), resulting in the creation of the well-known Copycat model (Mitchell 1993) and its successor, Metacat (Marshall 1999). Although apparently highly restricted, problems in this domain can nonetheless exhibit surprising subtleties. We proceed to give some examples, described concretely in terms of the mechanisms of Copycat and Metacat, which are two instantiations of what we refer to more broadly as Fluid Concepts architectures.
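As a toy illustration of why such problems are subtler than they first appear (this is our own sketch, not Copycat's mechanism), consider two literal-minded rules for abc → abd; pxqxrx → ???. Reading the change as "advance the final letter" yields pxqxry, whereas reading pxqxrx as three two-letter groups analogous to the letters a, b, c yields pxqxsx, which is arguably closer to changing the string in "the same way".

```python
def successor_last_letter(s):
    # Shallow, literal reading of "abc -> abd": advance the final letter.
    return s[:-1] + chr(ord(s[-1]) + 1)

def successor_last_group(s, group_size):
    # A slightly deeper reading: treat s as equal-sized groups ("px",
    # "qx", "rx") analogous to the letters of "abc", and advance the
    # initial letter of the last group.
    groups = [s[i:i + group_size] for i in range(0, len(s), group_size)]
    last = groups[-1]
    groups[-1] = chr(ord(last[0]) + 1) + last[1:]
    return "".join(groups)
```

Both rules agree on abc itself, so a shallow probe cannot distinguish them; only a problem like pxqxrx exposes whether the respondent perceives the string's internal structure.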
Copycat, Metacat, and Fluid Concepts Architectures

Copycat’s architecture consists of three main components, all of which are common to the more general Fluid Concepts architectural scheme. These components are the Workspace, which is essentially the program’s working memory; the Slipnet, a conceptual network wi