A Joint Introduction to Natural Language Processing and to Deep Learning

In this chapter, we set up the fundamental framework for the book. We first provide an introduction to the basics of natural language processing (NLP) as an integral part of artificial intelligence. We then survey the historical development of NLP, spanning over five decades, in terms of three waves. The first two waves arose as rationalism and empiricism, paving the way for the current deep learning wave. The key pillars underlying the deep learning revolution for NLP consist of (1) distributed representations of linguistic entities via embedding, (2) semantic generalization due to the embedding, (3) long-span deep sequence modeling of natural language, (4) hierarchical networks effective for representing linguistic levels from low to high, and (5) end-to-end deep learning methods to jointly solve many NLP tasks. After the survey, several key limitations of current deep learning technology for NLP are analyzed. This analysis leads to five research directions for future advances in NLP.

1.1 Natural Language Processing: The Basics

Natural language processing (NLP) investigates the use of computers to process or to understand human (i.e., natural) languages for the purpose of performing useful tasks. NLP is an interdisciplinary field that combines computational linguistics, computing science, cognitive science, and artificial intelligence. From a scientific perspective, NLP aims to model the cognitive mechanisms underlying the understanding and production of human languages. From an engineering perspective, NLP is concerned with how to develop novel practical applications to facilitate the interactions between computers and human languages. Typical applications in NLP include speech recognition, spoken language understanding, dialogue systems, lexical analysis, parsing, machine translation, knowledge graph, information retrieval, question answering, sentiment analysis, social computing, natural language generation, and natural language summarization. These NLP application areas form the core content of this book.

Natural language is a system constructed specifically to convey meaning or semantics, and is by its fundamental nature a symbolic or discrete system. The surface or observable "physical" signal of natural language is called text, always in a symbolic form. The text "signal" has its counterpart, the speech signal, which can be regarded as the continuous correspondence of symbolic text; both entail the same latent linguistic hierarchy of natural language. From NLP and signal processing perspectives, speech can be treated as a "noisy" version of text, imposing the additional difficulty of requiring "de-noising" when performing the task of understanding the common underlying semantics. Chapters 2 and 3, as well as the current Chap. 1 of this book, cover the speech aspect of NLP in detail, while the remaining chapters start directly from text in discussing a wide variety of text-oriented tasks that exemplify the pervasive NLP applications enabled by machine learning techniques, notably deep learning.

The symbolic nature of natural language is in stark contrast to the continuous nature of language's neural substrate in the human brain. We will defer this discussion to Sect. 1.6 of this chapter, when discussing future challenges of deep learning in NLP.
A related contrast is how the symbols of natural language are encoded in several continuous-valued modalities, such as gesture (as in sign language), handwriting (as an image), and, of course, speech. On the one hand, the word as a symbol is used as a "signifier" to refer to a concept or a thing in the real world as a "signified" object, necessarily a categorical entity. On the other hand, the continuous modalities that encode symbols of words constitute the external signals sensed by the human perceptual system and transmitted to the brain, which in turn operates in a continuous fashion. While of great theoretical interest, the subject of contrasting the symbolic nature of language with its continuous rendering and encoding goes beyond the scope of this book.

In the next few sections, we outline and discuss, from a historical perspective, the development of the general methodology used to study NLP as a rich interdisciplinary field. Much like several closely related sub- and super-fields such as conversational systems, speech recognition, and artificial intelligence, the development of NLP can be described in terms of three major waves (Deng 2017; Pereira 2017), each of which is elaborated in a separate section next.

1.2 The First Wave: Rationalism

NLP research in its first wave lasted for a long time, dating back to the 1950s. In 1950, Alan Turing proposed the Turing test to evaluate a computer's ability to exhibit intelligent behavior indistinguishable from that of a human (Turing 1950). This test is based on natural language conversations between a human and a computer designed to generate human-like responses. In 1954, the Georgetown-IBM experiment demonstrated the first machine translation system capable of translating more than 60 Russian sentences into English.

These approaches, based on the belief that knowledge of language in the human mind is fixed in advance by genetic inheritance, dominated most of NLP research between about 1960 and the late 1980s. They have been called rationalist approaches (Church 2007). The dominance of rationalist approaches in NLP was mainly due to the widespread acceptance of Noam Chomsky's arguments for an innate language structure and his criticism of N-grams (Chomsky 1957). Postulating that key parts of language are hardwired in the brain at birth as a part of the human genetic inheritance, rationalist approaches endeavored to design hand-crafted rules to incorporate knowledge and reasoning mechanisms into intelligent NLP systems. Up until the 1980s, the most notably successful NLP systems, such as ELIZA for simulating a Rogerian psychotherapist and MARGIE for structuring real-world information into concept ontologies, were based on complex sets of handwritten rules. This period coincided approximately with the early development of artificial intelligence, characterized by expert knowledge engineering, in which domain experts devised computer programs according to the knowledge about the (very narrow) application domains they had (Nilsson 1982; Winston 1993). The experts designed these programs using symbolic logical rules based on careful representations and engineering of such knowledge.
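To give a concrete flavor of such hand-crafted symbolic rules, the short Python sketch below applies "if-then" rules to a set of known facts by simple forward chaining, the style of rule-based inference used in the expert systems described next. It is only an illustrative toy under assumed rule contents: the facts, rule names, and the forward_chain helper are invented for this sketch and are not drawn from any historical system.

# Toy sketch of hand-crafted "if-then" rules applied by forward chaining.
# All facts and rule contents below are invented for illustration only.

RULES = [
    # (conditions that must all hold, fact to conclude)
    ({"utterance_mentions_flight", "utterance_mentions_city"}, "intent_is_flight_booking"),
    ({"intent_is_flight_booking", "no_date_given"}, "ask_for_travel_date"),
]

def forward_chain(facts, rules):
    """Repeatedly fire any rule whose conditions are satisfied until no new facts appear."""
    facts = set(facts)
    changed = True
    while changed:
        changed = False
        for conditions, conclusion in rules:
            if conditions <= facts and conclusion not in facts:
                facts.add(conclusion)
                changed = True
    return facts

if __name__ == "__main__":
    observed = {"utterance_mentions_flight", "utterance_mentions_city", "no_date_given"}
    # Prints the observed facts plus the two concluded ones.
    print(sorted(forward_chain(observed, RULES)))

Every behavior of such a system is traceable to an explicit rule written by a human expert, which is the source of both the transparency and the brittleness discussed below.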
These knowledge-based artificial intelligence systems tend to be effective in solving narrow-domain problems by examining the "head", or most important, parameters and reaching a solution about the appropriate action to take in each specific situation. These "head" parameters are identified in advance by human experts, leaving the "tail" parameters and cases untouched. Since such systems lack learning capability, they have difficulty generalizing their solutions to new situations and domains.

The typical approach during this period is exemplified by the expert system, a computer system that emulates the decision-making ability of a human expert. Such systems are designed to solve complex problems by reasoning about knowledge (Nilsson 1982). The first expert system was created in the 1970s, and such systems proliferated in the 1980s. The main "algorithm" used was inference rules in the form of "if-then-else" (Jackson 1998). The main strength of these first-generation artificial intelligence systems was their transparency and interpretability in their (limited) capability to perform logical reasoning. Like NLP systems such as ELIZA and MARGIE, the general expert systems of the early days used hand-crafted expert knowledge, which was often effective in narrowly defined problems, although the reasoning could not handle the uncertainty that is ubiquitous in practical applications.

In the specific NLP application areas of dialogue systems and spoken language understanding, to be described in more detail in Chaps. 2 and 3 of this book, such rationalistic approaches were represented by the pervasive use of symbolic rules and templates (Seneff et al. 1991). The designs were centered on grammatical and ontological constructs, which, while interpretable and easy to debug and update, experienced severe difficulties in practical deployment. When such systems worked, they often worked beautifully; unfortunately, this did not happen very often, and the domains were necessarily limited.

Likewise, speech recognition research and system design, another long-standing NLP and artificial intelligence challenge, were during this rationalist era based heavily on the paradigm of expert knowledge engineering, as elegantly analyzed in (Church and Mercer 1993). During the 1970s and early 1980s, the expert system approach to speech recognition was quite popular (Reddy 1976; Zue 1985). However, the lack of abilities to learn from data and to handle uncertainty in reasoning was acutely recognized by researchers, leading to the second wave of speech recognition, NLP, and artificial intelligence, described next.

1.3 The Second Wave: Empiricism

The second wave of NLP was characterized by the exploitation of data corpora and of (shallow) machine learning, statistical or otherwise, to make use of such data (Manning and Schütze 1999). As much of the structure of and theory about natural language were discounted or discarded in favor of data-driven methods, the main approaches developed during this era have been called empirical or pragmatic ones (Church and Mercer 1993; Church 2014). With the increasing availability of mac