Introducing computational linguistics with NLTK (Natural Language Toolkit)

As computers have become more powerful and software has emerged that can process very large bodies of text, some linguists have come to rely on computers in their work. Similarly, as the web has become a central source of information, computer scientists have begun to rely on linguistic theory and practice in processing massive data sets. These interdisciplinary efforts have themselves produced interesting and novel problems for both disciplines. As a result, there are opportunities for computer science graduates to work on computational linguistics problems and for computer scientists to collaborate with linguists on research and education.

Further, it appears that 1) students with little or no background in mathematics or programming often have difficulty with the abstract (or computational) thinking that is a prerequisite for learning to program; 2) even students who have already chosen computer science as their major often have difficulty maintaining motivation and would benefit from working with "real-world" problems and large programs relatively early in their college careers; 3) many faculty would like to use and contribute to repositories of curricular materials; and 4) students (and faculty) usually like, and often need, to use free, open-source, and reliable software for their studies.

Working on computational linguistics problems in computer science classes using the Natural Language Toolkit (NLTK) provides one way to address these four issues:
1) Computational linguistics provides a way for students with little mathematical preparation to segue first into analytical thinking about something most of them know well (the English language) and then into thinking about higher-level abstractions.
2) The NLTK gives computer science students several examples of real-world problems and experience working with a large application program, both of which can motivate students who worry that their chosen field of study is not relevant to real-world problems.
3) The NLTK web site contains a large number of course materials, although most are at the upper-division undergraduate or graduate level.
4) The NLTK is available as open-source software, and the NLTK textbook is available under a Creative Commons license; there is no charge for their use.

The Natural Language Toolkit consists of open-source Python modules that run on multiple platforms, linguistic data (~60 corpora, toy grammars, trained models, etc.), and documentation for research and development in natural language processing and computational linguistics, including the textbook Natural Language Processing with Python: Analyzing Text with the Natural Language Toolkit by Steven Bird, Ewan Klein, and Edward Loper. Although these materials are available, and this educator found them reasonably reliable, getting started with new software and designing activities for students that fit an undergraduate curriculum can be a challenge.

This workshop is intended to introduce computer science faculty to NLTK. It will be conducted by a computer scientist and a linguist who collaborated on a CS0 computational linguistics class in Fall 2008. The facilitators will briefly describe their experience using NLTK, and participants will then work with a partner in the lab to complete an Introduction to NLTK. This introduction should be sufficient for participants to work through additional exercises from the facilitators' labs or to start developing their own labs.