Corpus-Based Rules

As we saw in the previous chapter, we can use rules to select an appropriate tag for each token. We continue investigating the use of rules in this chapter. However, whereas the rules in the previous chapter were created manually, based on someone’s linguistic knowledge and familiarity with properties of the corpus, here we explore the possibility of learning tagging rules automatically. A potential advantage of automatic rule learning is that such a system could, in theory, be highly portable across both domains and languages: if training material is available, the system can be retrained with little or no human intervention. A limitation of this approach is that such a system can only learn facts that can be expressed in the learner’s prespecified descriptive language, which restricts the types of rules it can learn. For example, a person might discover that a word tends to receive one particular tag when it occurs toward the end of a sentence. If the learner has no access to the concepts of sentence length and position within a sentence, discovering such a heuristic rule is beyond the capability of the learning algorithm. One thing that differentiates this approach from other machine learning approaches, such as training neural networks (cf. Chapter 17) or hidden Markov models (HMMs; cf. Chapter 16), is that the learned information is in a form that people can understand, edit, and improve, just as is the case for manually written rules.
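
The limitation imposed by the learner's descriptive language can be made concrete with a small sketch. The Python code below is only an illustration under stated assumptions, not the learning algorithm developed in this chapter: the four-sentence corpus, the tag labels, and the names `prev_tag`, `is_sentence_final`, and `best_rule_accuracy` are all invented for this example. Each template function stands for one concept available to the learner, and the learner may only form rules of the shape "if the template yields value v, assign tag t".

```python
from collections import Counter, defaultdict

# Toy tagged corpus, invented for this illustration only.
CORPUS = [
    [("pick", "VERB"), ("it", "PRON"), ("up", "PART")],
    [("give", "VERB"), ("up", "PART")],
    [("walk", "VERB"), ("up", "PREP"), ("hills", "NOUN")],
    [("carry", "VERB"), ("it", "PRON"), ("up", "PREP"), ("hills", "NOUN")],
]

# Hypothetical templates: each maps (sentence, index) to a feature value.
# The set of templates is the learner's descriptive language.
def prev_tag(sentence, i):
    """Tag of the preceding token, or a start marker at position 0."""
    return sentence[i - 1][1] if i > 0 else "<START>"

def is_sentence_final(sentence, i):
    """True if the token is the last one in its sentence."""
    return i == len(sentence) - 1

def best_rule_accuracy(corpus, word, template):
    """Training accuracy of the best 'feature value -> tag' rules for one word.

    Assumes the word occurs at least once in the corpus.
    """
    tags_by_value = defaultdict(Counter)
    for sentence in corpus:
        for i, (token, tag) in enumerate(sentence):
            if token == word:
                tags_by_value[template(sentence, i)][tag] += 1
    total = sum(sum(counts.values()) for counts in tags_by_value.values())
    # For each feature value, the best rule picks the majority tag.
    correct = sum(max(counts.values()) for counts in tags_by_value.values())
    return correct / total

print(best_rule_accuracy(CORPUS, "up", prev_tag))           # 0.5
print(best_rule_accuracy(CORPUS, "up", is_sentence_final))  # 1.0
```

In this toy data, the tag of "up" depends on whether it ends the sentence. A learner restricted to the preceding-tag template can do no better than chance on this word, while a learner that also has the sentence-final template recovers the regularity completely. The learner cannot invent the second template on its own; someone has to include it in the descriptive language beforehand.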