A Decision Tree Method for Finding and Classifying Names in Japanese Texts

This paper describes a system which uses a decision tree to find and classify names in Japanese texts. The decision tree uses part-of-speech, character type, and special dictionary information to determine the probability that a particular type of name opens or closes at a given position in the text. The output is generated from the consistent sequence of name opens and name closes with the highest probability. This system does not require any human adjustment. Experiments indicate good accuracy with a small amount of training data, and demonstrate the system's portability. The issues of training data size and domain dependency are discussed.

1 Introduction

For some NLP applications, it is important to identify "named entities" (NE), such as person names, organization names, and time, date, or money expressions in the text. For example, in information extraction systems, it is crucial to identify them in order to provide the knowledge to be extracted, and in machine translation systems, they are useful for creating translations of unknown words or for disambiguation. However, it is not easy to identify these names, because they involve unknown words, and hence the strategy of listing candidates won't work. Also, it is sometimes hard to determine the category of a proper noun, for example to distinguish a person name from a company name. These phenomena often differ from domain to domain. One domain may use a special pattern which is not found in other domains.

In this paper, we will present a supervised learning system which finds and classifies named entities in Japanese newspaper texts. Recently, several systems have been proposed for this task, but many of them use hand-coded patterns. Creating these patterns is laborious work, and adapting these systems to a new domain or a new definition of named entities is likely to need a large amount of additional work.
On the other hand, in a supervised learning system, all that is needed to adapt the system is to make new training data. While this is also not a very easy task, it would be easier than creating complicated rules. For example, based on our experience, 100 training articles can be created in a day.

There also have been several machine learning systems applied to this task. However, these either 1) partially need hand-made rules, 2) have parameters which must be adjusted by hand, or 3) do not perform well by fully automatic means. Our system does not work fully automatically and also needs special dictionaries, but performs well and does not have parameters to be adjusted by hand. We will discuss one of the related systems in a later section.

The issue of training data size will be discussed based on experiments using different sizes of training data. In order to demonstrate the portability of our system, we ran the system on a new domain with a new type of named entity. The experiment shows that the portability of the system is quite good and the performance is satisfactory.
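
To make the idea from the abstract concrete, the step of picking "the consistent sequence of name opens and name closes with the highest probability" can be illustrated with a small Viterbi-style search. The following Python sketch is only an illustration under stated assumptions, not the system's actual implementation: the tag names (NONE, OPEN-x, IN-x, CLOSE-x, SINGLE-x), the function names, and the hard-coded probabilities are all hypothetical. In a real system the per-token probabilities would come from the decision tree.

```python
import math


def allowed(prev, cur):
    """Return True if tag `cur` may directly follow tag `prev`.

    Hypothetical tag scheme: "NONE", or KIND-TYPE with KIND in
    OPEN, IN, CLOSE, SINGLE (SINGLE = a one-token name).
    """
    prev_kind, _, prev_type = prev.partition("-")
    cur_kind, _, cur_type = cur.partition("-")
    if prev_kind in ("OPEN", "IN"):
        # A name is still open: it must continue or close with the same type.
        return cur_kind in ("IN", "CLOSE") and cur_type == prev_type
    # No name is open: stay outside, open a name, or emit a one-token name.
    return cur_kind in ("NONE", "OPEN", "SINGLE")


def best_consistent_sequence(token_probs):
    """Return the most probable consistent tag sequence, or None.

    token_probs: one dict per token, mapping tag -> probability
    (assumed here to be supplied by some classifier).
    """
    # Virtual start state: before the first token, no name is open.
    best = {"NONE": (0.0, [])}
    for probs in token_probs:
        new_best = {}
        for cur, p in probs.items():
            if p <= 0.0:
                continue
            candidates = [
                (score + math.log(p), path + [cur])
                for prev, (score, path) in best.items()
                if allowed(prev, cur)
            ]
            if candidates:
                new_best[cur] = max(candidates, key=lambda c: c[0])
        best = new_best
    # A consistent sequence may not end with a name still open.
    finals = [v for tag, v in best.items()
              if not tag.startswith(("OPEN", "IN"))]
    if not finals:
        return None
    return max(finals, key=lambda c: c[0])[1]


# Made-up probabilities for three tokens. Taking the best tag per token
# independently would give OPEN-PERSON, NONE, NONE, which never closes
# the name; the search returns a consistent sequence instead.
example = [
    {"NONE": 0.40, "OPEN-PERSON": 0.60},
    {"NONE": 0.55, "CLOSE-PERSON": 0.45},
    {"NONE": 1.00},
]
print(best_consistent_sequence(example))
```

The point of the sketch is the consistency constraint: a sequence in which a name opens but never closes, or closes as a different type than it opened, is never considered, no matter how probable its individual tags are.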