Software Document Terminology Recognition

Our goal in this paper is to automatically extract and classify key phrases from software development documents, such as requirements and specifications. In software development projects, building dictionaries that define the terminology in use is important for accurate communication between customers and vendors, as well as among developers. However, each target domain, such as medicine, finance, or transportation, has its own terminology; moreover, each customer uses its own terms with their own meanings. Building a dictionary for a target domain therefore requires expert knowledge of that domain and considerable effort. To assist in dictionary building, we are developing a software document terminology recognizer (SDTR) based on named entity recognition (NER) methods. A significant amount of research exists on NER; however, most of it focuses on general named entities, such as person names, or on named entities in the biological domain, such as compound names. The problem of building effective entity recognizers for a new domain in which very little supervised data is available remains understudied. Because software is used in many domains and organizations, there are many small domains, each with its own terminology. Moreover, building taggers with traditional supervised NER methods is impractical for an SDTR because the budget for tuning in an individual software development project is limited. A method for building an SDTR should cover cross-domain terminology using a small corpus; at the same time, the SDTR must cope with terminology that is highly specific to an individual project. In this paper, we propose a multi-layered SDTR system consisting of an identifier that uses general features based on phrase probabilities and spelling conventions, and an identifier that employs a temporary dictionary built automatically by the general-feature identifier.
Currently, our prototype achieves an F1 score greater than 0.8 on a small software development project corpus.
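
The two-layer idea described above can be illustrated with a minimal sketch. It is not the prototype itself: the function names, the regular expression for spelling conventions, the frequency-ratio threshold, and the background word probabilities are all assumptions made for illustration, and the sketch handles single tokens only and does not classify the extracted terms. It also assumes that the temporary dictionary is filled from the output of the general-feature layer.

```python
import re
from collections import Counter

# Assumed spelling-convention cue: CamelCase, ALL-CAPS, or snake_case tokens.
SPELLING_CUE = re.compile(
    r"[A-Z][a-z]+(?:[A-Z][a-z]+)+|[A-Z]{2,}|[a-z]+(?:_[a-z]+)+"
)


def tokenize(text):
    """Split text into alphabetic tokens (underscores are kept)."""
    return re.findall(r"[A-Za-z_]+", text)


def general_identifier(tokens, background, min_ratio=2.0):
    """Layer 1 (hypothetical): pick term candidates from general features.

    A token becomes a candidate if it matches a spelling cue, or if it occurs
    at least twice and is much more probable in this document than in the
    background distribution.
    """
    counts = Counter(tok.lower() for tok in tokens)
    total = sum(counts.values())
    candidates = set()
    for tok in tokens:
        key = tok.lower()
        doc_prob = counts[key] / total
        bg_prob = background.get(key, 1e-4)  # small floor for unseen words
        frequent = counts[key] >= 2 and doc_prob / bg_prob >= min_ratio
        if SPELLING_CUE.fullmatch(tok) or frequent:
            candidates.add(key)
    return candidates


def dictionary_identifier(tokens, dictionary):
    """Layer 2 (hypothetical): tag every token found in the temporary dictionary."""
    return [(i, tok) for i, tok in enumerate(tokens) if tok.lower() in dictionary]


def recognize(text, background):
    """Run layer 1 to build the temporary dictionary, then layer 2 to tag."""
    tokens = tokenize(text)
    temp_dict = general_identifier(tokens, background)
    return dictionary_identifier(tokens, temp_dict)


if __name__ == "__main__":
    sample = ("The OrderManager validates each order. "
              "The ordermanager then updates STOCK levels.")
    for position, token in recognize(sample, {"the": 0.2}):
        print(position, token)
```

In this sketch the second layer recovers the lowercase mention `ordermanager`, which carries no spelling cue of its own, because the first layer has already placed the CamelCase form in the temporary dictionary.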