This paper aims to automatically extract and classify key phrases from software development documents such as requirements and specifications. In software development projects, dictionaries that define the terminology in use are important for accurate communication between customers and vendors, as well as among developers. However, each target domain, such as medicine, finance, or transportation, has its own terminology; moreover, each customer uses its own terms with their own meanings. Building a dictionary for a target domain therefore requires expert knowledge of that domain and considerable effort. To assist in dictionary building, we are developing a software document terminology recognizer (SDTR) based on named entity recognition (NER) methods. A significant amount of research exists on NER, but most of it focuses on general named entities, such as person names, or on biological named entities, such as names of compounds. The problem of building effective entity recognizers for a new domain in which very little supervised data is available remains understudied. Because software is used in many domains and organizations, there are many small domains, each with its own terminology. Moreover, building taggers with traditional supervised NER methods is impractical for an SDTR because the tuning budget of an individual software development project is limited. A method for building an SDTR should therefore cover cross-domain terminology using a small corpus while still coping with the highly specific terminology of individual projects. In this paper, we propose a multi-layered SDTR system consisting of an identifier that uses general features based on the probability of phrases and spelling conventions, and an identifier that employs a temporary dictionary automatically built into the general feature identifier. Currently, our prototype achieves an F1 score above 0.8 on a small software development project corpus.
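The two-layer idea can be illustrated with a minimal Python sketch. It is not the system evaluated in the paper: the function names, the feature weights, the threshold, and the use of single tokens instead of phrases are all assumptions made for illustration, and the sketch assumes that the temporary dictionary is filled from the output of the general-feature layer. It performs extraction only, not classification.

```python
import re
from collections import Counter

TOKEN_RE = re.compile(r"[A-Za-z][A-Za-z0-9]*")


def spelling_score(token, sentence_initial):
    """Hypothetical spelling-convention feature in [0, 1]."""
    if len(token) >= 2 and token.isupper():
        return 1.0  # acronym, e.g. "SDTR"
    if re.search(r"[a-z][A-Z]", token):
        return 1.0  # CamelCase, e.g. "OrderManager"
    if token[0].isupper() and not sentence_initial:
        return 0.6  # capitalized in mid-sentence
    return 0.0


def general_identifier(sentences, threshold=0.5):
    """Layer 1: score tokens with general, domain-independent features."""
    tokenized = [TOKEN_RE.findall(s) for s in sentences]
    counts = Counter(t.lower() for toks in tokenized for t in toks)
    max_count = max(counts.values(), default=1)
    terms = set()
    for toks in tokenized:
        for i, tok in enumerate(toks):
            # Relative frequency stands in for the phrase-probability feature.
            freq = counts[tok.lower()] / max_count
            # Weights 0.8 / 0.2 are arbitrary illustrative choices.
            score = 0.8 * spelling_score(tok, i == 0) + 0.2 * freq
            if score >= threshold:
                terms.add(tok)
    return terms


def dictionary_identifier(sentences, temp_dictionary):
    """Layer 2: tag every token that matches the temporary dictionary."""
    lookup = {t.lower() for t in temp_dictionary}
    return [
        [tok for tok in TOKEN_RE.findall(s) if tok.lower() in lookup]
        for s in sentences
    ]


def recognize(sentences):
    """Run layer 1, then reuse its output as a project-specific dictionary."""
    temp_dictionary = general_identifier(sentences)
    return dictionary_identifier(sentences, temp_dictionary)


if __name__ == "__main__":
    docs = [
        "The SDTR reads each Invoice.",
        "Every invoice is checked by the sdtr.",
    ]
    print(recognize(docs))
```

In this sketch the first layer only finds mentions that follow a spelling convention, while the second layer also recovers lowercase mentions of the same terms, which is one way a temporary dictionary can adapt a general identifier to project-specific vocabulary.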