Language engineering for the recovery of requirements from legacy documents

Legacy documents, such as requirements documents or manuals of business procedures, can sometimes be an important resource for determining which features of legacy software are redundant, which need to be retained and which can be reused. Such documents matter most where business change has dissipated human knowledge through staff turnover or redeployment. Exploiting legacy documents poses formidable problems, however, since they are often incomplete, poorly structured, poorly maintained and voluminous. This report proposes that language engineering, using tools that exploit probabilistic natural language processing (NLP) techniques, offers the potential to ease these problems. Such tools are available and mature, and have been proven in other domains. The report provides a review of NLP and a discussion of the components of probabilistic NLP techniques and their potential for requirements recovery from legacy documents. It concludes with a summary of the preliminary results of the adaptation and application of these techniques in the REVERE project.
