Syntactic approximation using iterative lexical analysis

Syntactic irregularities, which often occur in source code undergoing maintenance, prevent the application of analysis and comprehension tools that rely on traditional parsing techniques. As an alternative to parsing, we have developed an iterative lexical technique based on the repeated application of regular expressions using a shortest-match strategy. The approach recognizes syntactic elements through iterative refinement: unambiguous constructs are identified first and then serve as contextual cues for identifying more ambiguous constructs. The shortest-match strategy supports the bottom-up construction of a syntax tree by identifying smaller subtrees first. To examine the technique's effectiveness, we present the results of an experiment comparing iterative lexical analysis against parsing, using precision and recall to evaluate the two approaches.
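The idea of iterative refinement with shortest matches can be illustrated with a small sketch. This is not the authors' implementation. The rule set, the `@LABEL:n@` placeholder format, and the names `RULES`, `approximate`, and `function_names` are assumptions made for illustration. Python's `re` module matches leftmost rather than shortest, so the patterns exclude nested delimiters to force each match onto the innermost (smallest) construct.

```python
import re

# Hypothetical rules, ordered from least to most ambiguous. Each pattern
# excludes nested delimiters, so it can only match the innermost construct.
RULES = [
    ("PARENS", re.compile(r"\([^(){}]*\)")),
    ("BLOCK", re.compile(r"\{[^{}]*\}")),
    # Control statements are reduced before functions, because both look
    # like "word ( ... ) { ... }" once their parts have been reduced.
    ("COND", re.compile(r"\b(?:if|while|for)\s*@PARENS:\d+@\s*@BLOCK:\d+@")),
    ("FUNC", re.compile(r"\b\w+\s*@PARENS:\d+@\s*@BLOCK:\d+@")),
]


def approximate(source):
    """Apply the rules repeatedly until no rule matches any more.

    Each match is replaced by a placeholder that refers to a node, so
    larger constructs are later built from already recognized smaller
    ones (bottom-up). Returns the residual text and the node list.
    """
    nodes = []  # nodes[i] = (label, text of the matched span)
    text = source
    changed = True
    while changed:
        changed = False
        for label, pattern in RULES:
            def reduce(match, label=label):
                nodes.append((label, match.group(0)))
                return f"@{label}:{len(nodes) - 1}@"

            text, count = pattern.subn(reduce, text)
            if count:
                changed = True
    return text, nodes


def function_names(nodes):
    """Return the leading identifier of every FUNC node, in node order."""
    return [re.match(r"\w+", span).group(0)
            for label, span in nodes if label == "FUNC"]


# The "???" stands in for a syntactic irregularity that would stop a parser.
SOURCE = "int f(int a) { if (a) { g(a); } return a; } ??? int h() { }"

residual, nodes = approximate(SOURCE)
print(residual)
print(function_names(nodes))
```

In this sketch the irregular fragment simply stays in the residual text while both function definitions are still recognized. The inner block of `f` is reduced in the first pass, and only then can the enclosing block and the function itself be matched in the second pass.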
