Locating software bugs is a difficult task, especially if they do not lead to crashes. Current research on automating non-crashing bug detection typically involves collecting function call traces, representing them as graphs, and reducing the graphs before applying a subgraph mining algorithm. A ranking of potentially buggy functions is then derived using frequency statistics for each node (function) in the correct and incorrect sets of traces. Although most existing techniques are effective, they do not scale well. To address this issue, this paper suggests reducing the graph dataset in order to isolate the graphs that are significant in localizing bugs. To this end, we propose the use of tree edit distance algorithms to identify the traces that are close to each other while belonging to different sets. The scalability of two proposed algorithms, an exact one and a faster approximate one, is evaluated using a dataset derived from a real-world application. Finally, although the main scope of this work lies in scalability, the results indicate that there is no compromise in effectiveness.

In this paper, we present a novel approach towards highly scalable Graph Mining solutions for function-level traces. The main contribution lies in the problem formulation, the reduction of the call trace dataset size through different alternatives, and the construction of a realistic dataset to test upon. Dataset size reduction is addressed using tree edit distance algorithms, and the potential benefits and drawbacks with respect to different solutions are discussed. Furthermore, the applicability of several function-level dynamic bug detection techniques in real applications is discussed, and the efficiency and effectiveness of our variations are evaluated against them.

Section II of the paper reviews current literature on function-level dynamic bug detection, illustrating the general procedure followed to mine the traces and identifying the Graph Mining problems. Section III provides an overview of alternative solutions to known scalability issues. The construction of a realistic dataset that illustrates our contribution is explained in Section IV. Finally, our implementation is evaluated in terms of efficiency and effectiveness in Section V, while Section VI concludes the paper and provides insight for further research.
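The dataset reduction idea described above, keeping only those correct traces that lie close to some incorrect trace, can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: call traces are modelled as ordered trees of `(label, children)` tuples, the distance is a plain memoised forest recursion with unit costs (not the exact or approximate algorithm evaluated in this paper), and the selection rule in the hypothetical helper `reduce_dataset` (keep the `k` nearest correct traces per incorrect trace) is only one possible reading of "close to each other while belonging to different sets".

```python
from functools import lru_cache

# Assumed representation: a tree is (label, children), where children is a
# tuple of trees; a forest is a tuple of trees. Tuples keep everything hashable.

def size(forest):
    """Number of nodes in a forest."""
    return sum(1 + size(children) for _, children in forest)

@lru_cache(maxsize=None)
def forest_distance(f, g):
    """Unit-cost edit distance between two ordered forests."""
    if not f:
        return size(g)              # insert every remaining node of g
    if not g:
        return size(f)              # delete every remaining node of f
    (v_label, v_children), (w_label, w_children) = f[-1], g[-1]
    # Delete the rightmost root of f; its children take its place.
    delete = forest_distance(f[:-1] + v_children, g) + 1
    # Insert the rightmost root of g.
    insert = forest_distance(f, g[:-1] + w_children) + 1
    # Map the two rightmost roots onto each other (relabel if they differ).
    match = (forest_distance(v_children, w_children)
             + forest_distance(f[:-1], g[:-1])
             + (v_label != w_label))
    return min(delete, insert, match)

def tree_edit_distance(t1, t2):
    return forest_distance((t1,), (t2,))

def reduce_dataset(correct, incorrect, k=1):
    """Hypothetical rule: keep the k correct traces nearest to each incorrect one."""
    kept = set()
    for bad in incorrect:
        ranked = sorted(range(len(correct)),
                        key=lambda i: tree_edit_distance(bad, correct[i]))
        kept.update(ranked[:k])
    return [correct[i] for i in sorted(kept)], list(incorrect)

# Toy traces: main calls a and b; main calls a and c; main calls x, which calls y and z.
trace_ab = ("main", (("a", ()), ("b", ())))
trace_ac = ("main", (("a", ()), ("c", ())))
trace_far = ("main", (("x", (("y", ()), ("z", ()))),))

reduced_correct, reduced_incorrect = reduce_dataset(
    correct=[trace_ab, trace_far], incorrect=[trace_ac], k=1)
```

In this toy run the incorrect trace `trace_ac` differs from `trace_ab` by a single relabel, so only `trace_ab` survives the reduction and `trace_far` is dropped before any subgraph mining would take place. The recursion is meant for small illustrative trees; realistic traces would call for a dedicated exact or approximate algorithm.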