On the diversity and frequency of code related to mathematical formulas in real-world Java projects

Abstract: In this paper, the term formula code refers to fragments of source code that implement a mathematical formula. We present empirical studies that analyze the diversity and frequency of formula code in open-source software projects. In an exploratory study, we investigated what kinds of formulas are implemented in real-world Java projects and derived syntactical patterns and constraints. We refined these patterns for sum and product formulas to automatically detect formula code in software archives and to reconstruct the implemented formula in mathematical notation. In a quantitative study of a large sample of engineered Java projects on GitHub, we analyzed the frequency of formula code and estimated that one in 700 lines of code in this sample implements a sum or product formula. For a sample of scientific-computing projects, we found that one in 100 lines of code implements a sum or product formula. To assess the need for tool support, we investigated how helpful comments are for program understanding in a sample of formula-code fragments and performed an online survey. Our findings provide first insights into the characteristics of formula code that can motivate further studies on the role of formula code in software projects and the design of formula-related tools.
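To make the notion of sum and product formula code concrete, the following is a minimal illustrative sketch in Java. It is not taken from the studied projects and does not reproduce the paper's detection patterns; the class name FormulaCodeExample and the methods sumOfSquares and factorial are hypothetical names chosen for illustration. Each loop accumulates into a single variable, which is the typical shape in which a sum such as \sum_{i=0}^{n-1} x_i^2 or a product such as \prod_{i=1}^{n} i appears in source code.

```java
public final class FormulaCodeExample {

    // Hypothetical sum formula code: s = \sum_{i=0}^{n-1} x_i^2
    public static double sumOfSquares(double[] x) {
        double s = 0.0;                 // accumulator starts at the neutral element of addition
        for (int i = 0; i < x.length; i++) {
            s += x[i] * x[i];           // one summand per loop iteration
        }
        return s;
    }

    // Hypothetical product formula code: p = \prod_{i=1}^{n} i
    public static long factorial(int n) {
        long p = 1L;                    // accumulator starts at the neutral element of multiplication
        for (int i = 1; i <= n; i++) {
            p *= i;                     // one factor per loop iteration
        }
        return p;
    }

    public static void main(String[] args) {
        System.out.println(sumOfSquares(new double[] {1.0, 2.0, 3.0})); // prints 14.0
        System.out.println(factorial(5));                               // prints 120
    }
}
```

In both methods the mathematical formula is recoverable from the loop bounds, the accumulator, and the update expression, which is the kind of information a reconstruction into mathematical notation would need.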
