Generating Data Analysis Programs from Statistical Models

Extracting information from data, often called data analysis, is an important scientific task. Statistical approaches, which use methods from probability theory and numerical analysis, are well-founded but difficult to implement: developing a statistical data analysis program for a given application is time-consuming and requires knowledge and experience in several areas. In this paper, we describe AUTOBAYES, a high-level system that generates data analysis programs from statistical models. A statistical model specifies the properties of each problem variable (i.e., observation or parameter) and its dependencies in the form of a probability distribution. It is thus a fully declarative problem description, similar in spirit to a set of differential equations. From this model, AUTOBAYES generates optimized and fully commented C/C++ code that can be linked dynamically into the Matlab and Octave environments. Code is generated by schema-guided deductive synthesis. A schema consists of a code template and applicability constraints, which are checked against the model during synthesis using theorem-proving technology. AUTOBAYES augments schema-guided synthesis with symbolic-algebraic computation and can thus derive closed-form solutions for many problems. We outline the AUTOBAYES system and its theoretical foundations in Bayesian probability theory, and illustrate its application by means of a detailed example.
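To make the notion of a schema more concrete, the following is a minimal sketch, not the AUTOBAYES implementation. It assumes a toy model description (the hypothetical `Model` class), a schema represented as a code template guarded by a simple predicate (the hypothetical `Schema` class), and a `synthesize` function that instantiates the first applicable template. The real system checks applicability constraints with a theorem prover and emits C/C++; here a plain Python predicate and Python source text stand in for both. The example schema encodes the well-known closed-form maximum-likelihood estimates for i.i.d. Gaussian data.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence, Tuple


@dataclass
class Model:
    """Hypothetical declarative model: i.i.d. data from one named distribution."""
    distribution: str            # e.g. "gauss"
    unknowns: Tuple[str, ...]    # names of the parameters to estimate


@dataclass
class Schema:
    """Hypothetical schema: a code template plus an applicability constraint."""
    name: str
    applies: Callable[[Model], bool]
    template: str                # source text of the generated estimator


# Closed-form maximum-likelihood estimates for x_i ~ N(mu, sigma^2).
GAUSS_ML = Schema(
    name="gauss_closed_form_ml",
    applies=lambda m: m.distribution == "gauss"
    and set(m.unknowns) == {"mu", "sigma"},
    template=(
        "def estimate(x):\n"
        "    # ML estimate of the mean: the sample average\n"
        "    n = len(x)\n"
        "    mu = sum(x) / n\n"
        "    # ML estimate of sigma: root of the mean squared deviation\n"
        "    sigma = (sum((xi - mu) ** 2 for xi in x) / n) ** 0.5\n"
        "    return mu, sigma\n"
    ),
)


def synthesize(model: Model, schemas: Sequence[Schema]) -> Optional[Callable]:
    """Return an estimator built from the first applicable schema, or None."""
    for schema in schemas:
        if schema.applies(model):
            namespace: dict = {}
            exec(schema.template, namespace)  # instantiate the code template
            return namespace["estimate"]
    return None


# Usage: describe the model declaratively, then run the generated code.
estimate = synthesize(Model("gauss", ("mu", "sigma")), [GAUSS_ML])
mu, sigma = estimate([1.0, 2.0, 3.0])
print(mu, sigma)
```

A model that no schema covers (for example, a Poisson model in this toy library) makes `synthesize` return `None`. In a fuller system, this is the point at which other schemas, such as iterative numerical methods, would be tried.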
