Teaching machines to understand data science code by semantic enrichment of dataflow graphs

Your computer is continuously executing programs, but does it really understand them? Not in any meaningful sense. That burden falls upon human knowledge workers, who are increasingly asked to write and understand code. They deserve to have intelligent tools that reveal the connections between code and its subject matter. Towards this prospect, we develop an AI system that forms semantic representations of computer programs, using techniques from knowledge representation and program analysis. To create the representations, we introduce an algorithm for enriching dataflow graphs with semantic information. The semantic enrichment algorithm is undergirded by a new ontology language for modeling computer programs and a new ontology about data science, written in this language. Throughout the paper, we focus on code written by data scientists and we locate our work within a larger movement towards collaborative, open, and reproducible science.
